Making sense of Digital and Physical Documents using ML and Optical Character Recognition

Jul 2017

27 Thu 08:15 AM – 10:00 PM IST

28 Fri 08:15 AM – 06:25 PM IST

MLR Convention Centre, Whitefield, Bengaluru

Submitted Jun 10, 2017

Section: Full talk for data engineering track

Technical level: Intermediate

Have you ever wondered what you could do with the piece of paper you are handed when you make a purchase at your local grocery store, fill up your car’s tank, see a doctor when you are ill, or go to a loan provider for a quick loan?

People usually throw away this piece of paper, called a receipt or bill, thinking it’s just another piece of junk.

But our startup has realized that it is no longer just a piece of paper: it holds the key to untapped data that can empower consumers and businesses equally and help with intelligent decision making, forecasting, boosting sales, improving diagnosis and much more.

In this session we will share how Optical Character Recognition and Machine Learning can have a significant impact in making sense of digital data that is not directly available in a format an ML pipeline can process.

As of now, no single solution exists that tackles the problem of extracting data from such physical or digital documents and applying Machine Learning to bring out intelligent insights.

Outline

This will be a walkthrough and demo session in which raw data will be sourced in real time and the different pieces of the solution will be discussed, namely:

Optical Character Recognition
Data extraction and Text Parsing
Data Storage
Getting Predictions and Recommendations out of ML models
End user Visualization

The software used is mainly open source or free, such as Python, MySQL, Redis and Spark 2.0; the OCR engine used is ABBYY FineReader. We will spend 5 minutes on this.
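As a rough illustration of the "Data extraction and Text Parsing" step, the sketch below turns OCR output for a grocery receipt into structured line items. It is not the solution shown in the talk: the receipt layout (item name, quantity, amount on one line), the `LineItem` type and the `parse_receipt` function are all assumptions made for this example, and real OCR output would need more tolerant rules.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class LineItem:
    name: str
    quantity: float
    amount: float


# Assumed (hypothetical) line layout: "<item name> <quantity> <amount>",
# for example "SUGAR 1KG    2   90.00". Real receipts vary widely.
LINE_PATTERN = re.compile(
    r"^(?P<name>.+?)\s+(?P<qty>\d+(?:\.\d+)?)\s+(?P<amount>\d+\.\d{2})$"
)


def parse_receipt(ocr_text: str) -> List[LineItem]:
    """Keep only the lines that look like purchased items."""
    items = []
    for raw_line in ocr_text.splitlines():
        match = LINE_PATTERN.match(raw_line.strip())
        if match is None:
            continue  # headers, dates and totals do not fit the layout
        items.append(
            LineItem(
                name=match["name"],
                quantity=float(match["qty"]),
                amount=float(match["amount"]),
            )
        )
    return items


# Made-up OCR text standing in for the output of an OCR engine.
sample_text = """FRESH MART
Date: 12/06/2017
SUGAR 1KG    2   90.00
TOOR DAL 500G 1  65.50
TOTAL            155.50
"""

items = parse_receipt(sample_text)
for item in items:
    print(item)
```

The parsed items are plain records, so they could then be written to whichever store the pipeline uses before any model sees them.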

We will pick 3 use cases from Retail, Finance and Healthcare and demonstrate how the solution works.

Retail -- 10 min

From a Consumer’s Perspective -

You bought some groceries, and 10 days later you realize the sugar has run out. Then comes the sudden awakening: did you even buy sugar on your last trip?

If you are single, maybe some days can be spent basking in the glory of being healthy by cutting down on sugar, but if you are married you don’t have a choice :) ! Guilt-ridden, you buy twice the usual quantity, and even if it is only the usual quantity, it will last a while and sit as inventory at your home.

Now, you buy multiple items in a single visit, and the grocery store records each of them at the SKU (stock keeping unit) level.

The business problem we are trying to solve is - Is there some way to keep track of your purchases using the information pulled from your grocery bills, and let the data tell you when, and at what SKU level, you need to buy grocery products? And is it possible to optimize monthly spending?
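One simple way to picture "letting the data speak" is to estimate, per SKU, when the next purchase is likely to fall from the gaps between past purchases. The sketch below uses the mean gap as a stand-in for a real forecasting model; the `next_purchase_dates` function, the SKU codes and the purchase history are all invented for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean


def next_purchase_dates(purchases):
    """Estimate the next purchase date per SKU from the mean gap between purchases.

    purchases: iterable of (sku, purchase_date) pairs taken from parsed bills.
    SKUs seen on fewer than two distinct dates are skipped.
    """
    dates_by_sku = defaultdict(list)
    for sku, day in purchases:
        dates_by_sku[sku].append(day)

    forecast = {}
    for sku, days in dates_by_sku.items():
        days = sorted(set(days))
        if len(days) < 2:
            continue  # not enough history to measure a gap
        gaps = [(later - earlier).days for earlier, later in zip(days, days[1:])]
        forecast[sku] = days[-1] + timedelta(days=round(mean(gaps)))
    return forecast


# Made-up purchase history for two hypothetical SKUs.
history = [
    ("SUGAR-1KG", date(2017, 4, 2)),
    ("SUGAR-1KG", date(2017, 5, 1)),
    ("SUGAR-1KG", date(2017, 6, 1)),
    ("TOOR-DAL-500G", date(2017, 5, 20)),
]

forecast = next_purchase_dates(history)
print(forecast)
```

A mean gap ignores quantity bought and seasonality, so it is only a baseline against which a trained model could be compared.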

Finance -- 10 min

Let’s think from a Business perspective -

Suppose you are a lending company and there is an untapped market for lending in segments such as small-ticket loans and non-salaried people. There are very few sources for judging an individual’s loan eligibility, such as monthly bank account statements and documents for any existing loans.

As an average-sized lending company, you will need to make sense of the monthly bank account statements of at least 20-30 people a day (at a city level, conservatively), and any error in manual scanning and interpretation of documents means lost business and a bad reputation.

Take this to Tier 2 and Tier 3 cities in a developing economy such as India, and you have a massive untapped pool of data ready to be explored.

The business problem we are trying to solve is - Is there some intelligent way to extract data from these documents and apply Machine Learning models to determine the loan eligibility of the individual and profile their risk?
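Before any model can judge eligibility, the transactions extracted from a bank statement have to be turned into features. The sketch below shows one possible set of inputs: average monthly inflow, average monthly outflow and their ratio. The function names, the feature choice and the sample transactions are assumptions for illustration, not the features or scoring approach used in the talk.

```python
from collections import defaultdict


def monthly_cash_flow(transactions):
    """Total inflow and outflow per month.

    transactions: iterable of (iso_date, amount) pairs, where iso_date is
    'YYYY-MM-DD' and amount is positive for credits, negative for debits.
    """
    totals = defaultdict(lambda: {"inflow": 0.0, "outflow": 0.0})
    for iso_date, amount in transactions:
        month = iso_date[:7]  # 'YYYY-MM'
        if amount >= 0:
            totals[month]["inflow"] += amount
        else:
            totals[month]["outflow"] += -amount
    return dict(totals)


def statement_features(transactions):
    """Summarize a statement into a few numbers a model could consume."""
    months = monthly_cash_flow(transactions)
    if not months:
        raise ValueError("no transactions to summarize")
    avg_inflow = sum(m["inflow"] for m in months.values()) / len(months)
    avg_outflow = sum(m["outflow"] for m in months.values()) / len(months)
    return {
        "months": len(months),
        "avg_inflow": avg_inflow,
        "avg_outflow": avg_outflow,
        # Share of income that is spent; None when there is no inflow at all.
        "outflow_ratio": avg_outflow / avg_inflow if avg_inflow else None,
    }


# Made-up transactions standing in for rows parsed from a statement.
sample_transactions = [
    ("2017-04-03", 20000.0),
    ("2017-04-10", -12000.0),
    ("2017-05-02", 22000.0),
    ("2017-05-15", -9000.0),
]

features = statement_features(sample_transactions)
print(features)
```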

Healthcare -- 10 min

Another striking use case is in the Healthcare industry -

It is well known that a patient’s history is very important in assessing their ailments and deciding on an optimal course of treatment. Patients try to preserve as much history as possible, but is it really possible to preserve a 10-year-old prescription or a 5-year-old report in a physical format that can give doctors a fair idea of the patient’s history?

The challenge in India is that there is no single repository that houses all the historical data of an individual. The onus of preserving reports and prescriptions lies on the individual, and preserving physical files is a challenge these days. Hence, invariably, some history is lost.

Add to this that an individual may visit different hospitals for different, or maybe the same, problems and has multiple physical prescriptions/reports from all these hospitals.

Suddenly the individual develops a complication and is referred to another hospital, maybe in a different city or location.

The business problem we are trying to solve is - Is there some way we can build a central repo of the patient’s ailments (keeping security and integrity intact) and chart out the patient’s entire history from those physical prescriptions/reports, to enable doctors to diagnose better and prescribe a superior course of treatment?
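Charting a history from scattered documents comes down to grouping the digitized records by patient and ordering them by date, whichever hospital issued them. The sketch below shows only that step and leaves out the security and integrity concerns entirely. The `Record` fields, the `build_timelines` function and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Record:
    patient_id: str
    visit_date: date
    hospital: str
    kind: str      # for example "prescription" or "report"
    summary: str


def build_timelines(records: Iterable[Record]) -> Dict[str, List[Record]]:
    """Group records by patient and sort each group chronologically."""
    timelines: Dict[str, List[Record]] = {}
    for record in records:
        timelines.setdefault(record.patient_id, []).append(record)
    for history in timelines.values():
        history.sort(key=lambda record: record.visit_date)
    return timelines


# Made-up records, deliberately out of order and from different hospitals.
records = [
    Record("P001", date(2017, 3, 14), "Hospital B", "report", "Blood test"),
    Record("P001", date(2012, 8, 2), "Hospital A", "prescription", "Antibiotics"),
    Record("P002", date(2016, 1, 9), "Hospital A", "report", "X-ray"),
    Record("P001", date(2007, 5, 20), "Hospital C", "prescription", "Pain relief"),
]

timelines = build_timelines(records)
for record in timelines["P001"]:
    print(record.visit_date, record.hospital, record.kind, record.summary)
```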

Q&A -- 5 min

Requirements

This will be a walkthrough session wherein the workflow will be explained and a demo will be shown. Since the solution runs as a mobile app, only internet connectivity will be required and all the components will be explained.

Speaker bio

Nitin Saraswat is a Data Scientist and works as an independent Consultant advising organizations on the Big Data Machine Learning Stack for solving business problems. Nitin is passionate about technology-enabled disruptions and believes ML is not a job eroder but rather an opportunity creator.

Slides

https://www.slideshare.net/chunkybaba/making-sense-of-digital-and-physical-documents-using-ml-and-optical-character-recognition

The Fifth Elephant 2017