The Fifth Elephant 2017

On data engineering and application of ML in diverse domains

##Theme and format
The Fifth Elephant 2017 is a four-track conference on:

  1. Data engineering – building pipelines and platforms; exposure to latest open source tools for data mining and real-time analytics.
  2. Application of Machine Learning (ML) in diverse domains such as IOT, payments, e-commerce, education, ecology, government, agriculture, computational biology, social network analysis and emerging markets.
  3. Hands-on tutorials on data mining tools, and ML platforms and techniques.
  4. Off-the-record (OTR) sessions on privacy issues concerning data; building data pipelines; failure stories in ML; interesting problems to solve with data science; and other relevant topics.

The Fifth Elephant is a conference for practitioners, by practitioners.

Talk submissions are now closed.

You must submit the following details along with your proposal, or within 10 days of submission:

  1. Draft slides, mind map or a textual description detailing the structure and content of your talk.
  2. Link to a self-recorded, two-minute preview video, where you explain what your talk is about, and the key takeaways for participants. This preview video helps conference editors understand the lucidity of your thoughts and how invested you are in presenting insights beyond your use case. Please note that the preview video should be submitted irrespective of whether you have spoken at past editions of The Fifth Elephant.
  3. If you submit a workshop proposal, you must specify the target audience for your workshop; duration; number of participants you can accommodate; pre-requisites for the workshop; link to GitHub repositories and documents showing the full workshop plan.

##About the conference
This year is the sixth edition of The Fifth Elephant. The conference is a renowned gathering of data scientists, programmers, analysts, researchers, and technologists working in the areas of data mining, analytics, machine learning and deep learning from different domains.

We invite proposals for the following sessions, with a clear focus on the big picture and insights that participants can apply in their work:

  • Full-length, 40-minute talks.
  • Crisp, 15-minute talks.
  • Sponsored sessions, of 15 minutes and 40 minutes duration (limited slots available; subject to editorial scrutiny and approval).
  • Hands-on tutorials and workshop sessions of 3-hour and 6-hour duration where participants follow instructors on their laptops.
  • Off-the-record (OTR) sessions of 60-90 minutes duration.

##Selection Process

  1. Proposals will be filtered and shortlisted by an Editorial Panel.
  2. Proposers, editors and community members must respond to comments as openly as possible so that the selection process is transparent.
  3. Proposers are also encouraged to vote and comment on other proposals submitted here.


We will notify you if we move your proposal to the next round or reject it. A speaker is NOT confirmed for a slot unless we explicitly mention so in an email or over any other medium of communication.

Selected speakers must participate in one or two rounds of rehearsals before the conference. This is mandatory and helps you to prepare well for the conference.

There is only one speaker per session. Entry is free for selected speakers.

##Travel grants
Partial or full grants, covering travel and accommodation, are made available to speakers delivering full sessions (40 minutes) and workshops. Grants are limited, and are given in order of preference to students, women, persons of non-binary genders, and speakers from Asia and Africa.

##Commitment to Open Source
We believe in open source as the binding force of our community. If you are describing a codebase for developers to work with, we’d like for it to be available under a permissive open source licence. If your software is commercially licensed or available under a combination of commercial and restrictive open source licences (such as the various forms of the GPL), you should consider picking up a sponsorship. We recognise that there are valid reasons for commercial licensing, but ask that you support the conference in return for giving you an audience. Your session will be marked on the schedule as a “sponsored session”.

##Important Dates:

  • Deadline for submitting proposals: June 10
  • First draft of the conference schedule: June 20
  • Tutorial and workshop announcements: June 20
  • Final conference schedule: July 5
  • Conference dates: 27-28 July

##Contact
For more information about speaking proposals, tickets and sponsorships, contact info@hasgeek.com or call +91-7676332020.

Hosted by

The Fifth Elephant - known as one of the best data science and Machine Learning conferences in Asia - has transitioned into a year-round forum for conversations about data and ML engineering; data science in production; data security and privacy practices.

Nitin Saraswat

@chunky

Making sense of Digital and Physical Documents using ML and Optical Character Recognition

Submitted Jun 10, 2017

Have you ever wondered what you could do with the piece of paper you are handed when you make a purchase at your local grocery store, fill up your car’s tank, see a doctor when you are ill, or go to a lender for a quick loan?

People usually throw away this piece of paper, called a receipt or bill, thinking it’s just another piece of junk.

But our startup has realized that it is more than a piece of paper: it holds the key to untapped data that can empower consumers and businesses equally, and help in intelligent decision making, forecasting, boosting sales, improving diagnosis and much more.

In this session we will share how Optical Character Recognition and Machine Learning can have a significant impact in making sense of data that is not directly available in a format that can be processed in an ML pipeline.

As of now, no single solution exists that tackles the problem of extracting data from such physical or digital documents and applying Machine Learning to bring out intelligent insights.

Outline

This will be a walkthrough and demo session in which raw data will be sourced in real time and the different pieces of the solution will be discussed, namely:

  • Optical Character Recognition

  • Data extraction and Text Parsing

  • Data Storage

  • Getting Predictions and Recommendations out of ML models

  • End user Visualization

The software used is mainly open source or free, such as Python, MySQL, Redis and Spark 2.0. The OCR engine used is ABBYY FineReader. We will spend 5 minutes on this.
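To make the data extraction and text parsing step concrete, here is a minimal sketch in Python. It assumes the OCR engine has already returned plain text, and that each line item follows a hypothetical layout of item name, quantity, unit price and line total. The `LineItem` class, the `parse_receipt` function and the sample receipt are illustrative only and are not taken from the actual solution; real receipts vary widely and would need more robust handling.

```python
import re
from dataclasses import dataclass
from typing import List

# Assumed (hypothetical) layout of a line item in the OCR output:
#   <item name> <quantity> <unit price> <line total>
LINE_ITEM = re.compile(
    r"^(?P<name>.+?)\s+(?P<qty>\d+(?:\.\d+)?)\s+"
    r"(?P<unit_price>\d+\.\d{2})\s+(?P<total>\d+\.\d{2})$"
)


@dataclass
class LineItem:
    name: str
    quantity: float
    unit_price: float
    total: float


def parse_receipt(ocr_text: str) -> List[LineItem]:
    """Keep only the lines that look like line items; skip headers and totals."""
    items = []
    for raw_line in ocr_text.splitlines():
        match = LINE_ITEM.match(raw_line.strip())
        if match is None:
            continue  # store name, bill number, TOTAL row, blank lines, etc.
        items.append(
            LineItem(
                name=match["name"].strip(),
                quantity=float(match["qty"]),
                unit_price=float(match["unit_price"]),
                total=float(match["total"]),
            )
        )
    return items


# Made-up sample text standing in for OCR output.
SAMPLE_RECEIPT = """\
FRESH MART
Bill No 1042  10/06/2017
Sugar 1kg      2   45.00   90.00
Toor Dal 500g  1   62.50   62.50
TOTAL                     152.50
"""

if __name__ == "__main__":
    for item in parse_receipt(SAMPLE_RECEIPT):
        print(item)
```

The structured records produced this way are what would then be stored and passed on to the ML models.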

We will pick 3 use cases from Retail, Finance and Healthcare and demonstrate how the solution works.

Retail -- 10 min

From a Consumer’s Perspective -

You just bought some groceries, and 10 days later you realize the sugar is finished. Then comes the sudden awakening: did you even buy sugar in your last purchase?

If you are single, maybe you can spend a few days basking in the glory of being healthy by cutting down on sugar, but if you are married you don’t have a choice :) ! Guilt-ridden, you buy twice the usual quantity, and even if it’s the usual quantity it will last longer and sit as inventory at your home.

Now, you buy multiple items in a single visit, and each is tracked at the SKU (stock keeping unit) level at the grocery store.

The business problem we are trying to solve is: can you keep track of your purchases using the information pulled from your grocery bills, let the data tell you when and at what SKU level you need to buy grocery products, and perhaps optimize your monthly spending?
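One simple way to picture "letting the data speak" about when to repurchase is sketched below. It assumes purchase records of (SKU, purchase date) have already been extracted from bills, and it predicts the next purchase as the last purchase date plus the average gap between past purchases. This averaging heuristic and the `predict_next_purchase` function are illustrative assumptions, not the model used in the solution.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, Iterable, Tuple


def predict_next_purchase(
    purchases: Iterable[Tuple[str, date]]
) -> Dict[str, date]:
    """Estimate the next purchase date per SKU from the mean gap between purchases."""
    dates_by_sku = defaultdict(set)
    for sku, purchased_on in purchases:
        dates_by_sku[sku].add(purchased_on)

    predictions = {}
    for sku, dates in dates_by_sku.items():
        ordered = sorted(dates)
        if len(ordered) < 2:
            continue  # a single purchase gives no interval to learn from
        gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
        mean_gap = sum(gaps) / len(gaps)
        predictions[sku] = ordered[-1] + timedelta(days=round(mean_gap))
    return predictions


# Made-up purchase history for illustration.
HISTORY = [
    ("sugar-1kg", date(2017, 4, 1)),
    ("sugar-1kg", date(2017, 5, 1)),
    ("sugar-1kg", date(2017, 5, 31)),
    ("salt-1kg", date(2017, 5, 1)),
]

if __name__ == "__main__":
    for sku, due in predict_next_purchase(HISTORY).items():
        print(sku, "is likely needed around", due.isoformat())
```

A production model would also need to account for purchased quantity, household size and seasonality; this sketch only looks at purchase dates.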

Finance -- 10 min

Let’s think from a business perspective -

Suppose you are a lending company and there is an untapped market for lending in segments such as small-ticket loans and non-salaried people. There are very few sources for judging an individual’s loan eligibility, such as monthly bank account statements and documents for any existing loans.

As an average-sized lending company, you will need to make sense of the monthly bank account statements of at least 20-30 people (at a city level, conservatively) on a daily basis, and any error in manually scanning and interpreting these documents means lost business and a bad reputation.

Take this to Tier 2 and Tier 3 cities in a developing economy such as India, and you have a massive untapped pool of data ready to be explored.

The business problem we are trying to solve is: is there an intelligent way to extract data from these documents and apply Machine Learning models to determine an individual’s loan eligibility and risk profile?
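As a rough illustration of what "extracting data" could feed into a model, the sketch below turns already-parsed bank statement rows into a few summary features. The row layout (date, signed amount, balance after the transaction), the `statement_features` function and the chosen features are all assumptions made for this example; they are not the features or models used in the solution.

```python
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Dict, List, Tuple

# Assumed row layout: (transaction date, signed amount, balance after transaction).
# A positive amount is a credit, a negative amount is a debit.
Transaction = Tuple[date, float, float]


def statement_features(transactions: List[Transaction]) -> Dict[str, float]:
    """Summarise a statement into simple features a model could consume."""
    if not transactions:
        raise ValueError("at least one transaction is required")

    inflow = defaultdict(float)
    outflow = defaultdict(float)
    for day, amount, _balance in transactions:
        month = (day.year, day.month)
        if amount >= 0:
            inflow[month] += amount
        else:
            outflow[month] += -amount

    months = sorted(set(inflow) | set(outflow))
    return {
        "avg_monthly_inflow": mean(inflow[m] for m in months),
        "avg_monthly_outflow": mean(outflow[m] for m in months),
        "min_balance": min(balance for _day, _amount, balance in transactions),
    }


# Made-up statement rows for illustration.
STATEMENT = [
    (date(2017, 4, 1), 20000.0, 25000.0),
    (date(2017, 4, 15), -12000.0, 13000.0),
    (date(2017, 5, 1), 22000.0, 35000.0),
    (date(2017, 5, 20), -15000.0, 20000.0),
]

if __name__ == "__main__":
    print(statement_features(STATEMENT))
```

Features like these would be one input among many; eligibility and risk decisions would depend on the model and the policy built on top of them.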

Healthcare -- 10 min

Another striking use case is in the Healthcare industry -

It is well known that a patient’s history is very important in assessing ailments and deciding upon an optimal course of treatment. Patients try to preserve as much history as possible, but is it really possible to preserve a 10-year-old prescription or a 5-year-old report in a physical format that can give doctors a fair idea of the patient’s history?

The challenge in India is that there is no single repository that houses all the historical data of an individual. The onus of preserving reports and prescriptions lies on the individual, and preserving physical files is a challenge these days. Hence, invariably, some history is lost.

Add to this that an individual may visit different hospitals for different, or maybe the same, problems and has multiple physical prescriptions/reports from all these hospitals.

Suddenly the individual develops some complication and is referred to another hospital, maybe in a different city or location.

The business problem we are trying to solve is: can we build a central repository of the patient’s ailments (keeping security and integrity intact) and chart out the patient’s entire history from those physical prescriptions/reports, to enable doctors to diagnose better and prescribe a superior course of treatment?

Q&A -- 5 min

Requirements

This will be a walkthrough session wherein the workflow will be explained and a demo will be shown. Since the solution runs as a mobile app, only internet connectivity will be required and all the components will be explained.

Speaker bio

Nitin Saraswat is a Data Scientist and works as an independent consultant advising organizations on the Big Data Machine Learning stack for solving business problems. Nitin is passionate about technology-enabled disruptions and believes ML is not a job eroder but rather a creator of opportunities.

Slides

https://www.slideshare.net/chunkybaba/making-sense-of-digital-and-physical-documents-using-ml-and-optical-character-recognition

