Applying Machine Learning Technique to Create Hierarchical Structure out of Documents

Feb 16 (Thu) and Feb 17 (Fri), 2017, 09:00 AM – 06:00 PM IST

AMANORA THE FERN HOTELS AND CLUB, PUNE

Submitted Nov 29, 2016

Technical level: Intermediate

This talk is about understanding the structure of a document and presenting it in a hierarchical format, like a table of contents, using machine learning techniques.

Generally, when users write a document, its structure (format) is defined by font size, bold, and various other attributes. There is no explicit representation of the hierarchy of information, in other words, no table of contents. Extracting this structure is key to several business use cases, for example RFPs (Requests for Proposals). RFPs are similar across clients, and you may want to reuse an RFP for another client by modifying a few paragraphs or sections. So identifying similar or duplicate content across RFPs is needed. To identify duplicate content, we first need to extract the content from the RFPs and then run duplicate detection. Extracting the content amounts to identifying the table of contents of the document, and this can be done using a machine learning technique, viz. decision trees.
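
To make the idea above concrete, here is a minimal sketch in plain Python, not the speaker's actual implementation. It assumes each line of a document has already been reduced to two hypothetical features, (font_size, is_bold), and that a few lines have been hand-labelled as "h1", "h2", or "body". The training rows, feature choice, and all function names are invented for illustration. A small decision tree is grown with Gini impurity, and its predictions are then nested into a table-of-contents structure. In practice a library classifier such as scikit-learn's DecisionTreeClassifier would likely replace the hand-written learner.

```python
from collections import Counter

# Hypothetical toy training data: (font_size, is_bold) -> label.
# All values are invented for illustration only.
TRAIN = [
    ((24, 1), "h1"), ((22, 1), "h1"),
    ((16, 1), "h2"), ((15, 1), "h2"),
    ((11, 0), "body"), ((11, 1), "body"), ((10, 0), "body"),
]
LEVEL = {"h1": 1, "h2": 2}  # heading label -> nesting depth


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows):
    """Return the (feature, threshold) that most reduces Gini, or None."""
    best, best_score = None, gini([y for _, y in rows])
    for f in range(len(rows[0][0])):
        for t in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows):
    """Grow the tree recursively; a leaf is the majority label."""
    split = best_split(rows)
    if split is None:
        return Counter(y for _, y in rows).most_common(1)[0][0]
    f, t = split
    left = [r for r in rows if r[0][f] <= t]
    right = [r for r in rows if r[0][f] > t]
    return (f, t, build_tree(left), build_tree(right))


def predict(tree, x):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree


def build_outline(lines, tree):
    """Nest (text, features) lines into a table-of-contents tree."""
    root = {"title": "ROOT", "level": 0, "children": [], "body": []}
    stack = [root]
    for text, features in lines:
        label = predict(tree, features)
        if label == "body":
            stack[-1]["body"].append(text)  # attach text to the open heading
            continue
        node = {"title": text, "level": LEVEL[label], "children": [], "body": []}
        while stack[-1]["level"] >= node["level"]:
            stack.pop()  # close headings at the same or deeper level
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


def print_outline(node, indent=0):
    """Print heading titles indented by their depth."""
    for child in node["children"]:
        print("  " * indent + child["title"])
        print_outline(child, indent + 1)


# Hypothetical RFP-like document, one (text, features) pair per line.
DOC = [
    ("Scope of Work", (23, 1)),
    ("Deliverables", (16, 1)),
    ("The vendor will provide weekly reports.", (11, 0)),
    ("Timeline", (15, 1)),
    ("Pricing", (24, 1)),
]

tree = build_tree(TRAIN)
outline = build_outline(DOC, tree)
print_outline(outline)
```

Once the content sits under its headings like this, sections from different RFPs can be compared heading by heading, which is the input the duplicate-detection step needs.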

Outline

Introduction
Information Extraction
Machine Learning - Decision Tree
Demo of Results

Speaker bio

Currently working as Sr. Data Scientist at Red Hat. Over a decade of experience in the data field, solving various data-related problems. Helped various organizations convert business problems into technical ones. Worked for companies like GE and Network18, and also have exposure to the startup world.
My experience spans building products from scratch, working on over half a dozen data products, managing and leading a team, and running my own firm. I have spent my career in startups and have broad experience in machine learning, search, NLP, crawling, scalability, etc. I have also worked in different data domains: media, finance (news), healthcare, conversation (chatbots), etc.

PyCon Pune 2017