PyCon is the gathering for the community using and developing the open-source Python programming language. This is the first year of PyCon Pune, where the community will meet for two days of talks and two days of dev sprints for working on upstream projects. The CFP ends on 30th November AoE.
Applying Machine Learning Technique to Create Hierarchical Structure out of Documents
This talk is about understanding the structure of a document and presenting the document in a hierarchical format, such as a table of contents, using machine learning techniques.
Generally, when users write a document, its structure (format) is defined by font size, bold, and various other attributes. There is no explicit representation of the hierarchy of information, in other words, no table of contents. Extracting this structure is key to several business use cases, for example RFPs (Requests for Proposals). RFPs are similar across different clients, and you may want to reuse an RFP for another client by modifying a few paragraphs or sections. Identifying similar or duplicate content across different RFPs is therefore needed. To identify duplicate content, we first need to extract the content from the RFPs and then run duplicate detection. Extracting the content amounts to identifying the table of contents of the document, and this can be done using a machine learning technique, viz. decision trees.
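To make the idea concrete, here is a minimal sketch of how a decision tree could label lines as headings or body text from formatting attributes and then group the headings into a table of contents. It is an illustration only, not the speaker's implementation: the two features (font size, bold flag), the labels `h1`/`h2`/`body`, the training rows, and the sample RFP lines are all hypothetical, and the tree is a small hand-rolled Gini-based learner written with the standard library so that it runs as-is.

```python
from collections import Counter

# Hypothetical labelled lines: (font_size, is_bold) -> label.
TRAIN = [
    ((18, 1), "h1"), ((20, 1), "h1"),
    ((14, 1), "h2"), ((13, 1), "h2"),
    ((11, 0), "body"), ((11, 1), "body"),
    ((10, 0), "body"), ((12, 0), "body"),
]

def gini(labels):
    """Gini impurity of a list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows):
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini([y for _, y in rows])
    for f in range(len(rows[0][0])):
        for t in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows):
    """Recursively split until no split improves purity; leaves are labels."""
    split = best_split(rows)
    if split is None:
        return Counter(y for _, y in rows).most_common(1)[0][0]
    f, t = split
    left = [r for r in rows if r[0][f] <= t]
    right = [r for r in rows if r[0][f] > t]
    return (f, t, build_tree(left), build_tree(right))

def predict(tree, x):
    """Walk from the root to a leaf and return its label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree

def build_outline(lines, tree):
    """Nest predicted h2 lines under the most recent h1; body text is skipped."""
    outline = []
    for text, size, bold in lines:
        label = predict(tree, (size, bold))
        if label == "h1":
            outline.append({"title": text, "sections": []})
        elif label == "h2" and outline:  # an h2 before any h1 is dropped
            outline[-1]["sections"].append(text)
    return outline

TREE = build_tree(TRAIN)

# Hypothetical lines from an RFP: (text, font_size, is_bold).
DOC = [
    ("Scope of Work", 18, 1),
    ("The vendor shall provide the services listed here.", 11, 0),
    ("Deliverables", 14, 1),
    ("Timeline", 14, 1),
    ("Pricing", 18, 1),
    ("Payment Terms", 13, 1),
]

for section in build_outline(DOC, TREE):
    print(section["title"])
    for sub in section["sections"]:
        print("  " + sub)
```

In practice, the feature set would likely be richer (indentation, numbering, capitalization, and so on), and a library implementation of decision trees could replace the hand-written learner; the extracted sections could then feed the duplicate detection step described above.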
Machine Learning - Decision Tree.
Demo of Results
Currently working as a Sr. Data Scientist at Red Hat. I have over a decade of experience in the data field, solving various data-related problems. I have helped various organizations convert business problems into technical ones. I have worked for companies like GE and Network18, and also have exposure to the startup world.
My experience spans building products from scratch, working on over half a dozen data products, managing and leading a team, and running my own firm. I have spent my career in startups and have broad experience in machine learning, search, NLP, crawling, scalability, etc. I have also worked in different data domains: media, finance (news), healthcare, conversation (chatbots), etc.