The Fifth Elephant 2020 edition

On data governance, engineering for data privacy and data science

Case Study - Information Retrieval from millions of legal documents using Deep Learning models

Submitted by Santosh on May 27, 2020

Status: Submitted


Information Retrieval (Named Entity Recognition) is one of the most widely used applications in NLP. Though most of us understand the building blocks of named entity recognition frameworks, we often overlook the challenges that arise in real-world problems, especially those that involve scale. Over the past year, the Data Science team at CoffeeBeans had the chance to work in depth on deep learning solutions for exactly such problems.

At a high level, the problem was to develop end-to-end deep learning solutions that extract over 100 fields from each of millions of legal documents, most of which deal with real estate transactions. Each document is 10 to 20 pages long. The framework is an ensemble of classification, extraction and relationship mapping models that extract entities ranging from document-level information to house addresses and the names of the parties involved in the transaction. This work has resulted in significant cost savings for our clients by reducing their dependency on manual effort.
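To make the three-stage design above concrete, here is a minimal sketch of how classification, extraction and relationship mapping might be chained for a single document. It is an illustration only, not the team's implementation: every name in it (`ExtractionResult`, `run_pipeline`, the `toy_*` functions, the `grantor` and `grantee` fields) is hypothetical, and the toy functions are simple rule-based stand-ins for the trained deep learning models.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Entities = Dict[str, List[str]]
Relation = Tuple[str, str, str]


@dataclass
class ExtractionResult:
    """Hypothetical container for everything extracted from one document."""
    doc_type: str
    entities: Entities = field(default_factory=dict)
    relations: List[Relation] = field(default_factory=list)


def run_pipeline(
    text: str,
    classify: Callable[[str], str],
    extractors: Dict[str, Callable[[str], Entities]],
    map_relations: Callable[[Entities], List[Relation]],
) -> ExtractionResult:
    """Chain the three stages: classify, then extract, then relate."""
    doc_type = classify(text)                    # stage 1: document-level label
    entities = extractors[doc_type](text)        # stage 2: type-specific extraction
    relations = map_relations(entities)          # stage 3: link extracted entities
    return ExtractionResult(doc_type, entities, relations)


# Toy stand-ins (assumptions) so the sketch runs without any trained model.
def toy_classify(text: str) -> str:
    return "deed" if "deed" in text.lower() else "other"


def toy_extract(text: str) -> Entities:
    match = re.search(r"between (\w+) and (\w+)", text)
    if not match:
        return {}
    return {"grantor": [match.group(1)], "grantee": [match.group(2)]}


def toy_relations(entities: Entities) -> List[Relation]:
    return [
        (grantor, "transfers_to", grantee)
        for grantor in entities.get("grantor", [])
        for grantee in entities.get("grantee", [])
    ]


result = run_pipeline(
    "This Deed is made between Alice and Bob.",
    toy_classify,
    {"deed": toy_extract, "other": lambda text: {}},
    toy_relations,
)
print(result)
```

Keeping each stage behind a plain callable means a rule-based stand-in can later be swapped for a trained model without changing the orchestration code.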

In this talk, we share our experiences and learnings from this work with the data science community. The presentation starts with an introduction to the problem and to the structure and scale of the data. We then discuss the challenges we faced during the data preprocessing, model training and deployment stages, along with the solutions we designed to tackle them. Finally, we share some observations made over the course of the model building process.


  • Introduction to Information Retrieval
  • Nature and structure of the data
  • Solution Design
  • Model Architecture: A variation of Bi-Directional LSTMs
  • Challenges and solutions
  • Frameworks, Tools and Tech-Stack


Basic knowledge of NLP and Deep Learning

Speaker bio

Santosh graduated from IIT Madras with a Dual Degree in Civil and Transportation Engineering. He has over 5 years of experience in data science, with expertise in Deep Learning. He developed an interest in Machine Learning and Computer Vision while he was part of the research group at the Intelligent Transportation Systems Laboratory, IIT Madras. After graduation, Santosh worked with FreeCharge, where he helped build their real-time fraud detection systems, which were based on advanced ML algorithms. Santosh is currently a Lead Data Scientist at CoffeeBeans, where he works with Fortune 500 clients, prototyping and deploying end-to-end Deep Learning solutions.

