The Fifth Elephant 2023 Monsoon

On AI, industrial applications of ML, and MLOps

Nishant Singh

@nrohlable

Efficient AI pipeline for Entity Extraction from Government Records

Submitted Jun 30, 2023

Abstract:

Efficient extraction of entities from Government records such as the PAN card, Aadhaar card, and Driving License has become a priority for use cases such as authentication, KYC compliance, partner/customer onboarding, and age validation across many sectors. The problem comes with several challenges: variation in image quality, uneven orientation, unnecessary background, compressed images, and correct detection of gaps between text segments. We could not find an open-source solution that provides these entities efficiently while handling all of the challenges above; such a solution could help firms in logistics, manufacturing, and service provision, as well as partner-based start-ups. Motivated by these observations from our initial analysis, we introduce an Entity Extraction pipeline that can be reused for different Government records by changing only the record-specific parts: the expected placement of entities and the entity-specific regex. Using this pipeline, we extracted Name, Date of Birth, and PAN ID from PAN cards with 97.27%, 98.10%, and 97.87% accuracy respectively.
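To illustrate what an "entity-specific regex" might look like, here is a minimal sketch for two of the PAN card entities. It relies on the standard PAN format (five letters, four digits, one letter). The DD/MM/YYYY date layout, the function names, and the sample OCR text are assumptions made for illustration, not the patterns used in our pipeline.

```python
import re
from datetime import datetime
from typing import Optional

# PAN format: five uppercase letters, four digits, one uppercase letter.
PAN_PATTERN = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")

# Assumption: the date of birth is printed as DD/MM/YYYY (or DD-MM-YYYY).
DOB_PATTERN = re.compile(r"\b(\d{2})[/-](\d{2})[/-](\d{4})\b")


def extract_pan_id(ocr_text: str) -> Optional[str]:
    """Return the first PAN-shaped token in the OCR text, or None."""
    match = PAN_PATTERN.search(ocr_text.upper())
    return match.group(0) if match else None


def extract_date_of_birth(ocr_text: str) -> Optional[str]:
    """Return the first valid calendar date as DD/MM/YYYY, or None."""
    for match in DOB_PATTERN.finditer(ocr_text):
        day, month, year = match.groups()
        try:
            # Reject OCR misreads such as 32/13/1990.
            datetime(int(year), int(month), int(day))
        except ValueError:
            continue
        return f"{day}/{month}/{year}"
    return None


# Hypothetical OCR output, one detected text line per row.
sample_text = "\n".join([
    "INCOME TAX DEPARTMENT",
    "RAHUL KUMAR",
    "12/08/1990",
    "Permanent Account Number",
    "ABCDE1234F",
])

pan_id = extract_pan_id(sample_text)
date_of_birth = extract_date_of_birth(sample_text)
```

Supporting a different record type would then mean swapping in patterns for that record's identifiers, while the rest of the flow stays unchanged.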

Pipeline Flow:

The pipeline consists of the following modules, in order:

a) Card Segmentation : Binary segmentation to detect the ROI in Government record images
b) Segmentation Post-processor : Creation of contours from the mask produced by the segmentation block, and cropping based on conditions
c) Image Preprocessor : Adjusting image properties for angle detection
d) Angle Detection : Detection of the correct orientation of text in the image using thresholding + mask and the Hough lines algorithm
e) OCR Block : Entity extraction using LanyOCR and creation of an information DataFrame
f) Regex Block : Entity-specific regex and entity positioning conditions, along with re-iteration of OCR based on bbox ratio and entity conditions
g) Optional Recognition Block : Gap detection using vertical histogram logic, and text recognition on the split images using docTR
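
The gap detection in step g) can be sketched in a few lines. This is a minimal illustration of the general vertical-histogram technique, assuming a binarized text crop in which 1 marks an ink pixel. The function names, the `min_gap_width` parameter, and the toy image are hypothetical and do not come from our implementation.

```python
from typing import List, Tuple

BinaryImage = List[List[int]]  # rows of 0 (background) / 1 (ink)


def vertical_histogram(binary_image: BinaryImage) -> List[int]:
    """Count ink pixels in every column of the image."""
    if not binary_image:
        return []
    width = len(binary_image[0])
    return [sum(row[x] for row in binary_image) for x in range(width)]


def find_gaps(histogram: List[int], min_gap_width: int) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, of empty runs that
    are at least min_gap_width wide. Left and right margins are ignored."""
    gaps = []
    run_start = None
    for x, count in enumerate(histogram):
        if count == 0:
            if run_start is None:
                run_start = x
        elif run_start is not None:
            # A run starting at column 0 is the left margin, not a gap.
            if run_start > 0 and x - run_start >= min_gap_width:
                gaps.append((run_start, x))
            run_start = None
    # A run still open here is the right margin and is ignored.
    return gaps


def split_columns(histogram: List[int], min_gap_width: int) -> List[Tuple[int, int]]:
    """Return the column range of each text segment between detected gaps."""
    ink_columns = [x for x, count in enumerate(histogram) if count > 0]
    if not ink_columns:
        return []
    segments = []
    segment_start = ink_columns[0]
    for gap_start, gap_end in find_gaps(histogram, min_gap_width):
        segments.append((segment_start, gap_start))
        segment_start = gap_end
    segments.append((segment_start, ink_columns[-1] + 1))
    return segments


# Toy 2x10 crop: two ink blobs separated by three empty columns.
toy_image = [
    [0, 1, 1, 0, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 1, 1, 0],
]
toy_histogram = vertical_histogram(toy_image)
toy_segments = split_columns(toy_histogram, min_gap_width=2)
```

Each returned column range could then be cropped out and passed to the text recognizer separately. The choice of `min_gap_width` decides whether a run of empty columns counts as a gap between words or just spacing between characters.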

Talk Outline:

We look forward to discussing our implementation in the following order:
1.) Importance of entity extraction from Government records
2.) Problems associated with real-time images used for entity extraction
3.) How we designed a pipeline of several modules to minimize the impact of these problems
4.) Overview of the pipeline and its modules, along with how they work
5.) Why we chose LanyOCR for entity extraction over other OCR algorithms
6.) How this pipeline can be used for various records with minor changes
7.) The steps needed to overcome the problems related to this adaptation
8.) Possible drawbacks of the pipeline due to image quality and human errors
9.) Optimisation of the pipeline in terms of inference speed
10.) Various use cases and further scope for improving the pipeline
