The Fifth Elephant 2023 Monsoon

On AI, industrial applications of ML, and MLOps

Nishant Singh

@nrohlable

Efficient AI pipeline for Entity Extraction from Government Records

Submitted Jun 30, 2023

Abstract:

Efficient extraction of entities from government records such as PAN cards, Aadhaar cards and driving licences has become a priority for use cases like authentication, KYC compliance, partner/customer onboarding and age validation across a wide range of sectors. Solving this problem comes with several challenges: variations in image quality, uneven orientation, unnecessary background, compressed images, and correct detection of gaps between pieces of text. We could not find an open source solution that provides this entity information efficiently while handling all of these challenges; such a solution could help firms in logistics, manufacturing and service provision, as well as partner-based start-ups. Motivated by these observations from our initial analysis, we introduce an entity extraction pipeline that can be adapted to different government records by changing only the record-specific parts: the placement of entities and the entity-specific regex. Using this pipeline on PAN cards, we extracted Name, Date of Birth and PAN ID with 97.27%, 98.10% and 97.87% accuracy respectively.
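To make the idea of an entity-specific regex concrete, here is a minimal sketch for PAN cards. It is not the pipeline's actual code: the proposal does not list its regexes, so the patterns, the function name `extract_pan_entities` and the DD/MM/YYYY date layout are assumptions for illustration. The PAN ID pattern follows the standard format of five uppercase letters, four digits and one uppercase letter.

```python
import re

# Assumed patterns for illustration; the proposal does not publish its regexes.
# PAN ID: five uppercase letters, four digits, one uppercase letter.
PAN_ID_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")
# Date of birth, assumed to be printed as DD/MM/YYYY.
DOB_RE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")


def extract_pan_entities(ocr_lines):
    """Return the first PAN ID and date of birth found in OCR text lines."""
    entities = {"pan_id": None, "date_of_birth": None}
    for line in ocr_lines:
        if entities["pan_id"] is None:
            match = PAN_ID_RE.search(line)
            if match:
                entities["pan_id"] = match.group(0)
        if entities["date_of_birth"] is None:
            match = DOB_RE.search(line)
            if match:
                entities["date_of_birth"] = match.group(0)
    return entities
```

Adapting such a sketch to another record type would mean swapping in patterns for that record's entities, which is the kind of record-specific change the abstract describes.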

Pipeline Flow:

The pipeline consists of the following modules, applied in order:

a) Card Segmentation : Binary segmentation to detect the region of interest (ROI) in government record images
b) Segmentation Post-processor : Creation of contours from the mask produced by the segmentation block, and cropping of the region based on conditions
c) Image Preprocessor : Adjusting image properties for angle detection
d) Angle Detection : Detection of the correct orientation of text in the image using thresholding + mask and the Hough lines algorithm
e) OCR Block : Entity extraction using LanyOCR and creation of an information DataFrame
f) Regex Block : Entity-specific regex and entity positioning conditions, along with re-iteration of OCR based on bbox ratio and entity conditions
g) Optional Recognition Block : Gap detection using vertical histogram logic, and text recognition on the split images using docTR
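The gap detection in step (g) can be illustrated with a vertical projection histogram. The sketch below is an assumption about how such logic might look, not the authors' implementation: the function names `find_gaps` and `split_at_gaps`, the `min_gap` threshold, and the convention that text pixels are non-zero in a binarised image are all hypothetical.

```python
import numpy as np


def find_gaps(binary, min_gap=3):
    """Return (start, end) column ranges of interior blank runs.

    `binary` is a 2D array where text pixels are non-zero. A run of blank
    columns counts as a gap only if it is at least `min_gap` columns wide
    and does not touch the left or right edge (those are margins).
    """
    column_ink = binary.sum(axis=0)  # vertical histogram: ink per column
    blank = column_ink == 0
    gaps = []
    start = None
    for x, is_blank in enumerate(blank):
        if is_blank and start is None:
            start = x  # a blank run begins
        elif not is_blank and start is not None:
            # The blank run [start, x) has ended at a text column.
            if start > 0 and x - start >= min_gap:
                gaps.append((start, x))
            start = None
    # A run that reaches the right edge is never closed and is ignored.
    return gaps


def split_at_gaps(binary, min_gap=3):
    """Split the image into pieces at each detected gap."""
    pieces = []
    left = 0
    for gap_start, gap_end in find_gaps(binary, min_gap):
        pieces.append(binary[:, left:gap_start])
        left = gap_end
    pieces.append(binary[:, left:])
    return pieces
```

Each piece returned by `split_at_gaps` could then be passed to a text recognition model separately; how the pipeline actually feeds split images to docTR is not described in the proposal.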

Talk Outline:

We look forward to discussing our implementation in the following order:
1.) Importance of entity extraction from government records
2.) Problems associated with the real-time images used for entity extraction
3.) How we designed a pipeline of several modules to minimize the impact of these problems
4.) Overview of the pipeline and its modules, along with how they work
5.) Why we chose LanyOCR for entity extraction over other OCR algorithms
6.) How the pipeline can be used for various records by introducing minor changes
7.) The steps needed to overcome the problems that arise when doing so
8.) Possible drawbacks of the pipeline due to image quality and human-related errors
9.) Optimisation of the pipeline in terms of inference speed
10.) Use cases and further scope for improvement of the pipeline
