Aryan Srivastava

Extracting Data from Historical Documents: Harnessing LLMs to Parse Format-Variant Tables at Scale

Submitted Nov 9, 2025

Traditional OCR systems are excellent at recognizing text but fall short in understanding the structure of complex historical documents—particularly tables with varying column arrangements, inconsistent layouts, and annotations. This structural ambiguity often makes it infeasible to extract reliable, machine-readable data from large collections of such documents.

We encountered these challenges while extracting village-level demographic and public goods data from India’s 1951 Population Census (PC51) district handbooks. These handbooks contain invaluable granular data that could support research spanning decades across economics, demography, and development. However, their formatting varies ad hoc from state to state, and their visual structure is complex, which made OCR-based approaches unsuitable for systematic data extraction.

We developed an LLM-based pipeline purpose-built for large-scale extraction from such format-variant documents. Our approach leverages LLMs’ ability to interpret context and infer structural meaning, making it possible, for the first time, to extract harmonized microdata from these messy tabular layouts. Crucially, we focused on reliability: we combined LLMs with traditional, rule-based techniques such as manually defined table types and schema templates, and backed the pipeline with a robust evaluation framework to ensure data accuracy and consistency.
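To make the idea of pairing LLM output with rule-based templates concrete, here is a minimal sketch, not the pipeline presented in the talk. It assumes the LLM returns each table as a JSON list of row objects, and that every manually defined table type has a schema template listing expected columns, types, and consistency checks. The table type `village_population`, the column names, and the helper `validate_rows` are all hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TableSchema:
    """Schema template for one manually defined table type (illustrative)."""
    columns: dict                                # column name -> expected Python type
    checks: list = field(default_factory=list)   # (message, predicate) pairs


# Hypothetical table type and columns; real handbooks will differ.
SCHEMAS = {
    "village_population": TableSchema(
        columns={
            "village_name": str,
            "population_total": int,
            "population_male": int,
            "population_female": int,
        },
        checks=[
            (
                "male + female != total",
                lambda r: r["population_male"] + r["population_female"]
                == r["population_total"],
            ),
        ],
    ),
}


def validate_rows(table_type, llm_json):
    """Split LLM-extracted rows into (valid_rows, errors) using the schema template."""
    schema = SCHEMAS[table_type]
    valid, errors = [], []
    for index, row in enumerate(json.loads(llm_json)):
        problems, clean = [], {}
        for name, kind in schema.columns.items():
            value = row.get(name)
            if value is None:
                problems.append(f"missing {name}")
                continue
            try:
                clean[name] = kind(value)  # coerce, e.g. "120" -> 120
            except (TypeError, ValueError):
                problems.append(f"bad {name}: {value!r}")
        if not problems:
            # Run rule-based consistency checks only on fully typed rows.
            problems = [msg for msg, ok in schema.checks if not ok(clean)]
        if problems:
            errors.append((index, problems))
        else:
            valid.append(clean)
    return valid, errors
```

Rows that fail a type or consistency check are returned with their index, so they can be routed to re-extraction or manual review instead of silently entering the dataset.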

This talk will present our technical pipeline architecture, evaluation methodology, and key lessons learned. We’ll share practical principles for designing LLM-based extraction systems that generalize and that balance automation with human-informed rules.

Key Takeaways:

  • How we built a reliable LLM-powered extraction pipeline for format-variant historical documents, such as district handbooks from the 1951 Population Census.

  • How to combine contextual reasoning with rule-based templates and structured evaluations.

  • When to replace or augment traditional methods with LLM-driven extraction.

  • What new data sources can be unlocked as these models evolve.

Who Should Attend:

  • Data practitioners and researchers working with complex, historical, or unstructured document collections.

  • Engineers exploring scalable, reliable LLM workflows.

  • Anyone interested in unlocking data trapped in legacy formats.

Bio
Aryan Srivastava is a Data Scientist at Development Data Lab.

