Extracting Data from Historical Documents: Harnessing LLMs to Parse Format-Variant Tables at Scale

Submitted Nov 9, 2025

Type of submission: 30 mins talk

Traditional OCR systems are excellent at recognizing text but fall short in understanding the structure of complex historical documents, particularly tables with varying column arrangements, inconsistent layouts, and annotations. This structural ambiguity often makes it infeasible to extract reliable, machine-readable data from large collections of such documents.

We encountered these challenges while extracting village-level demographic and public goods data from the district handbooks of India’s 1951 Population Census (PC51). These handbooks contain invaluable granular data that could unlock research spanning decades across economics, demography, and development. However, their formatting varies ad hoc from state to state and their visual structures are complex, which made OCR-based approaches unsuitable for systematic data extraction.

We developed an LLM-based pipeline purpose-built for large-scale extraction from such format-variant documents. Our approach leverages LLMs’ ability to interpret context and infer structural meaning, making it possible, for the first time, to extract harmonized microdata from these messy tabular layouts. Crucially, we focused on reliability: we combined LLMs with traditional rule-based techniques, such as manually defined table types and schema templates, and with a robust evaluation framework to ensure data accuracy and consistency.
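As a minimal sketch of how a manually defined table type and its schema template might constrain LLM output, consider the Python below. The table type `village_population`, the column names, and the male-plus-female check are illustrative assumptions, not the actual definitions used in the pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    kind: type  # expected Python type of the parsed cell value


# Hypothetical schema templates, keyed by a manually defined table type.
TEMPLATES = {
    "village_population": (
        Column("village_name", str),
        Column("households", int),
        Column("population_total", int),
        Column("population_male", int),
        Column("population_female", int),
    ),
}


def validate_row(table_type, row):
    """Return a list of problems found in one extracted row (empty if clean)."""
    template = TEMPLATES[table_type]
    problems = []

    # Structural checks: every templated column is present with the right type.
    for col in template:
        if col.name not in row:
            problems.append(f"missing column: {col.name}")
            continue
        value = row[col.name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, col.kind):
            problems.append(f"wrong type for {col.name}")

    # Columns the template does not know about are flagged, not dropped.
    expected = {col.name for col in template}
    for extra in sorted(set(row) - expected):
        problems.append(f"unexpected column: {extra}")

    # Rule-based consistency check (assumed rule): male + female == total.
    if not problems and table_type == "village_population":
        if row["population_male"] + row["population_female"] != row["population_total"]:
            problems.append("male + female != total")

    return problems
```

Rows that come back with a non-empty problem list could be re-prompted or routed to manual review, so the LLM handles layout interpretation while the template decides what counts as a valid record.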

This talk will present our technical pipeline architecture, evaluation methodology, and key lessons learned. We’ll share practical principles for designing LLM-based extraction systems that generalize and that balance automation with human-informed rules.
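One simple form such an evaluation could take is cell-level accuracy against a small hand-labeled sample. The sketch below is an assumption about how that might be computed, not the evaluation framework from the talk; the function name `cell_accuracy` and the `village_name` key are hypothetical.

```python
def cell_accuracy(extracted, gold, key="village_name"):
    """Share of hand-labeled cells that the extraction reproduced exactly.

    Rows are matched on `key`; a gold row with no extracted counterpart
    counts all of its cells as wrong.
    """
    extracted_by_key = {row[key]: row for row in extracted}
    total = 0
    correct = 0
    for gold_row in gold:
        extracted_row = extracted_by_key.get(gold_row[key], {})
        for column, gold_value in gold_row.items():
            if column == key:
                continue  # the key is used for matching, not scored
            total += 1
            if column in extracted_row and extracted_row[column] == gold_value:
                correct += 1
    return correct / total if total else 0.0
```

Scoring per cell rather than per row keeps a single misread digit from hiding an otherwise correct row, and tracking the score per table type would show which layouts need tighter templates.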

Key Takeaways:

How we built a reliable LLM-powered extraction pipeline for format-variant historical documents, such as district handbooks from the 1951 Population Census.
Combining contextual reasoning with rule-based templates and structured evaluations.
When to replace or augment traditional methods with LLM-driven extraction.
What new data sources can be unlocked as these models evolve.

Who Should Attend:

Data practitioners and researchers working with complex, historical, or unstructured document collections.
Engineers exploring scalable, reliable LLM workflows.
Anyone interested in unlocking data trapped in legacy formats.

Bio
Aryan Srivastava is a Data Scientist at Development Data Lab.

The Fifth Elephant 2025 Winter Edition Call for Proposals
