Jul 2026

20 Mon

21 Tue

22 Wed

23 Thu

24 Fri

25 Sat

26 Sun

Jul 2026

27 Mon

28 Tue

29 Wed

30 Thu

31 Fri 08:45 AM – 06:00 PM IST

1 Sat

2 Sun

NIMHANS Convention Centre, Bengaluru,

This submission has been added to the schedule

From PDF to SQL

Submitted Jun 25, 2026

I am submitting for: Track 2 - Building & implementing AI tools & agents in production
Type of session: 15 mins talk

Fifth Elephant 2026 Submission

Description

Most “ask my documents” systems stop at retrieval: upload PDFs, create embeddings, and search for relevant chunks. That works when the user wants surrounding context, but it starts breaking down when the user needs numbers for a report, dashboard, analysis, or presentation. In Indian mutual fund reports, the useful answers are usually structured facts: scheme returns, AUM, portfolio holdings, sector allocations, risk metrics, benchmark comparisons, and month-on-month changes. For these workflows, the durable output is not a vector. It is a queryable domain model.
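To illustrate the difference, here is a minimal sketch of a numeric question answered from a table rather than from retrieved chunks. It is not the production schema: the `holdings` table, its column names, and the sample rows are all hypothetical, and SQLite stands in for whatever query engine sits on top of the extracted data.

```python
import sqlite3

# Hypothetical holdings rows: (scheme, month, sector, weight_pct).
# The values are made up purely for illustration.
ROWS = [
    ("Fund A", "2026-04", "Financials", 30.0),
    ("Fund A", "2026-04", "IT", 20.0),
    ("Fund A", "2026-05", "Financials", 34.0),
    ("Fund A", "2026-05", "IT", 18.0),
]


def sector_change(rows, scheme, sector, prev_month, curr_month):
    """Return the change in a sector's weight between two months (percentage points)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE holdings (scheme TEXT, month TEXT, sector TEXT, weight_pct REAL)"
    )
    conn.executemany("INSERT INTO holdings VALUES (?, ?, ?, ?)", rows)
    # Month-on-month change is a plain aggregation once the data is tabular.
    query = """
        SELECT
            SUM(CASE WHEN month = ? THEN weight_pct ELSE 0 END)
          - SUM(CASE WHEN month = ? THEN weight_pct ELSE 0 END)
        FROM holdings
        WHERE scheme = ? AND sector = ?
    """
    (delta,) = conn.execute(query, (curr_month, prev_month, scheme, sector)).fetchone()
    conn.close()
    return delta
```

Calling `sector_change(ROWS, "Fund A", "Financials", "2026-04", "2026-05")` returns the same number every time it is asked, which is the property a report or dashboard needs.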

This session shares lessons from building a production pipeline that converts Indian mutual fund PDFs, Excel holdings files, and visual financial reports into structured tables. The system renders pages, handles OCR and visual-table ambiguity, runs a two-pass LLM workflow for raw capture and schema normalization, writes Parquet outputs, benchmarks model choices, tracks extraction cost, and exposes the final data through OrcaSheets for plain-English and SQL-style querying. I will cover the parts that worked, and the parts that were painful: dense holdings pages, page chunking, inconsistent AMC formats, hallucinated rows, schema drift, cost control, human review loops, and why converting unstructured data into a known queryable format can be more valuable than repeatedly searching it.
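The schema-normalization pass can be pictured roughly as follows. This is a sketch under stated assumptions, not the pipeline from the talk: the `Holding` fields, the `HEADER_ALIASES` mapping, and the rejection rules are hypothetical, and the first pass (raw capture by the LLM) is represented only by its assumed output, a dict of header text to cell text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Holding:
    """Hypothetical target schema for one normalized holdings row."""
    scheme: str
    instrument: str
    weight_pct: float


# Hypothetical header variants across AMC formats, mapped to one schema.
HEADER_ALIASES = {
    "name of instrument": "instrument",
    "company": "instrument",
    "% to nav": "weight_pct",
    "% of net assets": "weight_pct",
}


def parse_percent(text: str) -> Optional[float]:
    """Parse '9.25%' or '9.25' into a float; return None if it is not a valid percentage."""
    cleaned = text.strip().rstrip("%").replace(",", "").strip()
    try:
        value = float(cleaned)
    except ValueError:
        return None
    # Also rejects NaN and infinity, because those comparisons are False.
    return value if 0.0 <= value <= 100.0 else None


def normalize(scheme: str, raw_row: dict) -> Optional[Holding]:
    """Map a raw captured row onto the schema, or return None to reject it."""
    mapped = {}
    for header, cell in raw_row.items():
        field = HEADER_ALIASES.get(header.strip().lower())
        if field is not None:
            mapped[field] = cell
    instrument = mapped.get("instrument", "").strip()
    weight = parse_percent(mapped.get("weight_pct", ""))
    if not instrument or weight is None:
        return None  # Unmapped headers or unparseable values are not guessed at.
    return Holding(scheme, instrument, weight)
```

Rejecting a row instead of guessing keeps schema drift visible: a new header variant shows up as a spike in rejected rows rather than as silently wrong numbers.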

1-2 Takeaways

For numeric insight, structured extraction often beats vector retrieval. Vectors help find context, but SQL-like tables are better for aggregation, comparison, auditability, and repeatable answers.
Domain-specific document conversion is hard, but highly rewarding. Once messy reports become a known schema, query cost drops, outputs become more predictable, and humans can review rows instead of reading entire documents again.
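One way row-level review can work once the data is tabular is a cheap arithmetic check that decides which schemes a human should look at. The sketch below is an assumption, not the review loop from the talk: it supposes each row is a hypothetical `(scheme, instrument, weight_pct)` tuple and that a scheme's weights should total roughly 100%, which depends on how cash and other assets are reported.

```python
from collections import defaultdict


def schemes_needing_review(rows, tolerance_pct=1.0):
    """Return schemes whose extracted weights do not total about 100%.

    rows: iterable of (scheme, instrument, weight_pct) tuples (hypothetical layout).
    tolerance_pct: allowed deviation from 100, in percentage points (assumed value).
    """
    totals = defaultdict(float)
    for scheme, _instrument, weight_pct in rows:
        totals[scheme] += weight_pct
    # A total far from 100 suggests missing, duplicated, or hallucinated rows.
    return sorted(
        scheme for scheme, total in totals.items()
        if abs(total - 100.0) > tolerance_pct
    )
```

A reviewer then opens only the flagged schemes and compares their rows against the source page, instead of rereading every document.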

Audience

This session will be useful for:

Data engineers building pipelines from PDFs, Excel files, reports, or semi-structured business documents.
AI engineers working on LLM extraction, OCR, document understanding, evaluation, and cost control.
Analytics and BI teams who need reliable numbers from unstructured sources for dashboards, reports, and presentations.
Founders and product engineers building AI-native data products where plain-English querying must produce predictable, auditable answers.
Teams evaluating when to use vector databases, Elasticsearch, SQL, Parquet, or structured domain models for document-heavy workflows.

Bio

I am Navdeep, co-founder/operator building OrcaSheets, a data product focused on making operational and business data queryable through plain English and structured workflows. Our work sits at the intersection of data engineering, AI extraction, and analytics infrastructure: converting messy real-world inputs such as PDFs, Excel files, APIs, and event streams into reliable data models that teams can query, review, and use in reports or dashboards.

For this project, we worked on converting Indian mutual fund reports and holdings data into structured, queryable tables, dealing with visual PDFs, OCR issues, LLM benchmarking, schema normalization, Parquet storage, and cost-aware production workflows.

Draft Slides Link

From PDFs to SQL: Turning Indian Mutual Fund Reports into Queryable Data

Hosted by

The Fifth Elephant

Jumpstart better data engineering and AI futures

The Fifth Elephant 2026 Annual Conference
