Speak at The Fifth Elephant 2026 Annual Conference
Share your work with the community
Fri, 31 Jul 2026, 09:00 AM – 06:00 PM IST
Navdeep Agarwal
@orcanavdeep
Submitted Jun 25, 2026
Most “ask my documents” systems stop at retrieval: upload PDFs, create embeddings, and search for relevant chunks. That works when the user wants surrounding context, but it starts breaking down when the user needs numbers for a report, dashboard, analysis, or presentation. In Indian mutual fund reports, the useful answers are usually structured facts: scheme returns, AUM, portfolio holdings, sector allocations, risk metrics, benchmark comparisons, and month-on-month changes. For these workflows, the durable output is not a vector. It is a queryable domain model.
This session shares lessons from building a production pipeline that converts Indian mutual fund PDFs, Excel holdings files, and visual financial reports into structured tables. The system renders pages, handles OCR and visual-table ambiguity, runs a two-pass LLM workflow for raw capture and schema normalization, writes Parquet outputs, benchmarks model choices, tracks extraction cost, and exposes the final data through OrcaSheets for plain-English and SQL-style querying. I will cover the parts that worked, and the parts that were painful: dense holdings pages, page chunking, inconsistent AMC formats, hallucinated rows, schema drift, cost control, human review loops, and why converting unstructured data into a known queryable format can be more valuable than repeatedly searching it.
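As a rough illustration of the schema normalization pass described above, the sketch below maps raw captured rows with inconsistent column headers onto one known schema and drops rows that cannot be validated. The alias table, the `Holding` fields, and the `normalize_row` helper are all hypothetical names chosen for this example, not the pipeline's actual schema or code.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical header aliases (assumption); real AMC reports vary far more.
HEADER_ALIASES = {
    "name of the instrument": "instrument",
    "instrument": "instrument",
    "industry": "sector",
    "sector": "sector",
    "market/fair value (rs. in lakhs)": "market_value_lakhs",
    "market value (rs. in lakhs)": "market_value_lakhs",
    "% to net assets": "pct_nav",
    "% to nav": "pct_nav",
}

REQUIRED = ("instrument", "sector", "market_value_lakhs", "pct_nav")


@dataclass(frozen=True)
class Holding:
    """Illustrative target schema for one portfolio holding row."""
    instrument: str
    sector: str
    market_value_lakhs: float
    pct_nav: float


def parse_number(text: str) -> Optional[float]:
    """Parse '12,345.67' or '8.12%' into a float; return None on failure."""
    cleaned = text.replace(",", "").replace("%", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def normalize_row(raw: dict) -> Optional[Holding]:
    """Map one raw captured row to the known schema, or None if invalid."""
    mapped = {}
    for header, value in raw.items():
        key = HEADER_ALIASES.get(header.strip().lower())
        if key is not None:
            mapped[key] = value.strip()
    # Reject rows with missing or empty required fields (e.g. subtotal lines).
    if any(not mapped.get(k) for k in REQUIRED):
        return None
    value = parse_number(mapped["market_value_lakhs"])
    pct = parse_number(mapped["pct_nav"])
    if value is None or pct is None:
        return None
    return Holding(mapped["instrument"], mapped["sector"], value, pct)


# Illustrative input only; the instrument name and figures are made up.
raw_row = {
    "Name of the Instrument": "Example Bank Ltd.",
    "Industry": "Banks",
    "Market/Fair Value (Rs. in Lakhs)": "12,345.67",
    "% to Net Assets": "8.12%",
}
holding = normalize_row(raw_row)
```

Returning `None` rather than guessing at a missing field keeps unvalidated rows out of the structured output, so they can be routed to review instead.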
For numeric insight, structured extraction often beats vector retrieval. Vectors help find context, but SQL-like tables are better for aggregation, comparison, auditability, and repeatable answers.
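To make the aggregation point concrete, here is a small sketch using Python's built-in `sqlite3` module. The `holdings` table layout, the scheme name, and every figure are invented for illustration; the actual pipeline writes Parquet and its schema may differ. The query computes the month-on-month change in sector allocation, which is awkward to answer from retrieved text chunks but a single `GROUP BY` once the data is tabular.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE holdings (
        scheme TEXT, month TEXT, instrument TEXT, sector TEXT, pct_nav REAL
    )
    """
)

# Made-up rows for illustration; not real fund data.
rows = [
    ("Scheme A", "2026-04", "Example Bank Ltd.", "Banks", 8.0),
    ("Scheme A", "2026-04", "Example IT Ltd.", "IT", 5.0),
    ("Scheme A", "2026-05", "Example Bank Ltd.", "Banks", 9.5),
    ("Scheme A", "2026-05", "Example IT Ltd.", "IT", 4.0),
    ("Scheme A", "2026-05", "Sample Finance Ltd.", "Banks", 1.0),
]
conn.executemany("INSERT INTO holdings VALUES (?, ?, ?, ?, ?)", rows)


def sector_change(conn, scheme, prev_month, curr_month):
    """Return (sector, change in % of NAV) between two months for a scheme."""
    query = """
        SELECT sector,
               SUM(CASE WHEN month = :curr THEN pct_nav ELSE 0 END)
             - SUM(CASE WHEN month = :prev THEN pct_nav ELSE 0 END) AS change
        FROM holdings
        WHERE scheme = :scheme AND month IN (:prev, :curr)
        GROUP BY sector
        ORDER BY sector
    """
    params = {"scheme": scheme, "prev": prev_month, "curr": curr_month}
    return conn.execute(query, params).fetchall()


changes = sector_change(conn, "Scheme A", "2026-04", "2026-05")
```

The same question asked twice returns the same rows, and each number can be traced back to the specific holdings that produced it.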
Domain-specific document conversion is hard, but highly rewarding. Once messy reports become a known schema, query cost drops, outputs become more predictable, and humans can review rows instead of reading entire documents again.
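One way row-level review might work in practice is to run cheap consistency checks on each extracted portfolio and surface only the suspicious ones to a human. The sketch below is an assumption about what such checks could look like, not a description of the checks the pipeline runs; `review_flags` and the 0.5 point tolerance are hypothetical.

```python
from collections import Counter


def review_flags(rows, tolerance=0.5):
    """Return human-readable flags for a list of (instrument, pct_nav) rows.

    Assumes the rows cover the full portfolio, including cash and other
    assets, so the weights should total roughly 100.
    """
    flags = []

    # A total far from 100 suggests dropped, duplicated, or invented rows.
    total = sum(pct for _, pct in rows)
    if abs(total - 100.0) > tolerance:
        flags.append(f"weights sum to {total:.2f}, expected about 100")

    # Repeated instruments often come from overlapping page chunks.
    counts = Counter(name for name, _ in rows)
    for name, n in sorted(counts.items()):
        if n > 1:
            flags.append(f"duplicate instrument: {name}")

    # A single weight outside 0..100 is almost certainly a parsing error.
    for name, pct in rows:
        if not 0.0 <= pct <= 100.0:
            flags.append(f"weight out of range: {name}")

    return flags


clean = review_flags([("A Ltd.", 60.0), ("B Ltd.", 40.0)])
suspect = review_flags([("A Ltd.", 60.0), ("A Ltd.", 60.0)])
```

Flags mark a portfolio for review rather than rejecting it outright, since legitimate reports can deviate from these expectations.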
This session will be useful for:
I am Navdeep, co-founder/operator building OrcaSheets, a data product focused on making operational and business data queryable through plain English and structured workflows. Our work sits at the intersection of data engineering, AI extraction, and analytics infrastructure: converting messy real-world inputs such as PDFs, Excel files, APIs, and event streams into reliable data models that teams can query, review, and use in reports or dashboards.
For this project, we worked on converting Indian mutual fund reports and holdings data into structured, queryable tables, dealing with visual PDFs, OCR issues, LLM benchmarking, schema normalization, Parquet storage, and cost-aware production workflows.
From PDFs to SQL: Turning Indian Mutual Fund Reports into Queryable Data