Navdeep Agarwal

@orcanavdeep

From PDF to SQL

Submitted Jun 25, 2026

Fifth Elephant 2026 Submission

Description

Most “ask my documents” systems stop at retrieval: upload PDFs, create embeddings, and search for relevant chunks. That works when the user wants surrounding context, but it starts breaking down when the user needs numbers for a report, dashboard, analysis, or presentation. In Indian mutual fund reports, the useful answers are usually structured facts: scheme returns, AUM, portfolio holdings, sector allocations, risk metrics, benchmark comparisons, and month-on-month changes. For these workflows, the durable output is not a vector. It is a queryable domain model.
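
To make the contrast concrete, here is a minimal sketch of what a "queryable domain model" could look like for portfolio holdings. The table name `holdings`, its columns, and the rows are hypothetical and made up for illustration; they are not the schema used in the production pipeline. It uses Python's standard `sqlite3` module only to stay self-contained.

```python
import sqlite3

# Hypothetical schema and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE holdings (
        scheme       TEXT,
        report_month TEXT,  -- e.g. '2026-05'
        security     TEXT,
        sector       TEXT,
        weight_pct   REAL   -- percent of scheme net assets
    )
    """
)
rows = [
    ("Example Flexi Cap Fund", "2026-05", "Bank A", "Financials", 8.0),
    ("Example Flexi Cap Fund", "2026-05", "Bank B", "Financials", 6.5),
    ("Example Flexi Cap Fund", "2026-05", "IT Co C", "Technology", 5.0),
]
conn.executemany("INSERT INTO holdings VALUES (?, ?, ?, ?, ?)", rows)

# Sector allocation is a plain aggregation once the facts are rows.
sector_allocation = conn.execute(
    """
    SELECT sector, ROUND(SUM(weight_pct), 2) AS total_weight
    FROM holdings
    WHERE scheme = ? AND report_month = ?
    GROUP BY sector
    ORDER BY total_weight DESC
    """,
    ("Example Flexi Cap Fund", "2026-05"),
).fetchall()

for sector, total_weight in sector_allocation:
    print(f"{sector}: {total_weight}%")
```

The same question asked of retrieved text chunks would need the model to find every holding row and add the weights correctly each time; here the database does the arithmetic and the answer is repeatable.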

This session shares lessons from building a production pipeline that converts Indian mutual fund PDFs, Excel holdings files, and visual financial reports into structured tables. The system renders pages, handles OCR and visual-table ambiguity, runs a two-pass LLM workflow for raw capture and schema normalization, writes Parquet outputs, benchmarks model choices, tracks extraction cost, and exposes the final data through OrcaSheets for plain-English and SQL-style querying. I will cover the parts that worked, and the parts that were painful: dense holdings pages, page chunking, inconsistent AMC formats, hallucinated rows, schema drift, cost control, human review loops, and why converting unstructured data into a known queryable format can be more valuable than repeatedly searching it.
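
As a rough illustration of the second pass, the sketch below normalizes raw captured rows into one schema and applies a simple guard against hallucinated or dropped rows. Everything in it is an assumption made for illustration: the header aliases, the `Holding` fields, the sample rows, and the sub-total reconciliation check are hypothetical, and the first pass (the LLM capture itself) is represented only by its output.

```python
from dataclasses import dataclass

# Pass 1 output (assumed shape): raw cells exactly as captured, all strings,
# plus the source page number. These rows are made-up examples.
RAW_ROWS = [
    {"Name of Instrument": "Bank A Ltd.", "Industry": "Banks", "% to NAV": "8.25%", "page": 4},
    {"Name of Instrument": "IT Co B Ltd.", "Industry": "IT - Software", "% to NAV": "5.10", "page": 4},
    {"Name of Instrument": "Sub Total", "Industry": "", "% to NAV": "13.35", "page": 4},
]

# Hypothetical mapping from AMC-specific headers to one canonical schema.
HEADER_ALIASES = {
    "Name of Instrument": "security",
    "Company/Issuer": "security",
    "Industry": "sector",
    "Industry/Rating": "sector",
    "% to NAV": "weight_pct",
    "% of Net Assets": "weight_pct",
}


@dataclass(frozen=True)
class Holding:
    security: str
    sector: str
    weight_pct: float
    page: int  # provenance, so a reviewer can open the source page


def parse_percent(text: str) -> float:
    """Turn '8.25%' or ' 5.10 ' into a float."""
    return float(text.replace("%", "").replace(",", "").strip())


def normalize(raw_rows):
    """Pass 2: rename headers, type the values, split off the sub-total row."""
    holdings, subtotal = [], None
    for raw in raw_rows:
        row = {HEADER_ALIASES.get(key, key): value for key, value in raw.items()}
        if row["security"].strip().lower() in {"sub total", "total"}:
            subtotal = parse_percent(row["weight_pct"])
            continue
        holdings.append(
            Holding(
                security=row["security"].strip(),
                sector=row["sector"].strip(),
                weight_pct=parse_percent(row["weight_pct"]),
                page=row["page"],
            )
        )
    return holdings, subtotal


def weights_reconcile(holdings, subtotal, tolerance=0.05):
    """True when row weights add up to the printed sub-total.

    A missing sub-total returns False so the page goes to human review.
    """
    if subtotal is None:
        return False
    return abs(sum(h.weight_pct for h in holdings) - subtotal) <= tolerance


holdings, subtotal = normalize(RAW_ROWS)
print(len(holdings), "rows; reconciles:", weights_reconcile(holdings, subtotal))
```

Keeping raw capture and normalization separate means a new AMC format usually needs only a new header alias, and an invented or missing row tends to show up as a failed reconciliation rather than a silent error.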

1-2 Takeaways

  1. For numeric insight, structured extraction often beats vector retrieval. Vectors help find context, but SQL-like tables are better for aggregation, comparison, auditability, and repeatable answers.

  2. Domain-specific document conversion is hard, but highly rewarding. Once messy reports become a known schema, query cost drops, outputs become more predictable, and humans can review rows instead of reading entire documents again.
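
As one hedged example of the "comparison" case, the sketch below computes a month-on-month AUM change with a single self-join. The `scheme_aum` table, its columns, and the figures are hypothetical and made up for illustration; `sqlite3` stands in for whatever engine actually serves the extracted tables.

```python
import sqlite3

# Hypothetical table and made-up figures (AUM in crore rupees).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scheme_aum (scheme TEXT, report_month TEXT, aum_cr REAL)")
conn.executemany(
    "INSERT INTO scheme_aum VALUES (?, ?, ?)",
    [
        ("Example Liquid Fund", "2026-04", 1000.0),
        ("Example Liquid Fund", "2026-05", 1100.0),
        ("Example Mid Cap Fund", "2026-04", 500.0),
        ("Example Mid Cap Fund", "2026-05", 450.0),
    ],
)

# Join each scheme's current month to its previous month and diff the values.
MOM_QUERY = """
SELECT cur.scheme,
       cur.aum_cr - prev.aum_cr AS change_cr,
       ROUND(100.0 * (cur.aum_cr - prev.aum_cr) / prev.aum_cr, 2) AS change_pct
FROM scheme_aum AS cur
JOIN scheme_aum AS prev ON prev.scheme = cur.scheme
WHERE cur.report_month = ? AND prev.report_month = ?
ORDER BY cur.scheme
"""
mom_changes = conn.execute(MOM_QUERY, ("2026-05", "2026-04")).fetchall()

for scheme, change_cr, change_pct in mom_changes:
    print(f"{scheme}: {change_cr:+.1f} Cr ({change_pct:+.2f}%)")
```

Because the query is fixed and the inputs are rows, rerunning it gives the same numbers, and a reviewer who doubts a result only has to check the two source rows behind it.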

Audience

This session will be useful for:

  • Data engineers building pipelines from PDFs, Excel files, reports, or semi-structured business documents.
  • AI engineers working on LLM extraction, OCR, document understanding, evaluation, and cost control.
  • Analytics and BI teams who need reliable numbers from unstructured sources for dashboards, reports, and presentations.
  • Founders and product engineers building AI-native data products where plain-English querying must produce predictable, auditable answers.
  • Teams evaluating when to use vector databases, Elasticsearch, SQL, Parquet, or structured domain models for document-heavy workflows.

Bio

I am Navdeep, co-founder/operator building OrcaSheets, a data product focused on making operational and business data queryable through plain English and structured workflows. Our work sits at the intersection of data engineering, AI extraction, and analytics infrastructure: converting messy real-world inputs such as PDFs, Excel files, APIs, and event streams into reliable data models that teams can query, review, and use in reports or dashboards.

For this project, we worked on converting Indian mutual fund reports and holdings data into structured, queryable tables, dealing with visual PDFs, OCR issues, LLM benchmarking, schema normalization, Parquet storage, and cost-aware production workflows.

From PDFs to SQL: Turning Indian Mutual Fund Reports into Queryable Data

