Kusumakar Bodha

@kusumakarb

From Documents to Data: Tiered Extraction at Enterprise Scale

Submitted Jun 25, 2026

Description
With a handful of documents, extraction is trivial: you upload them to ChatGPT and start asking questions. Across a production corpus of millions of documents (images, PDFs, spreadsheets, slide decks, office files) that grows in bursts as each new customer onboards, it becomes a data-engineering problem: turning every kind of messy enterprise file into clean, retrievable data, reliably and affordably. This session is a field report on the pipeline we built to do that, and its central claim is that no single extractor wins. Each reads something the others can’t: lightweight parsers handle digital text and spreadsheets, OCR recovers scans, GPU-backed open-source models on Ray improve layout and table recovery, and multimodal LLMs interpret charts and visual semantics. The tiers differ in cost by orders of magnitude, so running an LLM over every page is economically impossible. Instead of picking one extractor, a complexity analyzer routes each document on two levels, across file types and by complexity within a type, to the cheapest tier that’s good enough. The mix of tiers that documents land in swings entirely with the workload, which is exactly why the strategy can’t be hard-coded.
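To make the two-level routing idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the pipeline described in this talk: the `DocProfile` fields, the file-type sets, and the 0.3 thresholds are all hypothetical stand-ins for whatever signals a real complexity analyzer would sample.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Extraction tiers, ordered from cheapest to most expensive."""
    PARSER = 0      # lightweight CPU parsers: digital text, spreadsheets
    OCR = 1         # OCR for scanned pages
    GPU_MODEL = 2   # open-source layout and table models
    LLM = 3         # multimodal LLM for charts and visual semantics


@dataclass(frozen=True)
class DocProfile:
    """Cheap signals from an upfront sample. All fields are hypothetical."""
    file_type: str         # e.g. "pdf", "xlsx", "png"
    has_text_layer: bool   # True if digital text can be read directly
    table_density: float   # fraction of sampled pages with tables, 0..1
    chart_density: float   # fraction of sampled pages with charts, 0..1


# Assumed groupings; a real system would have a longer, tuned list.
TEXT_NATIVE_TYPES = {"xlsx", "csv", "docx", "txt"}
IMAGE_TYPES = {"png", "jpg", "jpeg", "tiff"}


def route(profile: DocProfile) -> Tier:
    """Pick the cheapest tier that is plausibly good enough."""
    # Level 1: route across file types.
    if profile.file_type in TEXT_NATIVE_TYPES:
        return Tier.PARSER
    # Level 2: route by complexity within a type (thresholds are assumptions).
    if profile.chart_density > 0.3:
        return Tier.LLM
    if profile.table_density > 0.3:
        return Tier.GPU_MODEL
    if profile.file_type in IMAGE_TYPES or not profile.has_text_layer:
        return Tier.OCR
    return Tier.PARSER
```

Keeping the thresholds as data rather than code is one way to avoid hard-coding the strategy: the same function can serve a scan-heavy workload and a spreadsheet-heavy one with different settings.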

The harder half is the system, not the models: making a CPU service, OCR APIs, a GPU pool, and LLM APIs behave as one high-throughput pipeline that absorbs onboarding bursts, isolates failures, and normalizes every output to a single markdown-and-metadata format, so retrieval never has to know which backend produced the text. We close on the part that’s still unsolved. Routing commits a document to one strategy based on a cheap upfront sample that may not represent the whole document, so the real frontier is verifying extraction quality inside the pipeline and re-processing only what failed, at throughput, without re-running everything.
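One way to picture the normalized output and the verify-then-reprocess loop is the sketch below. Everything in it is an assumption made for illustration: `ExtractionResult`, `looks_usable`, and `extract_with_escalation` are hypothetical names, and the quality check is a deliberately crude heuristic standing in for the open problem the talk closes on.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ExtractionResult:
    """Single output shape shared by every backend (hypothetical schema)."""
    doc_id: str
    markdown: str
    metadata: Dict[str, str] = field(default_factory=dict)


# Any backend (parser, OCR, GPU model, LLM) is wrapped to this signature.
Extractor = Callable[[str, bytes], ExtractionResult]


def looks_usable(result: ExtractionResult, min_chars: int = 20) -> bool:
    """Crude stand-in check: enough text, and mostly printable characters."""
    text = result.markdown.strip()
    if not text or len(text) < min_chars:
        return False
    readable = sum(ch.isprintable() or ch.isspace() for ch in text)
    return readable / len(text) > 0.9


def extract_with_escalation(
    doc_id: str,
    payload: bytes,
    tiers: List[Extractor],
    check: Callable[[ExtractionResult], bool] = looks_usable,
) -> ExtractionResult:
    """Try tiers from cheapest to most expensive; stop at the first pass."""
    result: Optional[ExtractionResult] = None
    for level, extractor in enumerate(tiers):
        result = extractor(doc_id, payload)
        result.metadata["tier_index"] = str(level)
        if check(result):
            result.metadata["quality"] = "passed"
            return result
    if result is None:
        raise ValueError("at least one extractor tier is required")
    # Every tier failed the check: keep the last attempt, flagged for review.
    result.metadata["quality"] = "failed"
    return result
```

Because every backend returns the same `ExtractionResult`, downstream retrieval can ignore which tier ran, and only documents that fail the check pay for a more expensive tier.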

Takeaways

  • A practical way to treat document extraction as adaptive, cost- and capability-aware routing — matching each document to the cheapest backend that’s good enough, instead of standardizing on one tool or throwing an LLM at everything.
  • How to run heterogeneous extraction backends (CPU, OCR, GPU, LLM) as one pipeline by normalizing their outputs and absorbing bursty ingestion, and why extraction quality is best treated as an ongoing measurement problem, not a one-shot decision.

Who Should Attend
Data and ML engineers building ingestion or RAG pipelines over heterogeneous enterprise documents; platform and infrastructure engineers running document processing at scale; and anyone whose search, RAG, or document-AI answer quality is bottlenecked by how well the underlying text was extracted.

Bio
Kusumakar Bodha - Platform Lead - Needl.ai

https://docs.google.com/presentation/d/1WpFZKvB8jsNK_4WcQ5Eo8gPWNModiTO_4AZh0BDzdCc/edit?usp=sharing


