From Documents to Data: Tiered Extraction at Enterprise Scale

Fri, Jul 31, 2026, 09:00 AM – 06:00 PM IST

Submitted Jun 25, 2026

I am submitting for: Track 1 - Data engineering & infrastructure
Type of session: 15 mins talk

Description
With a handful of documents, extraction is trivial: you upload them to ChatGPT and start asking questions. Across a production corpus of millions of documents (images, PDFs, spreadsheets, slide decks, office files) that grows in bursts as each new customer onboards, it becomes a data-engineering problem: turning every kind of messy enterprise file into clean, retrievable data, reliably and affordably. This session is a field report on the pipeline we built to do that, and its central claim is that no single extractor wins. Each reads something the others can’t: lightweight parsers handle digital text and spreadsheets, OCR recovers scans, GPU-backed open-source models on Ray improve layout and table recovery, and multimodal LLMs interpret charts and visual semantics. The tiers also differ in cost by orders of magnitude, so running an LLM over every page is not economically viable. Rather than pick one extractor, we use a complexity analyzer that routes each document on two levels, across file types and by complexity within a type, to the cheapest tier that’s good enough. The mix of tiers swings entirely with the workload, which is exactly why the strategy can’t be hard-coded.
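The two-level routing described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual analyzer: the `Sample` signals, the file-type sets, and every threshold are hypothetical, chosen only to show the shape of "route by file type first, then by complexity within a type, to the cheapest tier that is good enough".

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Extraction tiers, ordered from cheapest to most expensive."""
    PARSER = 1     # lightweight parsers: digital text, spreadsheets
    OCR = 2        # OCR for scanned pages
    GPU_MODEL = 3  # GPU-backed open-source layout/table models
    LLM = 4        # multimodal LLM for charts and visual semantics


@dataclass(frozen=True)
class Sample:
    """Hypothetical signals measured on a cheap upfront sample of pages."""
    file_type: str           # e.g. "pdf", "xlsx", "png"
    text_layer_ratio: float  # share of sampled pages with embedded text
    table_density: float     # share of sampled pages containing tables
    figure_density: float    # share of sampled pages containing charts


def route(sample: Sample) -> Tier:
    """Pick the cheapest tier expected to be good enough (assumed rules)."""
    # Level 1: route across file types.
    if sample.file_type in {"xlsx", "csv", "docx", "txt"}:
        return Tier.PARSER
    if sample.file_type in {"png", "jpg", "tiff"}:
        return Tier.LLM if sample.figure_density > 0.5 else Tier.OCR

    # Level 2: route by complexity within a type (PDFs, slide decks).
    # Thresholds are illustrative assumptions, not measured values.
    if sample.figure_density > 0.3:
        return Tier.LLM
    if sample.table_density > 0.3:
        return Tier.GPU_MODEL
    if sample.text_layer_ratio < 0.8:
        return Tier.OCR
    return Tier.PARSER
```

For example, under these assumed thresholds `route(Sample("pdf", 1.0, 0.5, 0.0))` selects `Tier.GPU_MODEL`, while a fully digital PDF with no tables or charts stays on `Tier.PARSER`. Keeping the rules in data-driven form rather than fixed per customer reflects the point that the tier mix shifts with the workload.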

The harder half is the system, not the models: making a CPU service, OCR APIs, a GPU pool, and LLM APIs behave as one high-throughput pipeline that absorbs onboarding bursts, isolates failures, and normalizes every output to a single markdown-and-metadata format, so retrieval never has to know which backend produced the text. We close on the part that’s still unsolved. Routing commits a document to one strategy based on a cheap upfront sample that may not represent it, so the real frontier is verifying extraction quality inside the pipeline and re-processing only what failed, at throughput, without re-running everything.
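As a rough sketch of the two ideas in this paragraph, the snippet below normalizes any backend's output into one markdown-and-metadata record and then flags only the suspicious documents for re-processing. Everything here is an assumption for illustration: the `ExtractedDoc` shape, the `normalize` and `select_failures` helpers, and especially the characters-per-page check, which is a deliberately crude stand-in for the quality verification that the talk describes as unsolved.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedDoc:
    """Hypothetical normalized record shared by every backend."""
    doc_id: str
    markdown: str
    metadata: dict = field(default_factory=dict)


def normalize(doc_id: str, backend: str, raw_text: str, page_count: int) -> ExtractedDoc:
    """Wrap raw backend output in the common format.

    The backend name is kept only as provenance metadata; consumers read
    `markdown` and never branch on which backend produced it.
    """
    lines = (line.rstrip() for line in raw_text.strip().splitlines())
    return ExtractedDoc(
        doc_id=doc_id,
        markdown="\n".join(lines),
        metadata={"backend": backend, "page_count": page_count},
    )


def needs_reprocessing(doc: ExtractedDoc, min_chars_per_page: int = 200) -> bool:
    """Crude assumed heuristic: too little text per page suggests a bad extraction."""
    pages = max(doc.metadata.get("page_count", 1), 1)
    return len(doc.markdown) / pages < min_chars_per_page


def select_failures(docs: list[ExtractedDoc], min_chars_per_page: int = 200) -> list[str]:
    """Return only the IDs that should be retried, leaving the rest untouched."""
    return [d.doc_id for d in docs if needs_reprocessing(d, min_chars_per_page)]
```

A caller might pass the returned IDs to a more expensive tier and leave the rest of the batch as it is, which is the "re-process only what failed" behavior in miniature.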

Takeaways

A practical way to treat document extraction as adaptive, cost- and capability-aware routing: matching each document to the cheapest backend that’s good enough, instead of standardizing on one tool or throwing an LLM at everything.
How to run heterogeneous extraction backends (CPU, OCR, GPU, LLM) as one pipeline that normalizes their outputs and absorbs bursty ingestion, and why extraction quality is best treated as an ongoing measurement problem, not a one-shot decision.

Who Should Attend
Data and ML engineers building ingestion or RAG pipelines over heterogeneous enterprise documents; platform and infrastructure engineers running document processing at scale; and anyone whose search, RAG, or document-AI answer quality is bottlenecked by how well the underlying text was extracted.

Bio
Kusumakar Bodha - Platform Lead - Needl.ai

https://docs.google.com/presentation/d/1WpFZKvB8jsNK_4WcQ5Eo8gPWNModiTO_4AZh0BDzdCc/edit?usp=sharing

Speak at The Fifth Elephant 2026 Annual Conference
