Optimizing Data Ingestion in Apache Pinot

The Problem

You have raw data in object storage — Parquet, JSON, Avro files — and you need to turn it
into queryable, analytics-optimized files. This isn’t just format conversion.

You need to solve:

Organization: Data must be grouped by time and business dimensions so queries skip irrelevant data entirely rather than scanning everything
Query performance: Sort order, indexes, and encodings must be built at ingestion time to reduce query overhead
Exactly-once correctness: No data loss, no duplicates, and updates that become visible to queries atomically
Reliability at scale: Ingest terabytes in hours with minimal infrastructure and without degrading the performance of queries running concurrently
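
To make the "skip irrelevant data" idea concrete, here is a minimal sketch of metadata-based pruning. It assumes each output file records the minimum and maximum timestamp it contains plus one business dimension value. The names SegmentMeta, prune_segments, and the region field are hypothetical illustrations, not Pinot APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentMeta:
    """Hypothetical per-file metadata recorded at ingestion time."""
    name: str
    min_ts: int   # smallest timestamp stored in the file
    max_ts: int   # largest timestamp stored in the file
    region: str   # example business dimension the data was grouped by


def prune_segments(segments, query_start, query_end, region):
    """Keep only files whose time range overlaps the query window
    and whose dimension value matches the query filter."""
    return [
        s for s in segments
        if s.max_ts >= query_start
        and s.min_ts <= query_end
        and s.region == region
    ]


segments = [
    SegmentMeta("s1", 0, 99, "IN"),
    SegmentMeta("s2", 100, 199, "IN"),
    SegmentMeta("s3", 100, 199, "US"),
]
# Only s2 overlaps the window [120, 150] for region "IN".
print([s.name for s in prune_segments(segments, 120, 150, "IN")])  # prints ['s2']
```

The tighter the grouping at ingestion time, the narrower each file's range, and the more files a query can discard from metadata alone without reading them.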

We start with the foundations: how Pinot stores data in columnar segments optimized for analytical queries, with multi-level data pruning, pluggable indexes, and compression. This context is what makes the ingestion design decisions meaningful.

From there, we walk through the adaptive ingestion architecture built on autoscaling Pinot Minion executors. A single framework handles the full spectrum of ingestion requirements:

Format-aware processing: Row-optimized path for JSON/Avro/CSV, column-optimized path for Parquet and Pinot files
Re-partitioning: Reshape data layout at ingestion time to match your query patterns
Efficient sorting: See how we flip the architecture to sort the data
Data transformation: Derive columns, coerce types, and apply business logic before data is visible to queries
Dynamic output sizing: Automatically right-size output segments without manual tuning
Atomic visibility: Ensure partial ingestion is never visible to live queries
Cost and performance trade-offs: Lower your costs by trading memory for disk
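
The atomic visibility requirement can be illustrated with a generic stage-then-publish pattern. This is a sketch under stated assumptions, not Pinot's actual mechanism: the SegmentCatalog class and its methods are hypothetical, and the example only shows why a reader holding a snapshot never observes a half-applied update.

```python
import threading


class SegmentCatalog:
    """Hypothetical catalog of the segment names visible to queries."""

    def __init__(self):
        self._visible = frozenset()
        self._lock = threading.Lock()

    def snapshot(self):
        # Readers get an immutable set; later publishes cannot change it.
        return self._visible

    def publish(self, add, remove=()):
        # Build the complete new set first, then swap the reference in a
        # single assignment so readers see either the old or the new set.
        with self._lock:
            new_visible = (self._visible - frozenset(remove)) | frozenset(add)
            self._visible = new_visible


catalog = SegmentCatalog()
catalog.publish(add=["day1_v1"])
before = catalog.snapshot()

# Re-ingest day 1: the replacement segments appear and the old one
# disappears in one step.
catalog.publish(add=["day1_v2_a", "day1_v2_b"], remove=["day1_v1"])
after = catalog.snapshot()

print(sorted(before))  # prints ['day1_v1']
print(sorted(after))   # prints ['day1_v2_a', 'day1_v2_b']
```

A query that took its snapshot before the second publish keeps working against the old set, while later queries see only the replacement; no query sees a mix of both.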

Takeaways

Learn how data is stored and queried efficiently in an OLAP database
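Learn about the Apache Pinot features behind this design: columnar segments, multi-level pruning, pluggable indexes, compression, and Minion-based ingestion
Learn the practical trade-offs (format choice, sort strategy, partitioning) and how to reason about them for your own workloads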

Speak at Bengaluru Systems meet-up