Migrating Panacea.AI's 5 TB/day Log Platform from Manticore to ClickHouse... and the lessons learnt

Jun 2026

8 Mon

9 Tue

10 Wed

11 Thu

12 Fri 09:00 AM – 06:00 PM IST

13 Sat 09:00 AM – 06:00 PM IST

14 Sun

TERI Auditorium, Bengaluru

Submitted Apr 30, 2026

Session type: 15-minute talk (focused engineering experience)

Description

We sized the storage layer for Nutanix’s Panacea.AI platform (5 TB and 5 billion log lines a day) three different ways and got three answers that span more than an order of magnitude. Same workload, same retention, same ingest rate; the engines disagreed on disk by 37× and on CPU cores by 3×.

| Sizing for 5 TB/day, 30-day retention | Disk | RAM | CPU cores | Compression | Ingest |
| --- | --- | --- | --- | --- | --- |
| Inverted-index, heavy | 750 TB | 24 TB | 1,200+ | 1.5 : 1 | ~25k rows/s |
| Inverted-index, lean | 500 TB | 10 TB | 800 | 1 : 1 | ~50k rows/s |
| Columnar (what we ship) | 20 TB | 4 TB | 400 | 10 : 1 | 870k+ rows/s |
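The headline ratios can be rechecked from the table with plain arithmetic. The sketch below is illustrative only: the variable names are ours, "1,200+" cores is treated as its lower bound of 1,200, and the bytes-per-line figure assumes decimal terabytes.

```python
# Figures copied from the sizing table; everything derived is plain arithmetic.
SECONDS_PER_DAY = 86_400

daily_rows = 5_000_000_000   # 5 billion log lines per day
daily_tb = 5                 # 5 TB of raw logs per day
retention_days = 30

sizing = {
    "inverted_heavy": {"disk_tb": 750, "cpu_cores": 1200},  # "1,200+" taken as 1,200
    "inverted_lean": {"disk_tb": 500, "cpu_cores": 800},
    "columnar": {"disk_tb": 20, "cpu_cores": 400},
}

# Raw volume held at any time under 30-day retention.
raw_retained_tb = daily_tb * retention_days

# Average ingest rate the platform must sustain (peaks will be higher).
avg_rows_per_sec = daily_rows / SECONDS_PER_DAY

# Average raw line size, assuming 1 TB = 10**12 bytes.
avg_bytes_per_line = daily_tb * 10**12 / daily_rows

# Gap between the largest and smallest sizing.
disk_gap = sizing["inverted_heavy"]["disk_tb"] / sizing["columnar"]["disk_tb"]
cpu_gap = sizing["inverted_heavy"]["cpu_cores"] / sizing["columnar"]["cpu_cores"]

print(f"raw data retained: {raw_retained_tb} TB")
print(f"average ingest:    {avg_rows_per_sec:,.0f} rows/s")
print(f"average line size: {avg_bytes_per_line:,.0f} bytes")
print(f"disk gap:          {disk_gap:.1f}x")
print(f"CPU gap:           {cpu_gap:.1f}x")
```

The average of roughly 58k rows/s is the floor any of the three designs has to clear in aggregate; the per-engine ingest column is what determines how many nodes that takes.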

This 15-minute talk is the engineering case for why that gap exists, and what it takes to land on the small end of it in production.

We’ll cover why log analytics is unusually well-suited to a columnar layout — access patterns that are almost always per-bundle and time-windowed, compression headroom on raw log messages, and the operational economics that fall out of those two — and the four schema patterns that did the real work in production:

| Schema pattern | What it does | What it bought us |
| --- | --- | --- |
| `Delta+ZSTD` codec stacking | Stacks delta encoding under ZSTD on log columns | 9.76 TB raw → 480 GB on disk (20.8×); up to 993× on monotonic IDs |
| `LowCardinality(String)` | Dictionary-encodes high-frequency strings (levels, hosts, services) | Smaller marks, faster filters, better cache hit rate |
| `tokenbf_v1` skip indexes | Bloom-filter-based skip indexes on tokenized log text | Replaces full-text indexing for substring search |
| Monthly partitions + `ttl_only_drop_parts=1` | Drops whole parts on TTL instead of mutating | Self-maintaining cluster, no DBA |
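To see why delta encoding ahead of a general-purpose compressor pays off on monotonic columns, here is a minimal sketch. It is not ClickHouse's implementation: `zlib` from the Python standard library stands in for ZSTD, the ID range is invented for illustration, and the helper names are hypothetical. The sketch also assumes non-decreasing values so every delta fits an unsigned 64-bit integer.

```python
import struct
import zlib


def pack_u64(values):
    """Serialize integers as little-endian unsigned 64-bit values."""
    return struct.pack(f"<{len(values)}Q", *values)


def delta_encode(values):
    """Keep the first value, then store each difference to the previous one."""
    deltas = [values[0]]
    deltas.extend(b - a for a, b in zip(values, values[1:]))
    return deltas


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    values, total = [], 0
    for d in deltas:
        total += d
        values.append(total)
    return values


# Hypothetical monotonically increasing IDs (e.g. a sequence number column).
ids = list(range(1_000_000_000, 1_000_000_000 + 100_000))

raw = pack_u64(ids)
plain = zlib.compress(raw, 6)                             # compressor only
stacked = zlib.compress(pack_u64(delta_encode(ids)), 6)   # delta, then compressor

print(f"raw bytes:          {len(raw):,}")
print(f"compressed only:    {len(plain):,}")
print(f"delta + compressed: {len(stacked):,}")
```

After delta encoding, the column is a long run of identical small numbers, which is the easiest possible input for the compressor. Columns without that regularity, such as free-text messages, gain nothing from the delta step, which is why the ratio varies so widely between column types.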

The cluster today holds 154 billion rows across logs, metrics, traces, and AI-generated incident reports, sustains 870k+ inserted rows/s on a single node, and runs without a dedicated DBA. We’ll close with the one search-latency trade-off we accepted to land here, what it cost, what it didn’t, and the framework we now use to re-evaluate it quarterly.

Key Takeaways

Engine choice is a sizing decision, not a feature decision. The same 5 TB/day workload sizes at 750 TB, 500 TB, or 20 TB depending on the storage model; the sizing table in the description is the artifact you actually defend in a design review.
Log analytics fits a columnar engine because of its access patterns, compression headroom, and operational economics; four production schema patterns (Delta+ZSTD, LowCardinality, tokenbf_v1, partition-aligned TTLs) delivered the 37× disk reduction.
The trade-off we accepted: substring search latency moved from inverted-index speed to bloom-filter speed on a query class used in <15% of sessions. We’ll show the query mix that made the call defensible.
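The shape of that trade-off can be sketched with a toy token Bloom filter kept per block of log lines. This is an illustration of the general technique, not the `tokenbf_v1` implementation: the class, the tokenizer (split on non-alphanumeric ASCII), the filter size, the hash count, and the sample log lines are all assumptions made for the example. Note that a token filter of this kind answers whole-token membership only.

```python
import hashlib
import re


class TokenBloom:
    """Tiny Bloom filter over tokens; a Python int serves as the bit set."""

    def __init__(self, size_bits=2048, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, token):
        # Derive num_hashes bit positions by prefixing the token with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{token}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.size_bits

    def add(self, token):
        for pos in self._positions(token):
            self.bits |= 1 << pos

    def might_contain(self, token):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(token))


def tokenize(line):
    """Split on anything that is not an ASCII letter or digit."""
    return re.findall(r"[A-Za-z0-9]+", line)


def build_filters(blocks):
    filters = []
    for block in blocks:
        bloom = TokenBloom()
        for line in block:
            for token in tokenize(line):
                bloom.add(token)
        filters.append(bloom)
    return filters


def search(blocks, filters, token):
    """Scan only the blocks whose filter cannot rule the token out."""
    hits, scanned = [], 0
    for block, bloom in zip(blocks, filters):
        if not bloom.might_contain(token):
            continue  # block skipped without reading its lines
        scanned += 1
        hits.extend(line for line in block if token in tokenize(line))
    return hits, scanned


# Hypothetical log lines grouped into three blocks.
blocks = [
    ["INFO cluster health check passed", "INFO snapshot created"],
    ["ERROR disk timeout on node 7", "WARN retrying request"],
    ["INFO upgrade finished", "DEBUG cache warmed"],
]
filters = build_filters(blocks)
hits, scanned = search(blocks, filters, "timeout")
print(hits, f"(scanned {scanned} of {len(blocks)} blocks)")
```

An inverted index jumps straight to the matching rows; the filter only rules blocks out, so every surviving block, including any false positive, is still scanned. That residual scan is the latency cost described above, paid in exchange for a far smaller index.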

Target Audience

SREs, platform engineers, and DBAs operating high-volume log telemetry. Engineering managers and architects evaluating storage engines for petabyte-class observability, log search, or AI workloads. Anyone running an inverted-index backend today who suspects the access patterns of their workload have outgrown the model they started with.

Bio

Sohham Seal, SDE-2 at Nutanix on the Panacea.AI platform; works on the columnar ingestion and query layer that powers AI-driven incident triage across Nutanix’s customer fleet. His recent AI-related work includes GNN-based recommendation systems, biometric security, and EMG-based gesture prediction (IEEE TIFS, PCEMS Best Paper). Will present in Bangalore.

Mohit Gurnani — SDE-4 at Nutanix; architects and leads Panacea.AI, an agentic AI platform processing 20 PB+ of observability data annually across 29,000+ enterprise clusters — ClickHouse for high-cardinality analytics, Kubernetes-native event-driven pipelines, and LangGraph agents on top. IEEE-published author; previously presented at ICDMAI 2017. Designed the migration described in this talk. Will join Q&A virtually from Columbus, OH.

Hosted by

Rootconf

We care about site reliability, cloud costs, security and data privacy

Topical Edition on Databases
