Speak at The Fifth Elephant 2026 Annual Conference
Share your work with the community
17 Jul 2026, Fri 09:00 AM – 06:00 PM IST
18 Jul 2026, Sat 09:00 AM – 06:00 PM IST
Pushpendra Singh Chauhan
Submitted Jun 13, 2026
Most data platforms don’t fail because the tables are wrong — they fail because nobody can find the table, nobody knows who owns it, and every consumer is hard-wired to a physical GCS/blob path that breaks the moment a bucket or partition layout changes. At InMobi, our data landscape grew organically into exactly this: file-and-path datasets with near-zero discoverability, access tightly coupled to storage conventions, governance applied to only a small fraction of datasets, and a mix of Delta and raw Parquet that multiplied cognitive overhead. On top of that, multiple business units — DSP and SSP — ran independent lakehouses, catalogs, and data-engineering teams, so the same problems existed several times over.
We addressed this with a Table-First Architecture: the logical table becomes the only interface anyone touches. Consumers query catalog.namespace.table and never reason about file paths, partition schemes, or storage classes again. The platform is built on Apache Iceberg as the table format, Apache Polaris as a centralized Iceberg REST catalog, GCS for storage, and OpenMetadata as the discovery-and-governance meta-catalog — with compute (Spark on Kubernetes and Databricks, Trino) cleanly decoupled from storage and catalog so each layer can scale and evolve independently.
The interesting part is not the target architecture — it’s getting there on a live, multi-petabyte, multi-BU platform without a big-bang migration and without ever risking production. This talk is the field report of that journey: how we register existing files into Iceberg in place (no data copy), keep source and Iceberg consistent during a long read-only phase, then perform a deliberate, reversible, operator-gated cutover so the Spark job itself becomes the single writer to the table. We’ll walk through the adoption model end to end and the operating model — Sync, Alerting & Monitoring (AMS), and Reconciliation — that makes a migration of this size observable, recoverable, and safe enough to run during business hours.
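The adoption flow described above can be pictured as a small state machine. The sketch below is illustrative only and is not InMobi's implementation: the names `Phase`, `Writer`, `Dataset`, `cut_over`, `roll_back`, and `declare_write_enabled` are hypothetical, and it assumes rollback is allowed only during the watch window. It shows an operator-gated cutover that swaps the single writer from the Sync task to the Spark job and can be reversed.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    READ_ONLY = "read_only"          # Iceberg table is fed from source by Sync
    WATCH = "watch"                  # job writes Iceberg directly; rollback still open
    WRITE_ENABLED = "write_enabled"  # cutover declared complete


class Writer(Enum):
    SYNC = "sync"            # post-write Sync task mirrors source into Iceberg
    SPARK_JOB = "spark_job"  # the producing job writes the table itself


@dataclass
class Dataset:
    name: str
    phase: Phase = Phase.READ_ONLY
    writer: Writer = Writer.SYNC  # a single field, so there is never more than one writer
    history: list = field(default_factory=list)

    def cut_over(self, approved_by_operator: bool) -> None:
        """Hand the table from Sync to the Spark job; requires explicit approval."""
        if not approved_by_operator:
            raise PermissionError("cutover requires operator approval")
        if self.phase is not Phase.READ_ONLY:
            raise ValueError(f"cannot cut over from {self.phase.name}")
        self._move(Phase.WATCH, Writer.SPARK_JOB)

    def roll_back(self) -> None:
        """Return to the read-only phase (assumed possible only while watching)."""
        if self.phase is not Phase.WATCH:
            raise ValueError(f"cannot roll back from {self.phase.name}")
        self._move(Phase.READ_ONLY, Writer.SYNC)

    def declare_write_enabled(self) -> None:
        """Close the watch window and make the cutover final."""
        if self.phase is not Phase.WATCH:
            raise ValueError(f"cannot finalize from {self.phase.name}")
        self._move(Phase.WRITE_ENABLED, Writer.SPARK_JOB)

    def _move(self, phase: Phase, writer: Writer) -> None:
        # Record the transition, then switch phase and writer together.
        self.history.append((self.phase, phase))
        self.phase, self.writer = phase, writer


if __name__ == "__main__":
    ds = Dataset("example_dataset")
    ds.cut_over(approved_by_operator=True)
    print(ds.phase, ds.writer)
```

Keeping the writer as one field that changes only together with the phase is one simple way to make "exactly one writer per table" hold by construction rather than by convention.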
A logical-table abstraction as the single interface across many BUs. How decoupling compute, storage, catalog, and meta-catalog lets you put one consistent catalog.namespace.table interface in front of heterogeneous sources (Delta, Parquet, file-based) — and why “abstraction over implementation” and “governance by design” are architectural decisions, not documentation exercises.
A reversible, incremental adoption pattern instead of a big-bang cutover. Register files in place to onboard read-only Iceberg, feed it from source via a post-write Sync task, monitor freshness against SLA, then cut over to direct Iceberg writes one dataset at a time — with the “exactly one writer per table” invariant, a real rollback path (source pipeline kept intact + Iceberg snapshot rollback), and a watch window before a dataset is declared write-enabled.
The operating model that makes a live migration safe. The three pillars we’d build first if we did it again: Sync for source↔Iceberg consistency under a ≤15-min SLA, AMS for a single source of truth on onboarding progress and freshness alerting, and Reconciliation for gap detection/backfill and lifecycle management — plus an explicit Platform-vs-BU RACI so cutover and rollback are BU-approved, Platform-executed.
Hard-won lessons running Delta + Parquet + file-based sources under one Iceberg catalog. Where file registration beats data copying, where Apache XTable earns its place for Delta-to-Iceberg conversion, the “delete from Iceberg before the source” ordering rule that prevents 404s during retention, and the developer-experience tooling (a sampled catalogue for minutes-not-hours local validation) that turned cutovers from scary into routine.
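The "delete from Iceberg before the source" rule can be illustrated with a toy model. This is a sketch under stated assumptions, not the platform's retention code: table metadata and object storage are modeled as plain Python sets of paths, and the function names are hypothetical. It assumes that, because files are registered in place, the table references the same objects the source pipeline owns, so a path that is still referenced but already deleted is what a reader would hit as a 404.

```python
def dangling_references(table_refs: set[str], storage: set[str]) -> set[str]:
    """Paths the table still points at that no longer exist in storage."""
    return table_refs - storage


def expire(paths: set[str], table_refs: set[str], storage: set[str]) -> None:
    """Retire files in the safe order: table metadata first, objects second."""
    # Step 1: stop referencing the files, so no new query plan can include them.
    table_refs.difference_update(paths)
    assert not dangling_references(table_refs, storage)
    # Step 2: only now delete the underlying objects.
    storage.difference_update(paths)
    assert not dangling_references(table_refs, storage)


def expire_wrong_order(
    paths: set[str], table_refs: set[str], storage: set[str]
) -> set[str]:
    """Source-first deletion; returns the references left dangling in between."""
    storage.difference_update(paths)
    broken = dangling_references(table_refs, storage)
    table_refs.difference_update(paths)
    return broken


if __name__ == "__main__":
    files = {"d=1/a.parquet", "d=2/b.parquet"}
    refs, store = set(files), set(files)
    expire({"d=1/a.parquet"}, refs, store)
    print(sorted(refs), sorted(store))
```

In the safe order there is no moment at which the table references a missing object. In the reversed order, every query planned between the two steps can fail on the paths returned by `expire_wrong_order`.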
Data Engineers, Data Platform Engineers, Data Architects, and Engineering Leaders who are building — or are about to migrate toward — lakehouse and catalog platforms. It will be most valuable to teams facing a large, fragmented, multi-team data estate they need to move to Apache Iceberg + a REST catalog without downtime, and to anyone working with Iceberg, Polaris, Trino, Spark, OpenMetadata, and Airflow. Attendees don’t need prior Iceberg experience, but should be comfortable with the basic mechanics of a data lake.
Pushpendra is a Staff Data Engineer on InMobi’s Center of Excellence (CoE) Data Platform team, leading the Table-First Architecture initiative across InMobi’s business units. He works at the intersection of platform engineering, data governance, and technical leadership: building InMobi’s Iceberg/Polaris-based lakehouse, driving cross-BU adoption, and mentoring the engineers who run it in production.
Saksham Ratra is a Senior Data Engineer on InMobi’s Center of Excellence (CoE) Data Platform team, where he focuses on building scalable data infrastructure and improving data discoverability across the organization. He has been instrumental in implementing the Table-First Architecture: onboarding multiple datasets into the new Iceberg-based lakehouse, designing and implementing a sophisticated alerting and monitoring system, and ensuring smooth cutovers with minimal disruption to production workloads.