The Missing Half of AI Data Assistants: A REPL for Pipelines

Fri, 31 Jul 2026, 09:00 AM – 06:00 PM IST

Submitted Jun 25, 2026

I am submitting for: Track 2 - Building & implementing AI tools & agents in production
Type of session: 30 mins talk

How Nexus-AI closes the loop on AI-generated pipelines — running the real job and proving the output is correct.

Description

Give the same language model to two engineers and you get two very different outcomes. The software engineer drops it into their editor and flies: a type checker and a unit test confirm every change in seconds. The data engineer tries the same thing, and it’s quietly terrifying. The model is the same; the difference is the feedback loop. In data engineering there isn’t a fast one. A single change means a PR, a built jar or image, a repointed DAG, and 40 minutes of waiting on Airflow to tell you it failed. The failures that slip through are worse, because they’re silent: an AI swaps two columns in a UNION ALL, the code compiles, the row counts match, and every row in production lands wrong, with your name on the approval.

Today’s data copilots (dbt Copilot, Databricks Genie, Snowflake Cortex) are genuinely good at generation. But they verify, if at all, by running queries against your warehouse. Every iteration is a billed scan over hundreds of gigabytes, and agents iterate constantly. Before long, you are paying your warehouse to watch an AI guess.

AI didn’t make data engineering dangerous or expensive; the missing loop did. Nexus-AI, built entirely on open source (Iceberg, dbt, Spark, Polaris, OpenMetadata), rebuilds that loop to be fast, local, and cheap.
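To make the silent UNION ALL failure concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (orders_2025, billing_city, and so on) are hypothetical and chosen only for illustration; they are not taken from any real pipeline. UNION ALL matches columns by position rather than by name, so the swapped query runs without error and returns the same number of rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_2025 (order_id INTEGER, billing_city TEXT, shipping_city TEXT);
    CREATE TABLE orders_2026 (order_id INTEGER, billing_city TEXT, shipping_city TEXT);
    INSERT INTO orders_2025 VALUES (1, 'Pune', 'Delhi'), (2, 'Chennai', 'Mumbai');
    INSERT INTO orders_2026 VALUES (3, 'Kochi', 'Jaipur');
""")

CORRECT = """
    SELECT order_id, billing_city, shipping_city FROM orders_2025
    UNION ALL
    SELECT order_id, billing_city, shipping_city FROM orders_2026
"""

# Hypothetical bad edit: the second branch lists the two city columns in the
# wrong order. Both are TEXT, so nothing fails at parse time or run time.
SWAPPED = """
    SELECT order_id, billing_city, shipping_city FROM orders_2025
    UNION ALL
    SELECT order_id, shipping_city, billing_city FROM orders_2026
"""

correct_rows = conn.execute(CORRECT).fetchall()
swapped_rows = conn.execute(SWAPPED).fetchall()

# A row-count check cannot tell the two queries apart; a content check can.
same_row_count = len(correct_rows) == len(swapped_rows)
same_content = sorted(correct_rows) == sorted(swapped_rows)
print(same_row_count, same_content)  # prints: True False
```

A check that only counts rows passes here, which is why the verification step has to look at the values themselves.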

The centerpiece is Pinbox, a REPL for data pipelines. It samples a terabyte input down to a few megabytes, provisions the exact runtime in a hermetic container, and runs your real, unmodified PySpark / Scala / dbt job on your laptop in seconds, with no cluster and no warehouse bill. The interesting part is what you do with the output, and there are two modes.

Assert: you declare the rules the output must hold (row counts, null rates, a custom SQL predicate like “revenue is never negative”) and the run passes or fails against them, even for a brand-new pipeline with nothing to compare to.

Compare: you run a baseline and a candidate, and diff them against criteria you define: tolerant schema, key-based join or full multiset, your own per-column tolerance.

Sampling is partition- and key-aware, and both runs hit the same slice, so a pass means the candidate matches the baseline on that data; anything that depends on full-data distribution is flagged for a separate full-scale check, not assumed safe. The whole thing runs on one rule: AI orchestrates, scripts enforce. The model generates freely but never gets to decide what “correct” means. We’ve pushed it hardest on migration, moving 400+ production Spark pipelines to Iceberg + dbt and cutting each migration from a week to an afternoon.
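The Assert and Compare modes described above can be sketched in a few lines. This is not Pinbox's API: the function names run_assert and run_compare, the rule format, and the tables baseline_out and candidate_out are all assumptions made for illustration, and sqlite3 stands in for whatever engine actually holds the job output.

```python
import sqlite3


def run_assert(conn, table, rules):
    """Assert mode (sketch): each rule is SQL returning a violation count; non-zero fails."""
    failures = []
    for name, violation_sql in rules.items():
        (violations,) = conn.execute(violation_sql.format(table=table)).fetchone()
        if violations:
            failures.append(f"{name}: {violations} violating row(s)")
    return failures


def run_compare(conn, baseline, candidate, key, tolerances):
    """Compare mode (sketch): key-based join with a per-column absolute tolerance."""
    diffs = []
    for column, tol in tolerances.items():
        sql = (
            f"SELECT b.{key}, b.{column}, c.{column} "
            f"FROM {baseline} AS b JOIN {candidate} AS c ON b.{key} = c.{key} "
            f"WHERE ABS(b.{column} - c.{column}) > ? ORDER BY b.{key}"
        )
        for key_value, old, new in conn.execute(sql, (tol,)):
            diffs.append((key_value, column, old, new))
    return diffs


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE baseline_out (order_id INTEGER, revenue REAL);
    CREATE TABLE candidate_out (order_id INTEGER, revenue REAL);
    INSERT INTO baseline_out VALUES (1, 100.0), (2, 250.5), (3, 80.0);
    INSERT INTO candidate_out VALUES (1, 100.0), (2, 250.504), (3, -80.0);
""")

# Hypothetical rules: a row count, a null check, and a custom SQL predicate.
RULES = {
    "row_count_is_3": "SELECT COUNT(*) != 3 FROM {table}",
    "order_id_not_null": "SELECT COUNT(*) FROM {table} WHERE order_id IS NULL",
    "revenue_never_negative": "SELECT COUNT(*) FROM {table} WHERE revenue < 0",
}

assert_failures = run_assert(conn, "candidate_out", RULES)
compare_diffs = run_compare(
    conn, "baseline_out", "candidate_out", "order_id", {"revenue": 0.01}
)
print(assert_failures)  # ['revenue_never_negative: 1 violating row(s)']
print(compare_diffs)    # [(3, 'revenue', 80.0, -80.0)]
```

The rules and tolerances live in plain data and SQL, so the pass or fail decision is made by a script rather than by the model. For brevity the sketch does not report keys that exist on only one side; a fuller version would check for those as well.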

Takeaways

Move the verification loop off the warehouse. Today’s copilots verify AI output by running queries on production compute, so the cost scales with how often the agent iterates — and agents iterate constantly. Run the loop on a sampled local slice instead, and verification cost collapses toward zero.
Let the AI generate freely, but never let it be the judge. Trust comes from a deterministic check that proves the output is correct before anything ships — not from a better prompt. The proof is the product.
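
One common way to get a sampled local slice that stays consistent across tables and across runs is to hash the join key and keep only the keys that fall in a fixed bucket range. The sketch below assumes that approach; it is not a description of how Pinbox samples, and the names in_sample, sample_rows, the salt value, and the orders and customers data are hypothetical.

```python
import hashlib


def in_sample(key, rate_per_thousand, salt="sampling-sketch"):
    """Deterministically decide whether a key belongs to the sample."""
    digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 1000
    return bucket < rate_per_thousand


def sample_rows(rows, key_field, rate_per_thousand):
    """Keep every row whose key is in the sample, so related rows stay together."""
    return [row for row in rows if in_sample(row[key_field], rate_per_thousand)]


# Hypothetical inputs: 500 customers with exactly 10 orders each.
customers = [{"customer_id": i} for i in range(500)]
orders = [{"order_id": i, "customer_id": i % 500} for i in range(5000)]

customers_slice = sample_rows(customers, "customer_id", 50)
orders_slice = sample_rows(orders, "customer_id", 50)

# Because both tables are sampled on the same key, joins on the slice
# never produce orders whose customer was sampled away.
sampled_ids = {c["customer_id"] for c in customers_slice}
orphans = [o for o in orders_slice if o["customer_id"] not in sampled_ids]
print(len(orphans))  # prints: 0
```

Since the decision depends only on the key and the salt, a baseline run and a candidate run see the same slice, which is what makes a diff between them meaningful. Checks that depend on the full-data distribution are outside what a slice like this can show.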

Audiences

Data and analytics engineers now accountable for reviewing and signing off on AI-generated transformation code they didn’t write.
Software and platform engineers on data teams who have a tight inner loop in application code and want the same for pipelines.

Bio

Purushotham Pururava Pushpavanth is an SDE-4 (Data Platform) at InMobi, where he heads Nexus-AI (formerly Data DevX), an AI data-engineering assistant with a laptop-local verification loop (Pinbox). An open-source contributor to Apache Hudi, Debezium, and NiFi, he has spent a decade building large-scale data and stream-processing platforms, and now works at the intersection of lakehouse infrastructure and agentic tooling.

https://docs.google.com/presentation/d/15LXrHgyTXGs3tm8CAzOEJD_wYVCjDr6L/edit?usp=sharing&ouid=105771402084879963156&rtpof=true&sd=true

Speak at The Fifth Elephant 2026 Annual Conference