Speak at The Fifth Elephant 2025 Annual Conference
Chintan Betrabet
@chintanbuber
Submitted May 31, 2025
Migrating data and compute from on-premises to the public cloud is a complex undertaking. The complexity grows significantly at the scale of hundreds of petabytes of data and half a million workloads that power business intelligence and maintain a competitive advantage. Migrating data workloads and storage to the cloud has been a multi-year initiative at Uber, during which we have been operating in a hybrid environment spanning the cloud and on-premises.
The Uber batch data platform is used by thousands of engineers, analysts, and city operations teams across the globe to power batch and real-time data processing. It is built on open-source technologies such as Presto, Apache Spark, Pinot, Flink, and Kafka, alongside customised in-house solutions, for instance a workflow orchestrator similar to Apache Airflow and experimentation notebooks similar to Jupyter.
This talk will discuss the key challenges we faced in migrating our batch data stack to the cloud and cover the tooling we built to orchestrate such a large-scale migration. Even the smallest issue in data correctness can have catastrophic business impact, so we will also highlight how we help guarantee data correctness before and after migration.
We will also talk about some of the tradeoffs we made, such as when to replicate data to ensure high availability and when to read data across the network from a single primary source.
Even before moving any data or compute resources, we built tooling to identify candidate workloads and datasets incrementally while ensuring minimal disruption to users. Given the scale and blast radius of any issue in the migration, we built robust automation to detect problems during, or just after, a workload's migration and to perform automated rollbacks, as sketched below. Throughout the migration, we used a combination of replication across cloud and on-prem, and remote data access, to ensure data availability for consumers.
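To illustrate the guardrail pattern, here is a minimal sketch in Python. Everything in it is hypothetical: `HealthCheck`, `validate_or_rollback`, and the rollback callback stand in for the actual validation (row counts, checksums) and repointing machinery the talk will describe.

```python
import time
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class HealthCheck:
    name: str
    passes: Callable[[str], bool]   # e.g. a row-count or checksum parity check

def validate_or_rollback(workload: str,
                         checks: List[HealthCheck],
                         rollback: Callable[[str], None],
                         retries: int = 3) -> bool:
    """Validate a freshly migrated workload; roll back on repeated failure."""
    for attempt in range(retries):
        failing = [c.name for c in checks if not c.passes(workload)]
        if not failing:
            return True                    # workload looks healthy in the cloud
        time.sleep(30 * (attempt + 1))     # back off; transient lag may settle
    rollback(workload)                     # repoint the workload back on-prem
    return False
```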
We also built abstraction layers that act as intelligent proxies for storage and compute client calls, routing them to on-premises or cloud depending on where the data is available and ready for consumption. This makes any data movement to the cloud performed by platform teams transparent to users.
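As a rough illustration of the routing idea (not Uber's actual proxy), consider the sketch below; the `catalog` object and its `ready_locations` lookup are assumptions standing in for whatever metadata store knows where each dataset is consumption-ready.

```python
from enum import Enum

class Location(Enum):
    ON_PREM = "on_prem"
    CLOUD = "cloud"

class StorageProxy:
    """Routes client reads to wherever the dataset is available and ready."""

    def __init__(self, catalog, on_prem_client, cloud_client):
        self.catalog = catalog   # hypothetical: maps dataset -> ready locations
        self.clients = {Location.ON_PREM: on_prem_client,
                        Location.CLOUD: cloud_client}

    def read(self, dataset: str, path: str):
        # Prefer the cloud copy once it is marked consumption-ready;
        # otherwise fall back to the on-prem primary. Callers never change.
        ready = self.catalog.ready_locations(dataset)
        target = Location.CLOUD if Location.CLOUD in ready else Location.ON_PREM
        return self.clients[target].read(path)
```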
To decouple storage and compute migration, we enabled incremental replication of data across on-prem and cloud, allowing consumer workloads to run in either environment. This introduced problems of eventual consistency and potential data corruption due to conflicting writes made by replication and by scheduled pipeline writers (running on Uber's custom workflow orchestrator built on Apache Airflow). We will talk in detail about how we tackled these challenges using a centralized observability service that tracks the availability and consistency of data across all its copies.
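Here is a minimal sketch of the bookkeeping such a service might perform, assuming hypothetical per-copy metadata (a watermark plus a checksum) reported by replicators and pipeline writers; the names are illustrative, not the actual service's API.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class CopyState:
    watermark: int   # latest committed partition/version for this copy
    checksum: str    # content fingerprint at that watermark

class ConsistencyTracker:
    """Tracks availability and consistency of a dataset across all copies."""

    def __init__(self):
        self.copies: Dict[str, Dict[str, CopyState]] = {}  # dataset -> region -> state

    def report(self, dataset: str, region: str, state: CopyState) -> None:
        self.copies.setdefault(dataset, {})[region] = state

    def is_consistent(self, dataset: str) -> bool:
        states = list(self.copies.get(dataset, {}).values())
        if len(states) < 2:
            return True  # a single copy is trivially consistent
        head = max(s.watermark for s in states)
        # Copies agree only if every region has caught up to the head
        # watermark with an identical checksum.
        return (all(s.watermark == head for s in states)
                and len({s.checksum for s in states}) == 1)
```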
Mind map (draft slides: https://docs.google.com/presentation/d/1wZA_N60_Gt1vTkKPYhcoNJ4pNM9WdoywTuVBHmLLiZ0/edit?slide=id.g361fb61cc07_2_100#slide=id.g361fb61cc07_2_100)
- The on-premises batch data stack, compared with the setup during and after the migration to the cloud.
- The complexity involved in the migration: scale, complex data usage patterns, constraints on data replication, and the availability of duplicated compute and storage resources.
- How migration candidates are selected, the tooling and automation used to monitor migrations, and the guardrails in place to roll a migration back if data workflows behave unexpectedly.
- The role and responsibilities of the service acting as the source of truth (SOT), which has been at the heart of the migration, covering all the integration points needed to give it near-real-time data on the consistency and availability of data produced in the midst of an ongoing migration.
- Eventual consistency of source data can lead consumers to read data before it is completely replicated from its primary region. This can cascade: empty or incomplete data gets processed and flows down to business intelligence systems, which may end up making incorrect decisions. We will cover how we re-schedule such data reads in data workflows, as sketched below.
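For illustration, a minimal sketch of such a deferred read, assuming a hypothetical `tracker.watermark()` API (in the spirit of the consistency tracker sketched earlier) that exposes the replicated watermark per region:

```python
class ReplicaNotReady(Exception):
    """Raised so the orchestrator re-schedules the task instead of failing it."""

def guarded_read(tracker, dataset: str, region: str,
                 required_watermark: int, read_fn):
    """Defer a read until the regional replica has fully caught up."""
    current = tracker.watermark(dataset, region)  # hypothetical tracker API
    if current < required_watermark:
        # Re-schedule rather than silently processing partial data that
        # would cascade into downstream BI systems.
        raise ReplicaNotReady(
            f"{dataset}@{region} at {current}, need {required_watermark}")
    return read_fn(dataset, region)
```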
Finally, we will cover the key tradeoffs we made, such as replicating data for high availability versus reading it across the network from a single primary source, and conclude with some key learnings from the two-year migration journey so far.