Speak at The Fifth Elephant 2025 Annual Conference
Chintan Betrabet
@chintanbuber
Submitted May 31, 2025
Migrating data and compute from on-premises to the public cloud is a complex undertaking. The complexity grows significantly at the scale of hundreds of petabytes of data and half a million workloads that power business intelligence and maintain a competitive advantage. Migrating data workloads and storage to the cloud has been a multi-year initiative at Uber, during which we have been operating in a hybrid environment spanning the cloud and on-premises.
The Uber batch data platform is used by thousands of engineers, analysts, and city operations teams across the globe to power batch and real-time data processing. It is built on open-source technologies such as Presto, Apache Spark, Pinot, Flink, and Kafka, alongside customised in-house solutions, for instance a workflow orchestrator similar to Apache Airflow and experimentation notebooks similar to Jupyter.
This talk will discuss the key challenges we faced in migrating our batch data stack to the cloud and cover the tooling we built to orchestrate such a large-scale migration. Even the smallest issue in data correctness can have catastrophic business impact, so we will also highlight how we help guarantee data correctness before and after migration.
We will also talk about some of the tradeoffs we made, such as when to replicate data to ensure high availability and when to read data across the network from a single primary source.
Even before moving any data or compute resources, we built tooling to identify candidate workloads and datasets incrementally while ensuring minimal disruption to users. Given the scale and blast radius of any issue in the migration, we built robust automation to detect problems during, or just after, a workload's migration and to perform automated rollbacks, as sketched below. Throughout the migration, we used a combination of replication across cloud and on-prem, and remote data access, to ensure data availability for consumers.
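To illustrate the guardrail pattern, here is a minimal sketch in Python. Everything in it is hypothetical: `HealthCheck`, `validate_or_rollback`, and the rollback callback stand in for the actual validation (row counts, checksums) and repointing machinery the talk will describe.

```python
import time
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class HealthCheck:
    name: str
    passes: Callable[[str], bool]   # e.g. a row-count or checksum parity check

def validate_or_rollback(workload: str,
                         checks: List[HealthCheck],
                         rollback: Callable[[str], None],
                         retries: int = 3) -> bool:
    """Validate a freshly migrated workload; roll back on repeated failure."""
    for attempt in range(retries):
        failing = [c.name for c in checks if not c.passes(workload)]
        if not failing:
            return True                    # workload looks healthy in the cloud
        time.sleep(30 * (attempt + 1))     # back off; transient lag may settle
    rollback(workload)                     # repoint the workload back on-prem
    return False
```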
We also built abstraction layers that act as intelligent proxies for storage and compute client calls, routing them to on-premises or cloud depending on where the data is available and ready for consumption. This makes any data movement to the cloud performed by platform teams transparent to users.
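As a rough illustration of the routing idea (not Uber's actual proxy), consider the sketch below; the `catalog` object and its `ready_locations` lookup are assumptions standing in for whatever metadata store knows where each dataset is consumption-ready.

```python
from enum import Enum

class Location(Enum):
    ON_PREM = "on_prem"
    CLOUD = "cloud"

class StorageProxy:
    """Routes client reads to wherever the dataset is available and ready."""

    def __init__(self, catalog, on_prem_client, cloud_client):
        self.catalog = catalog   # hypothetical: maps dataset -> ready locations
        self.clients = {Location.ON_PREM: on_prem_client,
                        Location.CLOUD: cloud_client}

    def read(self, dataset: str, path: str):
        # Prefer the cloud copy once it is marked consumption-ready;
        # otherwise fall back to the on-prem primary. Callers never change.
        ready = self.catalog.ready_locations(dataset)
        target = Location.CLOUD if Location.CLOUD in ready else Location.ON_PREM
        return self.clients[target].read(path)
```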
To decouple storage and compute migration, we enabled incremental replication of data across on-prem and cloud, allowing consumer workloads to run in either environment. This introduced problems of eventual consistency and potential data corruption due to conflicting writes made by replication and by scheduled pipeline writers (running on Uber's custom workflow orchestrator built on Apache Airflow). We will talk in detail about how we tackled these challenges using a centralized observability service that tracks the availability and consistency of data across all its copies.
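Here is a minimal sketch of the bookkeeping such a service might perform, assuming hypothetical per-copy metadata (a watermark plus a checksum) reported by replicators and pipeline writers; the names are illustrative, not the actual service's API.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class CopyState:
    watermark: int   # latest committed partition/version for this copy
    checksum: str    # content fingerprint at that watermark

class ConsistencyTracker:
    """Tracks availability and consistency of a dataset across all copies."""

    def __init__(self):
        self.copies: Dict[str, Dict[str, CopyState]] = {}  # dataset -> region -> state

    def report(self, dataset: str, region: str, state: CopyState) -> None:
        self.copies.setdefault(dataset, {})[region] = state

    def is_consistent(self, dataset: str) -> bool:
        states = list(self.copies.get(dataset, {}).values())
        if len(states) < 2:
            return True  # a single copy is trivially consistent
        head = max(s.watermark for s in states)
        # Copies agree only if every region has caught up to the head
        # watermark with an identical checksum.
        return (all(s.watermark == head for s in states)
                and len({s.checksum for s in states}) == 1)
```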
Mind map (draft slides: https://docs.google.com/presentation/d/1wZA_N60_Gt1vTkKPYhcoNJ4pNM9WdoywTuVBHmLLiZ0/edit?slide=id.g361fb61cc07_2_100#slide=id.g361fb61cc07_2_100)
- The on-premises batch data stack, compared with the setup during and after the migration to the cloud.
- The complexity involved in the migration: scale, complex data usage patterns, constraints on data replication, and the availability of duplicated compute and storage resources.
- How migration candidates are selected, the tooling and automation used to monitor migrations, and the guardrails in place to roll a migration back if data workflows behave unexpectedly.
- The role and responsibilities of the service acting as the source of truth (SOT), which has been at the heart of the migration, covering all the integration points needed to give it near-real-time data on the consistency and availability of data produced in the midst of an ongoing migration.
- Eventual consistency of source data can lead consumers to read data before it is completely replicated from its primary region. This can cascade: empty or incomplete data gets processed and flows down to business intelligence systems, which may end up making incorrect decisions. We will cover how we re-schedule such data reads in data workflows, as sketched below.
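For illustration, a minimal sketch of such a deferred read, assuming a hypothetical `tracker.watermark()` API (in the spirit of the consistency tracker sketched earlier) that exposes the replicated watermark per region:

```python
class ReplicaNotReady(Exception):
    """Raised so the orchestrator re-schedules the task instead of failing it."""

def guarded_read(tracker, dataset: str, region: str,
                 required_watermark: int, read_fn):
    """Defer a read until the regional replica has fully caught up."""
    current = tracker.watermark(dataset, region)  # hypothetical tracker API
    if current < required_watermark:
        # Re-schedule rather than silently processing partial data that
        # would cascade into downstream BI systems.
        raise ReplicaNotReady(
            f"{dataset}@{region} at {current}, need {required_watermark}")
    return read_fn(dataset, region)
```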
Finally, we will cover the key tradeoffs we made, such as replicating data for high availability versus reading it across the network from a single primary source, and conclude with some key learnings from the two-year migration journey so far.