Expressing complex ETL pipelines using Cascading
Submitted by Neha Kumari (@neha-kumari) on Saturday, 31 March 2018
At Flipkart, data is one of our key differentiators and is used in innumerable ways for decision making. In particular, for generating recommendations, our data pipelines perform various ETL operations over terabytes of user activity data.
To begin with, raw MapReduce gave us granular control over our pipelines, but it required a lot of boilerplate code for the joins and aggregations that constitute the building blocks of our ETL flows.
Cascading is an abstraction over MapReduce that provides a higher-level API for data-processing workflows. It is used to create and execute complex data-processing workflows on a Hadoop cluster, hiding the underlying complexity of MapReduce jobs. A few benefits of Cascading over raw MR:
a) Faster iterations
b) Reusable components
c) Instrumentation as a first-class citizen
d) Elegant expression of ETL DAGs
e) Testability and robustness
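To make the abstraction concrete, here is a minimal sketch of a Cascading pipe assembly in the style of the Cascading 2.x Java API, counting activity events per item. The field names, HDFS paths, and class name are hypothetical, and running it assumes a Hadoop cluster with the Cascading jars on the classpath; it is an illustration of the API shape, not code from the talk:

```java
import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class ItemCounts {
    public static void main(String[] args) {
        // Taps say where the data lives; Schemes say how it is serialized.
        // Paths and field names here are illustrative placeholders.
        Tap source = new Hfs(new TextDelimited(new Fields("userId", "itemId"), "\t"),
                             "hdfs:///data/user_activity");
        Tap sink = new Hfs(new TextDelimited(new Fields("itemId", "count"), "\t"),
                           "hdfs:///data/item_counts");

        // The pipe assembly is the logical DAG: group by item, count events.
        Pipe pipe = new Pipe("itemCounts");
        pipe = new GroupBy(pipe, new Fields("itemId"));
        pipe = new Every(pipe, new Count(new Fields("count")));

        // The FlowConnector plans this assembly into one or more MapReduce jobs.
        Flow flow = new HadoopFlowConnector().connect(source, sink, pipe);
        flow.complete();
    }
}
```

The equivalent raw MapReduce version would need a mapper, a reducer, and job-configuration boilerplate; here the aggregation is two lines of DAG construction.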
In this talk I will be covering:
a) Learnings from migrating from raw MR to Cascading
b) How Cascading stacks up against other workflow orchestrators
c) Achieving clear segregation between I/O adapters, ETL operations and business logic
d) Some lesser-known aspects of Cascading
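One way to read the segregation point above: because a Cascading pipe assembly is an ordinary Java object graph, business logic can be written as a function from Pipe to Pipe, with Taps (the I/O adapters) bound only at the edges of the flow. A minimal sketch, with hypothetical names; this is one plausible structuring, not necessarily the one presented in the talk:

```java
import cascading.operation.aggregator.Count;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

public class Assemblies {
    // Pure business logic: no Taps, no HDFS paths, no Hadoop configuration.
    // An assembly like this can be exercised against in-memory sources
    // (e.g. via a local-mode FlowConnector) independently of production I/O.
    public static Pipe countByItem(Pipe activity) {
        Pipe counts = new GroupBy(activity, new Fields("itemId"));
        return new Every(counts, new Count(new Fields("count")));
    }
}
```

Binding the same assembly to different source and sink Taps then lets one pipeline definition serve tests, backfills, and production runs.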
Neha is a Software Developer with the Recommendations team at Flipkart. Previously, she worked at Finomena, a startup in the fintech domain. She has experience designing large-scale data processing and ETL pipelines. She is a blockchain enthusiast and an avid blogger. She graduated from IIT (BHU), Varanasi.