Expressing complex ETL pipelines using Cascading

26 Jul 2018 (Thu), 07:45 AM – 06:15 PM IST

27 Jul 2018 (Fri), 07:45 AM – 05:35 PM IST

NIMHANS Convention Centre, Bengaluru

Submitted Mar 31, 2018

Section: Crisp talk
Technical level: Beginner

At Flipkart, data is one of our key differentiators and informs decision making in many ways. In particular, to generate recommendations, our data pipelines perform various ETL operations over terabytes of user activity data.

Initially, raw MapReduce gave us granular control over our pipelines, but it required a lot of boilerplate code for the joins and aggregations that are the building blocks of our ETL flows.

Cascading is an abstraction over MapReduce that provides a higher-level API for data-processing workflows. It is used to create and execute complex data-processing workflows on a Hadoop cluster, hiding the underlying complexity of MapReduce jobs. Some benefits of Cascading over raw MR:
a) faster iterations
b) reusable components
c) instrumentation as a first-class citizen
d) elegant expression of the ETL DAG
e) testability and robustness
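
To make the "reusable components" idea concrete, here is a minimal sketch of a join followed by an aggregation, written as two small steps that can be composed and tested independently. It is plain Java using only the JDK, not Cascading's API. The class name, method names, and the sample user/product data are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java, not Cascading's API. All names are hypothetical.
class JoinAggregateSketch {

    // Reusable step 1: inner-join (userId, productId) events against a
    // productId -> category lookup, emitting (userId, category) pairs.
    static List<String[]> joinWithCategory(List<String[]> events,
                                           Map<String, String> categoryByProduct) {
        List<String[]> joined = new ArrayList<>();
        for (String[] event : events) {
            String category = categoryByProduct.get(event[1]);
            if (category != null) { // inner join: drop events with no matching product
                joined.add(new String[] {event[0], category});
            }
        }
        return joined;
    }

    // Reusable step 2: group by (userId, category) and count rows per group.
    static Map<String, Integer> countByUserAndCategory(List<String[]> joined) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys for stable output
        for (String[] row : joined) {
            counts.merge(row[0] + "|" + row[1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String[]> events = Arrays.asList(
            new String[] {"u1", "p1"},
            new String[] {"u1", "p2"},
            new String[] {"u2", "p1"},
            new String[] {"u2", "p9"}); // p9 has no category, so the join drops it
        Map<String, String> categoryByProduct = new HashMap<>();
        categoryByProduct.put("p1", "books");
        categoryByProduct.put("p2", "books");

        // Compose the two steps into a tiny pipeline and print the counts.
        System.out.println(countByUserAndCategory(joinWithCategory(events, categoryByProduct)));
    }
}
```

Because each step takes and returns ordinary collections, it can be unit-tested on a handful of rows without a cluster, which is the kind of testability the list above refers to.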

Outline

In this talk I will cover:
a) Learnings from migrating from raw MR to Cascading
b) How Cascading stacks up against other workflow orchestrators
c) Achieving clear segregation between I/O adapters, ETL operations and business logic
d) Some lesser-known aspects of Cascading
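
The segregation described in point (c) can be sketched roughly as follows. This is plain Java using only the JDK, not Cascading's API, and it is not taken from the talk. The class, the method names, and the "userId,eventType" row format are hypothetical. The idea is that the source and sink know only about I/O, the filter operation knows only how to filter, and a single predicate holds the business rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Illustrative only: plain Java, not Cascading's API. All names are hypothetical.
class LayeredPipelineSketch {

    // Generic ETL operation: knows how to filter, nothing about I/O or business rules.
    static Function<List<String>, List<String>> filter(Predicate<String> keep) {
        return rows -> {
            List<String> out = new ArrayList<>();
            for (String row : rows) {
                if (keep.test(row)) {
                    out.add(row);
                }
            }
            return out;
        };
    }

    // Business logic: the only place that knows what a "purchase" row looks like.
    static boolean isPurchase(String row) {
        return row.endsWith(",purchase");
    }

    // Wiring: reads from any source, applies the operation, writes to any sink.
    static void run(Supplier<List<String>> source,
                    Function<List<String>, List<String>> operation,
                    Consumer<List<String>> sink) {
        sink.accept(operation.apply(source.get()));
    }

    public static void main(String[] args) {
        // I/O adapters: in-memory stand-ins for whatever storage a real pipeline uses.
        Supplier<List<String>> source =
            () -> Arrays.asList("u1,view", "u1,purchase", "u2,view");
        List<String> written = new ArrayList<>();
        Consumer<List<String>> sink = written::addAll;

        run(source, filter(LayeredPipelineSketch::isPurchase), sink);
        System.out.println(written);
    }
}
```

Swapping the in-memory source and sink for real adapters would leave the operation and the business rule untouched, which is the point of keeping the three layers apart.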

Speaker bio

Neha is a Software Developer with the Recommendation team at Flipkart. Previously, she worked at Finomena, a startup in the fintech domain. She has experience in designing large-scale data processing and ETL pipelines. She is a blockchain enthusiast and an avid blogger. She graduated from IIT BHU, Varanasi.

Slides

https://docs.google.com/presentation/d/1S4FPYMNbrXYzhhGMgQJxAw_RhLpT86PpfFANFXN54Ok/edit?usp=sharing

The Fifth Elephant 2018