The Fifth Elephant 2018

The seventh edition of India's best data conference

Neha Kumari

@neha_kumari

Expressing complex ETL pipelines using Cascading

Submitted Mar 31, 2018

At Flipkart, data is one of our key differentiators and is used in innumerable ways for decision making. In particular, for generating recommendations, our data pipelines perform various ETL operations over terabytes of user activity data.

To begin with, raw MapReduce gave us granular control over our pipelines, but it required a lot of boilerplate code for the joins and aggregations that constitute the building blocks of our ETL flows.

Cascading is an abstraction over MapReduce that provides a higher-level API for data-processing workflows. It is used to create and execute complex data-processing workflows on a Hadoop cluster while hiding the underlying complexity of the MapReduce jobs. A few benefits of Cascading over raw MapReduce:
a) faster iterations
b) reusable components
c) instrumentation as a first-class citizen
d) elegant expression of the ETL DAG
e) testability and robustness
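To make the contrast with raw MapReduce concrete, here is a minimal sketch of a Cascading-style (2.x API) pipe assembly that groups user-activity records and counts events per user. The field names, HDFS paths, and class name are hypothetical illustrations, not part of our actual pipelines, and the exact imports depend on the Cascading version in use:

```
// Sketch only: requires the Cascading 2.x and Hadoop jars on the classpath.
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class EventCounts {
  public static void main(String[] args) {
    // I/O adapters ("taps"): where data is read from and written to
    Tap source = new Hfs(new TextDelimited(new Fields("user", "event"), "\t"),
        "/data/user_activity");
    Tap sink = new Hfs(new TextDelimited(new Fields("user", "count"), "\t"),
        "/data/event_counts");

    // The pipe assembly: the ETL DAG, expressed independently of the taps
    Pipe pipe = new Pipe("eventCounts");
    pipe = new GroupBy(pipe, new Fields("user"));           // group records by user
    pipe = new Every(pipe, new Count(new Fields("count"))); // count events per group

    // Bind taps to the assembly and run it as one or more MapReduce jobs
    FlowDef flowDef = FlowDef.flowDef()
        .addSource(pipe, source)
        .addTailSink(pipe, sink);
    new HadoopFlowConnector().connect(flowDef).complete();
  }
}
```

In raw MapReduce, the same group-and-count would need a mapper, a reducer, and job-configuration boilerplate; here the taps, the DAG, and the operations are separate, reusable pieces.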

Outline

In this talk I will cover:
a) Learnings from migrating from raw MapReduce to Cascading
b) How Cascading stacks up against other workflow orchestrators
c) Achieving clear segregation between I/O adapters, ETL operations and business logic
d) Some lesser-known aspects of Cascading

Speaker bio

Neha is a Software Developer with the Recommendations team at Flipkart. Previously, she worked at Finomena, a fintech startup. She has experience designing large-scale data-processing and ETL pipelines. She is a blockchain enthusiast and an avid blogger. She graduated from IIT (BHU), Varanasi.

Slides

https://docs.google.com/presentation/d/1S4FPYMNbrXYzhhGMgQJxAw_RhLpT86PpfFANFXN54Ok/edit?usp=sharing

