Building Complex Data Workflows with Cascading on Hadoop
Submitted by Gagan Agrawal (@gagana24) on Saturday, 13 June 2015
Section: Full Talk
Technical level: Intermediate
Understand how to build complex data workflow pipelines with Cascading on Hadoop that take inputs from different sources and push crunched data to different sinks.
Big data processing often requires reading data from multiple sources, such as log files, databases, NoSQL stores, and external services, then applying transformations and defining complex workflow pipelines to extract useful insight from it. Writing these steps directly in Hadoop's MapReduce is a non-trivial job that takes a lot of effort and expertise. High-level languages like Pig Latin and HiveQL make writing MapReduce jobs easier, but complex logic in these languages requires custom functions written in Java, which makes testing and debugging difficult.
This is where Cascading makes a developer's life easier. Everything is written in Java, yet the API offers the ease of a high-level, SQL-like language for expressing MapReduce jobs. Once the logic has been written, it can easily be tested by running it in stand-alone mode or in JUnit test cases, since everything is in Java. Cascading's APIs are also agnostic to Hadoop or any other execution framework, so workflows written in Cascading can run on multiple frameworks without any code change, as long as a Cascading connector is available for them.
In this session, I will introduce the Cascading framework and its features, and discuss some real-world use cases where complex workflows can be developed easily in Cascading. Below is the agenda of the talk.
- What is Cascading
- Building complex data flows in Cascading
- Testing with Cascading
- Multiple examples to demonstrate ease of writing complex workflows
- Real-world use case
- Advantages / Disadvantages
Gagan Agarwal is a Sr. Principal Engineer at Snapdeal and currently heads the Personalization and Recommendation team there. He has close to 10 years of experience in the software industry and has worked in domains such as e-commerce, digital advertising, e-governance, document and content management, customer communication management, and media buy management. Gagan has developed challenging software ranging from multi-tiered web applications with millions of users to batch processing of multi-terabyte data. Apart from his expertise in Java/JEE technologies, Gagan has worked for the past several years with Big Data technologies like Hadoop, Spark, Cascading, Pig, Hive, Sqoop, Oozie, and Kafka, and with NoSQL stores like HBase, Cassandra, Aerospike, MongoDB, and Neo4j. Gagan is a seasoned speaker and has spoken at several technology conferences on topics ranging from Big Data processing and NoSQL stores (key-value, graph-based, column-oriented) to functional programming languages.