The Fifth Elephant 2017

On data engineering and application of ML in diverse domains

Out of the Stone Age: Why investing in developer tools is necessary for big data development to scale

Submitted by Shankar Manian (@shanm) on Saturday, 29 April 2017

Technical level

Intermediate

Section

Full talk for data engineering track

Status

Submitted

Abstract

Do you wish Hadoop development was as easy as any other application development? Do you wish we had comprehensive, well-integrated tools for Hadoop development?
At LinkedIn, we have thousands of nodes spread across multiple clusters, thousands of active users who use those clusters on an ongoing basis, and hundreds of flows that run on a regular schedule to power the data on our site. We needed a development ecosystem that spans the entire development lifecycle and can handle this scale. From authoring and testing Hadoop/Spark jobs to debugging and monitoring them in production, we are making development of Hadoop jobs easier and more intuitive.
In this talk, we will discuss the motivation behind this approach and the solution we are building. We are leveraging successful technologies like Samza, Kafka, the Hadoop DSL and Dr. Elephant, as well as building brand new tools for testing, monitoring and debugging.

Outline

I will start by describing the state of Hadoop development at LinkedIn a few years ago and the challenges we faced as our operational scale increased massively. I will then compare the development ecosystem that exists for Hadoop, both at LinkedIn and in the wider community, with the ecosystem that exists for other areas such as the development of online services. I will draw attention to the key drawbacks that we identified and talk about the unique challenges of Hadoop/Spark development. I will cover each area of the software development lifecycle (SDLC) in terms of its challenges and the solutions we came up with. I will highlight how we leveraged existing technologies where available and filled the gaps with new ones. I will also talk about how these tools are tightly integrated with each other through well-defined interfaces and components. In conclusion, I will talk about the potential benefits of such a system and our plans for taking it further.

Speaker bio

Shankar has 17+ years of experience building distributed systems and productivity tools. He started out building a highly successful distributed test automation system for Windows and Bing at Microsoft. He then spent 8 years helping build a middle-tier platform that powered most of the online services that formed the backbone of Bing and Microsoft Ads. He currently leads the grid productivity team in Bangalore, empowering Hadoop developers at LinkedIn to be more productive with their time and cluster resources.

Slides

https://docs.google.com/presentation/d/1yyZbVXP-oaYFaNkx6Ia__In_Vifj-UVlodgfQ_XUpQ8/edit?usp=sharing

Preview video

https://www.youtube.com/watch?v=Vttx6hedC2E

Comments

  • 1
    Zainab Bawa (@zainabbawa) Reviewer a year ago

    Share draft slides, detailing the content you will cover + two-min preview video explaining what this talk is about and why participants should attend.
