Out of the Stone Age: Why investing in developer tools is necessary for big data development to scale
Submitted by Shankar Manian (@shanm) on Saturday, 29 April 2017
Full talk for data engineering track
Do you wish Hadoop development were as easy as any other application development? Do you wish we had comprehensive, well-integrated tools for Hadoop development?
At LinkedIn, we have thousands of nodes spread across multiple clusters. Thousands of active users use the clusters on an ongoing basis, and hundreds of flows run on a regular schedule to power the data on our site. We needed a development ecosystem that spans the entire development lifecycle and can handle this scale. From authoring and testing Hadoop/Spark jobs to debugging and monitoring them in production, we are making the development of Hadoop jobs easier and more intuitive.
In this talk, we will discuss the motivation behind this approach and the solution we are building. We are leveraging successful technologies such as Samza, Kafka, the Hadoop DSL, and Dr. Elephant, as well as building brand new tools for testing, monitoring, and debugging.
I will start by describing the state of Hadoop development at LinkedIn a few years ago and the challenges we faced as our operational scale increased massively. I will then compare the development ecosystem that exists for Hadoop, both at LinkedIn and in the wider community, with the ecosystem that exists for other areas such as the development of online services. I will draw attention to the key drawbacks we identified and to the unique challenges of Hadoop/Spark development. I will walk through each area of the SDLC, covering its challenges and the solutions we came up with. I will highlight how we leveraged existing technologies where available and filled the gaps with new ones, and how these tools are tightly integrated with each other through well-defined interfaces and components. In conclusion, I will talk about the potential benefits of such a system and our plans for taking it further.
Shankar has 17+ years of experience building distributed systems and productivity tools. He started out building a highly successful distributed test automation system for Windows and Bing at Microsoft. He then spent 8 years helping build a middle-tier platform that powered most of the online services that formed the backbone of Bing and Microsoft Ads. He currently leads the grid productivity team in Bangalore, empowering Hadoop developers at LinkedIn to be more productive with their time and cluster resources.