Managing Data on Hadoop
Submitted by prashant singh (@prashantkr2002) on Wednesday, 6 June 2012
Section: Big Data Infrastructure & Processing
Technical level: Intermediate
Session type: Lecture
This talk describes an approach to managing high-volume data movement on Hadoop so that the data is available for processing at Yahoo!. As part of grid data management, we load terabytes of data onto Hadoop clusters every day and replicate it to BCP clusters. In this tech talk, we want to share our experiences, challenges, and techniques for high-volume data movement on HDFS.
It is crucial for web applications to mine the data generated in different logs to extract relevant information and trends, both for research and development projects and for a growing number of production processes across Yahoo!. This lecture will focus on the challenges we face in managing large volumes of data movement across Hadoop clusters within strict SLAs, and in prioritizing data flows by their importance at Yahoo!.
Knowledge of Hadoop
Prashant K Singh works at Yahoo! as a Principal Engineer and handles data management and Hadoop operations. As part of this team, he manages around 20 Hadoop clusters comprising ~40K nodes and holding 300+ PB of data, with a total cluster capacity of ~1 exabyte.
Prior to Yahoo!, Prashant worked at MakeMyTrip, where he was responsible for bringing data center activities in-house and for migrating the web portal from a Windows platform to an open source platform, making it more stable and better able to handle large amounts of user traffic.
Abhishek Dan manages the Hadoop service engineering team at Yahoo!, which is responsible for Hadoop cluster management and data management on Hadoop clusters.