The Fifth Elephant 2012

Finding the elephant in the data.

What are your users doing on your website or in your store? How do you turn the piles of data your organization generates into actionable information? Where do you get complementary data to make yours more comprehensive? What tech, and what techniques?

The Fifth Elephant is a two-day conference on big data.

Early Geek tickets are available from

The proposal funnel below will enable you to submit a session and vote on proposed sessions. It is good practice to introduce yourself and share details about your work, as well as the subject of your talk, when proposing a session.

Each community member can vote for or against a talk. A vote from each member of the Editorial Panel is equivalent to two community votes. Both types of votes will be considered for final speaker selection.

It’s useful to keep a few guidelines in mind while submitting proposals:

  1. Describe how to use something that is available under a liberal open source license. Participants can use this without having to pay you anything.

  2. Tell a story of how you did something. If it involves commercial tools, please explain why they made sense.

  3. Buy a slot to pitch whatever commercial tool you are backing.

Speakers will get a free ticket to both days of the event. Proposers whose talks are not on the final schedule will be able to purchase tickets at the Early Geek price of Rs. 1800.

Hosted by

All about data science and machine learning

prashant singh


Managing Data on Hadoop

Submitted Jun 6, 2012

This talk describes an approach to managing high-volume data movement on Hadoop, making the data available for processing at Yahoo!. As part of grid data management, we load terabytes of data onto Hadoop clusters daily and replicate it to BCP clusters. In this tech talk, we want to share our experiences, challenges, and techniques for high-volume data movement on HDFS.


It is crucial for web applications to mine data generated from different logs to extract relevant information and trends, both for research and development projects and for a growing number of production processes across Yahoo!.
This talk will focus on the challenges we face at Yahoo! in managing large volumes of data movement across Hadoop clusters within strict SLAs, and in prioritizing data flows based on their importance.


Knowledge of Hadoop

Speaker bio

Prashant K Singh works at Yahoo! as a Principal Engineer and handles data management and Hadoop operations. As part of this team, he manages around 20 Hadoop clusters comprising ~40K nodes and 300+ PB of data, with a total cluster capacity of ~1 exabyte.

Prior to Yahoo!, Prashant worked at MakeMyTrip, where he was responsible for bringing data center activities in house and migrating the web portal from a Windows platform to an open source platform, making it more stable and better able to handle large amounts of user traffic.

Abhishek Dan manages the Hadoop service engineering team at Yahoo!, which is responsible for Hadoop cluster management and data management on Hadoop clusters.

