The Fifth Elephant 2012

Finding the elephant in the data.

What are your users doing on your website or in your store? How do you turn the piles of data your organization generates into actionable information? Where do you get complementary data to make yours more comprehensive? What tech, and what techniques?

The Fifth Elephant is a two-day conference on big data.

Early Geek tickets are available from fifthelephant.doattend.com.

The proposal funnel below will enable you to submit a session and vote on proposed sessions. It is good practice to introduce yourself and share details about your work, as well as the subject of your talk, while proposing a session.

Each community member can vote for or against a talk. A vote from each member of the Editorial Panel is equivalent to two community votes. Both types of votes will be considered for final speaker selection.

It’s useful to keep a few guidelines in mind while submitting proposals:

  1. Describe how to use something that is available under a liberal open source license. Participants can use this without having to pay you anything.

  2. Tell a story of how you did something. If it involves commercial tools, please explain why they made sense.

  3. Buy a slot to pitch whatever commercial tool you are backing.

Speakers will get a free ticket to both days of the event. Proposers whose talks are not on the final schedule will be able to purchase tickets at the Early Geek price of Rs. 1800.

Hosted by

The Fifth Elephant - known as one of the best data science and Machine Learning conferences in Asia - has transitioned into a year-round forum for conversations about data and ML engineering; data science in production; data security and privacy practices.

Piyush Goel

@pigol1

Making sense out of semi-structured log data

Submitted Jun 29, 2012

A discussion of how to use big-data systems to analyze log data in real time.

Outline

Any complex OLTP system comprises many distributed modules running on separate nodes. The failure of a single module, or a bug in any one of them, can cause a request to fail completely or to respond with incorrect data. Debug and error level logs for each module help developers troubleshoot such issues. But for a large-scale distributed system, collecting the log files from all nodes and analyzing them is a cumbersome task because of the unstructured or semi-structured nature of the log entries.

To address the above challenge, the platforms team at Capillary Technologies has developed a Log Processing Framework which helps us understand and analyze all the logs in a single place. The framework has been built using Flume, MongoDB, Hive, Hadoop and custom components, which together give us the capability to analyze large amounts of log data using MapReduce jobs. The framework enables us to search for the exact causes of failures and to extract performance metrics from large amounts of raw text, using a SQL-like interface in near real time.
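To make the idea concrete, here is a minimal sketch in plain Python of the kind of processing described above: turning semi-structured log lines into structured records, then counting errors per module with a map step and a reduce step. It is an illustration only and not the Capillary framework. The log line format, the regular expression, the module names and every function name here are assumptions made for the example; a real deployment would run the equivalent logic as Hive queries or MapReduce jobs over data collected by Flume.

```python
import re
from collections import Counter

# Hypothetical log format (an assumption for this sketch):
#   "<date> <time> <LEVEL> [<module>] <free-text message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+ \S+) (?P<level>[A-Z]+) "
    r"\[(?P<module>[^\]]+)\] (?P<message>.*)$"
)


def parse_line(line):
    """Turn one raw log line into a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def map_errors(records):
    """Map step: emit (module, 1) for every ERROR-level record."""
    for record in records:
        if record["level"] == "ERROR":
            yield record["module"], 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per module."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_errors_by_module(lines):
    """Parse raw lines, skip the ones that do not match, and count errors."""
    records = (r for r in map(parse_line, lines) if r is not None)
    return reduce_counts(map_errors(records))


# Made-up sample data, including one line that does not fit the format.
SAMPLE_LINES = [
    "2012-06-29 10:15:01 ERROR [billing] timeout calling loyalty service",
    "2012-06-29 10:15:02 DEBUG [billing] retrying request id=42",
    "2012-06-29 10:15:03 ERROR [inventory] null response for sku=981",
    "2012-06-29 10:15:04 ERROR [billing] timeout calling loyalty service",
    "malformed line without the expected fields",
]

if __name__ == "__main__":
    print(count_errors_by_module(SAMPLE_LINES))
```

Once lines are parsed into named fields such as `level` and `module`, the same aggregation could be expressed as a SQL-like `GROUP BY` query, which is the style of interface the outline mentions. Lines that do not match the expected pattern are skipped here; a production system would more likely route them to a separate store for inspection.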

This talk will focus on the design decisions that were made and the challenges that we encountered while building this framework.

Requirements

Basic understanding of MongoDB, Hive/Hadoop.

Speaker bio

Pravanjan Choudhury is an Architect with the Platforms & Applications group at Capillary Technologies. He has more than 9 years of experience in building mission-critical, large-scale software and cloud-based products. Previously, he was an Architect at Minekey, a Silicon Valley start-up, and in the past he has done consulting for several research projects of National Semiconductor.

Piyush Goel is a Team Lead with the Platforms group at Capillary Technologies. Before joining Capillary, he worked at Yahoo! Bangalore as a Senior Software Engineer with the Emerging Markets team. He holds a BTech and an MTech in Computer Science from the Indian Institute of Technology, Kharagpur.

