The Fifth Elephant 2016

India's most renowned data science conference

Akash Mishra

@sleepythread

Don’t just build a data lake, build a data powerhouse.

Submitted Jun 13, 2016

Companies increasingly want to be data-oriented and to base their decisions on data.

The first step towards data-oriented decisions is collecting data. “Data lake” has become one of the recent buzzwords in the Big Data industry. Most companies start by building a data lake that will contain all their data. In practice, dumping data into a data lake usually means exporting everything from various RDBMS databases [e.g. Orders, Inventory] and scraping all the logs into the lake. Once all the relevant data is in the data lake, various processing applications are written to extract information from the source data. This approach has many problems [e.g. huge upfront cost, missing information that is not currently tracked, etc.].

In this talk I will propose another approach to building a data-driven system: instead of dumping all the data into a central location, we identify the events/interactions/facts [e.g. an Add to Cart event, viewing a product, etc.] in the company and store them for processing. I will explain how this approach is more result-oriented and more agile than the dumping approach.
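As a rough illustration of what storing events/interactions/facts could look like, here is a minimal Python sketch. It is not taken from the talk: the `Event` and `EventLog` names, the field layout, and the event names `add_to_cart` and `product_viewed` are all assumptions made for the example, and the in-memory list stands in for whatever durable store a real system would use.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Event:
    """One immutable fact about something that happened (hypothetical schema)."""
    name: str       # e.g. "add_to_cart"; the naming scheme is an assumption
    actor_id: str   # who or what triggered the event
    properties: Dict[str, Any] = field(default_factory=dict)
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class EventLog:
    """Append-only, in-memory stand-in for a durable event store."""

    def __init__(self) -> None:
        self._lines: List[str] = []

    def append(self, event: Event) -> None:
        # Each event is serialized as one JSON line; existing lines are never edited.
        self._lines.append(json.dumps(asdict(event), sort_keys=True))

    def read(self, name: Optional[str] = None) -> List[Dict[str, Any]]:
        # Return all events, or only those with the given name.
        events = [json.loads(line) for line in self._lines]
        if name is None:
            return events
        return [e for e in events if e["name"] == name]


log = EventLog()
log.append(Event("product_viewed", "user-1", {"product_id": "p-42"}))
log.append(Event("add_to_cart", "user-1", {"product_id": "p-42", "quantity": 2}))
log.append(Event("product_viewed", "user-2", {"product_id": "p-7"}))

views = log.read("product_viewed")
```

The point of the sketch is that each processing job reads only the named events it needs, for example `log.read("product_viewed")`, rather than extracting them from a bulk dump of source tables and logs.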

Outline

  • Data Lake, the traditional way:
    • Current architectures for building a data lake.
    • Problems associated with the approach.
    • Real-life example.
  • What are events/interactions/facts?
    • Explaining terminology.
    • Reason to track them.
    • Defining
      • Business events
      • Developer events
      • Monitoring events
  • Use Case Driven Development:
  • Proposed Architecture:
  • Benefits of the proposed architecture:
    • Business stakeholders
    • Developers
    • Monitoring
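The outline distinguishes business, developer, and monitoring events. One possible way to picture that split, sketched in Python under stated assumptions: the `Category` enum, the `CATEGORY_BY_EVENT` mapping, and every event name below are hypothetical examples, not definitions from the talk.

```python
from collections import defaultdict
from enum import Enum
from typing import Any, Dict, List, Tuple

EventRecord = Tuple[str, Dict[str, Any]]  # (event name, payload)


class Category(Enum):
    BUSINESS = "business"
    DEVELOPER = "developer"
    MONITORING = "monitoring"


# Hypothetical mapping from event name to the audience that cares about it.
CATEGORY_BY_EVENT: Dict[str, Category] = {
    "add_to_cart": Category.BUSINESS,
    "product_viewed": Category.BUSINESS,
    "unhandled_exception": Category.DEVELOPER,
    "request_latency_ms": Category.MONITORING,
}


def group_by_category(events: List[EventRecord]) -> Dict[Category, List[EventRecord]]:
    """Group events by category, preserving their original order."""
    grouped: Dict[Category, List[EventRecord]] = defaultdict(list)
    for name, payload in events:
        category = CATEGORY_BY_EVENT.get(name)
        if category is None:
            continue  # events without a known category are skipped in this sketch
        grouped[category].append((name, payload))
    return dict(grouped)


sample: List[EventRecord] = [
    ("product_viewed", {"product_id": "p-42"}),
    ("request_latency_ms", {"path": "/cart", "value": 180}),
    ("add_to_cart", {"product_id": "p-42"}),
    ("unhandled_exception", {"service": "checkout"}),
    ("unknown_event", {}),
]

grouped = group_by_category(sample)
```

Here the same stream of events is split so that each audience (business stakeholders, developers, monitoring) sees only its own category.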

Speaker bio

Akash Mishra currently works as a Data Engineer at Badoo Trading Limited and has more than 4 years of experience building large-scale big data applications for various clients of ThoughtWorks Technologies. He has production experience with big data technologies such as Spark, Hadoop, Mesos, etc. He is a passionate developer with a deep interest in distributed systems. He has co-organised Big Data meetups for Pune & NCR, has given talks at meetups and Geek Night, and has contributed to the Apache Spark project.

