The Fifth Elephant 2016

India's most renowned data science conference

Don’t just build a data lake, build a data powerhouse.

Akash Mishra

@sleepythread

Companies are now trying to become data-oriented and to make decisions based on data.

The first step towards data-oriented decision making is to collect data. The data lake has become one of the recent buzzwords in the big data industry. Most of the time, companies first try to build a data lake that will contain all their data. More often than not, dumping data into the data lake translates into exporting all the data from various RDBMS databases [e.g. Orders, Inventory] and scraping all the logs into the data lake. Once all the relevant data is in the data lake, various processing applications are written to extract information out of the source data. This approach has many problems associated with it [e.g. huge upfront cost, missing information that is not currently tracked, etc.].
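As a rough illustration of the dumping approach (my own sketch, not from the talk; the table names, connection string and output path are hypothetical), a typical ingestion job simply copies whole RDBMS tables into the lake's storage:

```python
# A minimal sketch of the "dump everything" style of ingestion.
# Table names, connection string and output path are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:pass@db-host/shop")

for table in ["orders", "inventory", "customers"]:
    # Full table export; re-run periodically, so the lake always lags the source
    # and only captures whatever the source systems happen to store.
    df = pd.read_sql_table(table, engine)
    df.to_parquet(f"/data-lake/raw/{table}.parquet", index=False)
```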

In this talk I will propose another approach for data-driven systems where, instead of dumping all the data into a central location, we identify the events/interactions/facts in the company [e.g. an add-to-cart event, viewing a product, etc.] and store them for processing. I will explain how this approach is much more result-oriented and agile than the dumping approach.
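To make the contrast concrete, here is a minimal sketch of capturing an add-to-cart interaction as an explicit event at the moment it happens, rather than reconstructing it later from database rows. This is my own illustration, not the speaker's implementation; the field names and the append-only log are assumptions.

```python
# A minimal sketch of event/interaction tracking.
# Field names and the append-only JSON-lines log are illustrative assumptions.
import json
import time
import uuid

def emit_event(event_type, payload, log_path="/data-lake/events/events.jsonl"):
    """Append one immutable fact to an append-only event log."""
    event = {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,      # e.g. "add_to_cart", "product_viewed"
        "timestamp": time.time(),
        "payload": payload,
    }
    with open(log_path, "a") as log:
        log.write(json.dumps(event) + "\n")
    return event

# Usage: record the interaction as it happens, with exactly the fields that matter.
emit_event("add_to_cart", {"user_id": 42, "product_id": "SKU-123", "quantity": 1})
emit_event("product_viewed", {"user_id": 42, "product_id": "SKU-456"})
```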

Outline

  • Data Lake, the traditional way:
    • Some current architectures for building a data lake.
    • Problems associated with this approach.
    • A real-life example.
  • What are events/interactions/facts?
    • Explaining the terminology.
    • Reasons to track them.
    • Defining:
      • Business events
      • Developer events
      • Monitoring events
  • Use Case Driven Development
  • Proposed Architecture
  • Benefits of the proposed architecture:
    • Business stakeholders
    • Developers
    • Monitoring

Speaker bio

Akash Mishra is currently working as a Data Engineer at Badoo Trading Limited and has more than 4 years of experience building large-scale big data applications for various clients of ThoughtWorks Technologies. He has production experience with big data technologies like Spark, Hadoop, Mesos, etc. He is a passionate developer with a deep interest in distributed systems. He has co-organised Big Data Meetups for Pune & NCR, given various talks at meetups and Geek Night, and contributed to the Apache Spark project.

Up next

Distributed Computing Abstractions for Big Data Science

Vijay Srinivas Agneeswaran, Ph.D

The data science field has made significant advances in the last few years, with a renewed focus on getting data science to work at scale. The talk shall outline the distributed computing abstractions required to realize data science at scale. The Resilient Distributed Dataset (RDD) abstraction provided by Spark is becoming a de-facto approach for big data science. However, Apache Flink and, more recently, Concord have emerged as interesting alternatives to Spark and provide streaming dataflow abstractions: while Spark can achieve real-time analytics by mini-batching, Flink offers event streaming as a first-class abstraction and provides exactly-once guarantees.

TensorFlow also provides a dataflow abstraction for deep learning networks. TensorFlow has recently released a distributed version that uses gRPC or integrates with cluster management systems such as Kubernetes.

Graph processing abstractions are useful for realizing complex algorithms on large real-life power-law graphs such as the Twitter or LinkedIn graphs. GraphLab and Titan are the prominent graph processing systems. GraphLab provides an efficient partitioning mechanism to split a large graph across a cluster of nodes and run algorithms at scale. It must be noted that common machine learning algorithms such as clustering or classification, as well as deep learning, can be realized on top of graph processing abstractions. The Titan graph DB has very good integration with several NoSQL data sources, including Cassandra and HBase, as well as with processing engines for machine learning, including Spark, Giraph and Hadoop. We also outline our experience of implementing machine learning and deep learning algorithms over many of these abstractions.
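For reference, this is what the RDD abstraction mentioned above looks like in practice: a minimal PySpark sketch (my own illustration, not from the talk) in which a dataset is treated as an immutable, partitioned collection transformed with functional operators.

```python
# A minimal PySpark sketch of the RDD abstraction: an immutable,
# partitioned collection transformed with functional operators.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

# Distribute a small collection across the cluster, then transform and aggregate it.
events = sc.parallelize(["view", "cart", "view", "purchase", "view"])
counts = (events
          .map(lambda e: (e, 1))
          .reduceByKey(lambda a, b: a + b)
          .collect())

print(counts)  # e.g. [('view', 3), ('cart', 1), ('purchase', 1)]
sc.stop()
```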

Jun 9, 2016
