The Fifth Elephant 2014

A conference on big data and analytics

Spot the model hiding in the Big Data

Submitted by Ashok Banerjee (@ashokbanerjee) on Wednesday, 30 April 2014

Technical level

Beginner

Section

Full talk

Status

Submitted

Vote on this proposal

Total votes:  +34

Objective

This talk aims to help businesses avoid expensive, incorrect decisions that stem from a poor understanding of the underlying models. I shall discuss ways to understand a phenomenon by triangulating across visualizations, an understanding of the underlying model, and experimentation.

Description

As the volume of data increases, we as humans need abstractions. At the first level we resort to aggregate measures. However, not all aggregate measures are meaningful for all phenomena. Picking the wrong aggregate measure and then fine-tuning parameters may over-fit the observed distribution, yet perform poorly on predictions and lead to fatally flawed conclusions. To understand the models we resort to visualizations and segmentation, then pick a mathematical model, and finally parametrize it.
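As a small illustration of an aggregate measure that misleads, the sketch below compares the mean and the median of a heavy-tailed sample. The log-normal "latency" data and the `summarize` helper are assumptions made up for this example, not material from the talk.

```python
import random
import statistics


def summarize(samples):
    """Return (mean, median) of a list of numbers."""
    return statistics.mean(samples), statistics.median(samples)


# Hypothetical data: 10,000 latencies drawn from a heavy-tailed
# log-normal distribution (mu=0, sigma=1.5). Seeded for repeatability.
rng = random.Random(42)
latencies = [rng.lognormvariate(0.0, 1.5) for _ in range(10_000)]

mean, median = summarize(latencies)

# A few very large values pull the mean far above the typical value,
# so the mean alone says little about what most observations look like.
print(f"mean={mean:.2f}  median={median:.2f}")
```

For a symmetric distribution such as the Normal the two numbers would nearly coincide; here they differ by a large factor, which is a hint that a different model (and a different summary) is needed.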

In this talk I shall discuss the most common models seen repeatedly in nature, and some techniques I use to help me spot the underlying model.

We shall start this discussion with:
1) Bernoulli trials (basic coin flips etc.)
2) Aggregate statistics and the Normal distribution, as seen in predictions of time in MapReduce (as the number of Mappers increases)
3) Exponential models: radioactive decay, word-of-mouth growth, epidemic propagation, and why we see the value of "e" so often in nature
4) Poisson: arrivals in a queue at the store, incidence of accidents on highways, insurance modelling, website server capacity planning etc.
5) Erlang: queues at call centers, queues of website requests, queues of database requests
6) Brownian motion and random walks: stock markets and quantitative analysis of stocks (if time permits)
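To make the link between the exponential and Poisson items concrete, here is a minimal sketch that simulates arrivals whose gaps are exponentially distributed and counts how many land in each unit interval. The arrival rate of 4 per interval, the seed, and the function name `arrivals_per_interval` are illustrative assumptions. One practical way to spot a Poisson model is that the mean and variance of the counts are roughly equal.

```python
import random
import statistics


def arrivals_per_interval(rate, intervals, rng):
    """Count arrivals in each unit-length interval of a simulated
    Poisson process with the given rate (arrivals per interval)."""
    counts = [0] * intervals
    t = rng.expovariate(rate)          # time of the first arrival
    while t < intervals:
        counts[int(t)] += 1            # bucket the arrival by interval
        t += rng.expovariate(rate)     # exponential gap to the next one
    return counts


# Assumed parameters for illustration only.
rng = random.Random(7)
counts = arrivals_per_interval(rate=4.0, intervals=5_000, rng=rng)

mean_count = statistics.mean(counts)
var_count = statistics.pvariance(counts)

# For Poisson counts both values should be close to the rate.
print(f"mean={mean_count:.2f}  variance={var_count:.2f}")
```

If the observed variance were far larger than the mean, that would suggest the arrivals are bursty and a plain Poisson model is the wrong starting point.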

Starting with the right model in mind often lets the system converge rapidly to the required model.

Requirements

  • Basic mathematics and some idea of statistics

Speaker bio

Ashok Banerjee is the CTO of EBusiness at Symantec. Ashok has 23 patents approved to date and counting. Prior to Symantec, Ashok led Engineering teams at Google, Twitter, Flipkart etc.

Ashok takes an interest in Large Data Systems (Databases and alternative databases such as NoSQL, Message Systems), Parallel Computing, Distributed Systems, Fault Tolerant Computing, Recommendation Systems, Supply Chain, Mathematical Models and Investments.

On the non-work side Ashok enjoys sailing, windsurfing, horse riding, German Shepherd dogs and soccer.

Experience Summary (reverse chronologically)

Ashok today leads the EBusiness team at Symantec. Before that:

  • At Flipkart he led the technology team for Data Platform and Analytics, and also led the largest online Supply Chain infrastructure in India.
  • At Google he led a large-scale data warehouse infrastructure that converts SQL (approximately) into execution on a platform built on MapReduce, GFS and columnar compressed data, using block-oriented computing. This ran at the scale of many billions of rows added per day (the exact number cannot be disclosed).
  • At Google he also led the payment processing infrastructure that processes payments for AdWords, AdSense, Checkout and Google Apps.
  • Prior to that he led engineering efforts at BEA WebLogic.

Comments

  • 1
    Chandra B (@bchandy) 4 years ago

    This is interesting.

  • 1
    Ashok Banerjee (@ashokbanerjee) Proposer 4 years ago

    Thank you Chandra!

  • 1
    Kat Bri (@katbri) 4 years ago

    When is this going to take place?

  • 1
    Abhishekhar Prasad (@abhishekhar) 4 years ago

    looking forward to it.

  • 1
    Inder Singh (@indersingh) 4 years ago

    Looking forward to this.

  • 1
    Amit Kapoor (@amitkaps) 4 years ago

    Really like the topic and looking forward to it.

    Though I don't think understanding the pattern behind the data is only about mathematical models. Data abstraction (e.g. aggregate stats, parametrics), visual abstraction (e.g. visualization) and symbolic abstraction (e.g. mathematical model) can all work together in an interactive manner to help us gain insights into the system. Bret Victor showcases this idea brilliantly in his essay "Up and Down the Ladder of Abstraction". Just some food for thought.

  • 1
    Ashok Banerjee (@ashokbanerjee) Proposer 4 years ago

    Hi Amit, nice article. Agreed that both Data Abstraction and Visual Abstraction are important. I will see how to incorporate these briefly but explicitly. The usual flow I follow: Visual Abstractions help with the idea or insight, then I move to the Symbolic (model structure), and then look for parameters.
