Spot the model hiding in the Big Data
Submitted by Ashok Banerjee (@ashokbanerjee) on Wednesday, 30 April 2014
This talk is intended to help businesses avoid expensive, incorrect decisions that stem from a poor understanding of the underlying models. I shall discuss ways to understand a phenomenon by triangulating across visualizations, an understanding of the underlying model, and experimentation.
As the volume of data increases, we as humans need abstractions. At the first level we resort to aggregate measures. However, not all aggregate measures are meaningful for all phenomena. Picking the wrong aggregate measure and then fine-tuning parameters may over-fit the observed distribution yet predict poorly, leading to fatally flawed conclusions. To understand the models we resort to visualization and segmentation, then pick a mathematical model, and finally parametrize it.
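As a rough illustration of an aggregate measure that misleads, the sketch below compares mean and median on two synthetic datasets. The distributions, parameters, and the `summarize` helper are assumptions chosen for the example, not material from the talk.

```python
import random
import statistics

def summarize(samples):
    """Return (mean, median) of a list of numbers."""
    return statistics.mean(samples), statistics.median(samples)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Symmetric, bell-shaped data: mean and median land close together.
normal_like = [rng.gauss(100, 15) for _ in range(10_000)]

# Heavy-tailed data (Pareto, shape 1.2, an assumed example):
# a handful of very large values pull the mean far above the median.
heavy_tailed = [rng.paretovariate(1.2) for _ in range(10_000)]

normal_mean, normal_median = summarize(normal_like)
heavy_mean, heavy_median = summarize(heavy_tailed)

print(f"normal-like : mean={normal_mean:.2f} median={normal_median:.2f}")
print(f"heavy-tailed: mean={heavy_mean:.2f} median={heavy_median:.2f}")
```

On the bell-shaped data either measure describes a "typical" value; on the heavy-tailed data the mean is dominated by a few extreme samples, so reporting it alone would misrepresent most observations.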
In this talk I shall discuss the most common models seen repeatedly in nature, and some techniques I use to help me spot the underlying model.
The discussion will cover:
1) Bernoulli Trials (basic coin flips etc.)
2) Aggregate statistics and the Normal distribution, as seen across predictions of time in MapReduce (as Mappers increase)
3) Exponential Models: radioactive decay, word-of-mouth growth, epidemic propagation, and why we see the value of "e" so often in nature
4) Poisson: arrivals in a queue at the store, incidence of accidents on highways, insurance modelling, website server capacity planning etc.
5) Erlang: queues at call centers, queues of website requests, queues of database requests
6) Brownian Motion and Random Walks: stock markets and quantitative analysis of stocks (if time permits)
Starting with the right model in mind often allows the system to converge rapidly to the required models.
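One simple way to check whether count data fits the Poisson model in item 4 is to compare its variance with its mean, since the two are equal for a Poisson distribution. The sketch below simulates arrivals with exponentially distributed gaps and computes that ratio. The arrival rate, the number of intervals, and the `arrivals_per_interval` helper are illustrative assumptions.

```python
import random
import statistics

def arrivals_per_interval(rate, n_intervals, rng):
    """Simulate arrivals with exponential inter-arrival gaps and
    count how many fall in each unit-length interval."""
    counts = [0] * n_intervals
    t = rng.expovariate(rate)          # time of the first arrival
    while t < n_intervals:
        counts[int(t)] += 1            # bucket the arrival by its interval
        t += rng.expovariate(rate)     # wait for the next arrival
    return counts

rng = random.Random(7)  # fixed seed so the run is repeatable
counts = arrivals_per_interval(rate=4.0, n_intervals=5_000, rng=rng)

mean = statistics.mean(counts)
variance = statistics.pvariance(counts)
dispersion = variance / mean  # close to 1 suggests Poisson-like counts

print(f"mean={mean:.3f} variance={variance:.3f} dispersion={dispersion:.3f}")
```

A ratio well above 1 on real data hints at bursty arrivals, and a ratio well below 1 at arrivals that are more regular than random. In either case a plain Poisson model would be a poor starting point.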
- Basic mathematics and some idea of statistics
Ashok Banerjee is the CTO of EBusiness at Symantec. Ashok has 23 patents approved to date and counting. Prior to Symantec, Ashok led engineering teams at Google, Twitter, Flipkart etc.
Ashok is interested in Large Data Systems (databases and alternative databases such as NoSQL, Message Systems), Parallel Computing, Distributed Systems, Fault Tolerant Computing, Recommendation Systems, Supply Chain, Mathematical Models and Investments.
On the non-work side Ashok enjoys sailing, windsurfing, horse riding, German Shepherd dogs and soccer.
Experience Summary (reverse chronologically)
- Ashok today leads the EBusiness team at Symantec.
- At Flipkart he led the technology team for Data Platform and Analytics, and also led the largest online Supply Chain infrastructure in India.
- At Google he led a large-scale data warehouse infrastructure that converts SQL (approximately) into execution on a platform built on MapReduce, GFS and columnar compressed data, using block-oriented computing. This ran at the scale of many billions of rows added per day (the exact number cannot be disclosed).
- At Google Ashok also led the payment processing infrastructure, which processes payments for AdWords, AdSense, Checkout and Google Apps.
- Prior to that, Ashok led engineering efforts at BEA WebLogic.