The Fifth Elephant 2014

A conference on big data and analytics

In 2014, infrastructure components such as Hadoop, Berkeley Data Stack and other commercial tools have stabilized and are thriving. The challenges have moved higher up the stack from data collection and storage to data analysis and its presentation to users. The focus for this year’s conference is on analytics: the infrastructure that powers analytics and how analytics is done.

Talks will cover various forms of analytics including real-time and opportunity analytics, and technologies and models used for analyzing data.

Proposals will be reviewed using 5 criteria:
Domain diversity – proposals will be selected from different domains – medical, insurance, banking, online transactions, retail. If there is more than one proposal from a domain, the one which meets the editorial criteria will be chosen.
Novelty – what has been done beyond the obvious. Insights – what insights does the proposal share with the audience that they did not know earlier.
Practical versus theoretical – we are looking for applied knowledge. If the proposal covers material that can be looked up online, it will not be considered.
Conceptual versus tools-centric – tell us why, not how. Tell the audience what was the philosophy underlying your use of an application, not how an application was used.
Presentation skills – proposer’s presentation skills will be reviewed carefully and assistance provided to ensure that the material is communicated in the most precise and effective manner to the audience.



For queries about proposals / submissions, write to


  1. Data Collection and Transport – e.g., Opendatatoolkit, Scribe, Kafka, RabbitMQ, etc.

  2. Data Storage, Caching and Management – Distributed storage (such as Gluster, HDFS) or hardware-specific (such as SSD or memory) or databases (PostgreSQL, MySQL, Infobright) or caching/storage (Memcache, Cassandra, Redis, etc).

  3. Data Processing, Querying and Analysis – Oozie, Azkaban, scikit-learn, Mahout, Impala, Hive, Tez, etc.

  4. Real-time analytics

  5. Opportunity analytics

  6. Big data and security

  7. Big data and internet of things

  8. Data Usage and BI (Business Intelligence) in different sectors.

Please note: the technology stacks mentioned above indicate the latest technologies that will be of interest to the community. Talks should not be on the technologies per se, but on how these have been used and implemented in various sectors, enterprises and contexts.

Hosted by

The Fifth Elephant - known as one of the best data science and Machine Learning conferences in Asia - has transitioned into a year-round forum for conversations about data and ML engineering; data science in production; data security and privacy practices.

Ashok Banerjee


Spot the model hiding in the Big Data

Submitted Apr 30, 2014

This talk is intended to help businesses avoid expensive, incorrect decisions based on a poor understanding of the underlying models. In this talk I shall discuss ways to understand a phenomenon by triangulating across visualizations, an understanding of the underlying model, and experimentation.


As the volume of data increases, we as humans need abstractions. At the first level we resort to aggregate measures. However, not all aggregate measures are meaningful for all phenomena. Picking the wrong aggregate measure and then fine-tuning parameters may over-fit the observed distribution yet predict poorly, leading to fatally flawed conclusions. To understand the models, we resort to visualization and segmentation, then pick a mathematical model, and finally parametrize it.
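As a minimal illustration of an aggregate measure that misleads, the sketch below compares the mean and the median of a skewed sample. The latency values are hypothetical and chosen only to show the effect; they are not data from the talk.

```python
import statistics

# Hypothetical response times in milliseconds: mostly fast, with two slow outliers.
latencies_ms = [12, 14, 15, 13, 16, 14, 15, 13, 900, 1200]

# The mean is dragged far from the typical request by the two outliers.
mean_ms = statistics.mean(latencies_ms)

# The median stays close to what most requests actually experience.
median_ms = statistics.median(latencies_ms)

print(f"mean   = {mean_ms} ms")
print(f"median = {median_ms} ms")
```

Here the mean suggests a typical request takes a couple of hundred milliseconds, while the median shows most requests finish in about fifteen. Which summary is appropriate depends on the shape of the underlying distribution.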

In this talk I shall discuss the most common models seen repeatedly in nature, and some techniques I use to help me spot the underlying model.

We shall start this discussion with:
1) Bernoulli trials (basic coin flips etc.)
2) Aggregate statistics and the Normal distribution, as seen across predictions of time in MapReduce (as Mappers increase)
3) Exponential models: radioactive decay, word-of-mouth growth, epidemic propagation, and why we see the value of “e” so often in nature
4) Poisson: arrivals in a queue at the store, incidents of accidents on highways, insurance modelling, website server capacity planning etc.
5) Erlang: queues at call centers, queues of website requests, queues of database requests
6) Brownian motion and random walks: stock markets and quantitative analysis of stocks (if time permits)
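To make the Poisson item concrete, here is a rough capacity-planning sketch. It assumes request arrivals per second follow a Poisson distribution; the mean rate of 20 requests per second and the helper names `poisson_pmf` and `prob_more_than` are hypothetical and used only for illustration.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k arrivals in an interval whose mean is `rate`."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def prob_more_than(capacity: int, rate: float) -> float:
    """Probability that arrivals in one interval exceed `capacity`."""
    return 1.0 - sum(poisson_pmf(k, rate) for k in range(capacity + 1))


# Hypothetical load: 20 requests per second on average.
mean_rate = 20.0
for capacity in (20, 30, 40):
    overflow = prob_more_than(capacity, mean_rate)
    print(f"capacity {capacity}/s -> P(overflow) = {overflow:.6f}")
```

Provisioning for exactly the mean rate leaves a large chance of overflow in any given second; under the Poisson assumption, the overflow probability falls quickly as headroom is added.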

Starting with the right model in mind often allows the system to converge rapidly to the required models.
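One way to see why the right starting model helps: if a phenomenon is assumed to be exponential, fitting it reduces to a straight-line fit in log space. The sketch below uses hypothetical, noiseless observations of y = 100 * exp(-0.5 * t); the function name `fit_exponential` and the data are assumptions for illustration, not the speaker's method.

```python
import math

# Hypothetical noiseless observations of y = 100 * exp(-0.5 * t).
times = [0.0, 1.0, 2.0, 3.0, 4.0]
values = [100.0 * math.exp(-0.5 * t) for t in times]


def fit_exponential(ts, ys):
    """Least-squares fit of y = a * exp(b * t) via a straight line in log space."""
    logs = [math.log(y) for y in ys]
    n = len(ts)
    mean_t = sum(ts) / n
    mean_log = sum(logs) / n
    # Ordinary least-squares slope and intercept of log(y) against t.
    covariance = sum((t - mean_t) * (lg - mean_log) for t, lg in zip(ts, logs))
    variance = sum((t - mean_t) ** 2 for t in ts)
    slope = covariance / variance
    intercept = mean_log - slope * mean_t
    return math.exp(intercept), slope


amplitude, rate = fit_exponential(times, values)
print(f"amplitude = {amplitude:.3f}, rate = {rate:.3f}")
```

With the exponential form assumed up front, two parameters recovered from a closed-form line fit describe the whole curve. Real data would be noisy, so the recovered parameters would only approximate the true ones.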


  • Basic mathematics and some idea of statistics

Speaker bio

Ashok Banerjee is the CTO of EBusiness at Symantec. Ashok has 23 patents approved to date and counting. Prior to Symantec, Ashok led Engineering teams at Google, Twitter, Flipkart etc.

Ashok takes an interest in Large Data Systems (databases and alternative databases such as NoSQL, message systems), Parallel Computing, Distributed Systems, Fault Tolerant Computing, Recommendation Systems, Supply Chain, and Mathematical Models and Investments.

On the non-work side, Ashok enjoys sailing, windsurfing, horse riding, German Shepherd dogs and soccer.

Experience Summary (reverse chronologically)

  • Ashok today leads the EBusiness team at Symantec.

  • He led the technology team for Data Platform and Analytics at Flipkart and has also led the largest online Supply Chain infrastructure in India (Flipkart).

  • At Google he led a large-scale data warehouse infrastructure which converts SQL (approximately) into execution on a platform built on MapReduce, GFS and columnar compressed data using block-oriented computing. This was at the scale of many billion rows added per day (cannot disclose how many billions).

  • At Google Ashok also led the payment processing infrastructure which processes payments for AdWords, AdSense, Checkout and Google Apps.

  • Prior to that, Ashok led engineering efforts at BEA WebLogic.


