The Fifth Elephant 2012

Finding the elephant in the data.

What are your users doing on your website or in your store? How do you turn the piles of data your organization generates into actionable information? Where do you get complementary data to make yours more comprehensive? What tech, and what techniques?

The Fifth Elephant is a two-day conference on big data.

Early Geek tickets are available from fifthelephant.doattend.com.

The proposal funnel below will enable you to submit a session and vote on proposed sessions. It is good practice to introduce yourself and share details about your work, as well as the subject of your talk, when proposing a session.

Each community member can vote for or against a talk. A vote from each member of the Editorial Panel is equivalent to two community votes. Both types of votes will be considered for final speaker selection.

It’s useful to keep a few guidelines in mind while submitting proposals:

  1. Describe how to use something that is available under a liberal open source license. Participants can use this without having to pay you anything.

  2. Tell a story of how you did something. If it involves commercial tools, please explain why they made sense.

  3. Buy a slot to pitch whatever commercial tool you are backing.

Speakers will get a free ticket to both days of the event. Proposers whose talks are not on the final schedule will be able to purchase tickets at the Early Geek price of Rs. 1800.

Hosted by

The Fifth Elephant - known as one of the best data science and machine learning conferences in Asia - has transitioned into a year-round forum for conversations about data and ML engineering; data science in production; and data security and privacy practices.

Shashwat Agarwal

@shashwatag

Scaling Data (DB -> Caching -> Archiving -> Sharding and NoSQL)

Submitted Jun 16, 2012

In this talk I will go over the stages of scaling for OLTP data processing. Each level of scaling demands a step-function increase in effort, and in a startup environment each should be deferred within reason.

We will walk through standard database best practices; scaling a single database; sliding time windows and archiving; and then, when the data within the sliding window itself becomes too big, sharding strategies for the database. We will also talk about caching and distributed caches, time permitting.
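
As a minimal sketch of the sliding-time-window idea (the table, column names, and 90-day window here are hypothetical, not from the talk), old rows can be moved out of the hot table into an archive table in one transaction:

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical schema: a hot "orders" table plus an "orders_archive" table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT, amount REAL);
    CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, created_at TEXT, amount REAL);
""")

def archive_old_rows(conn, window_days=90):
    """Keep only the last `window_days` of rows in the hot table;
    move everything older into the archive table."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=window_days)).isoformat()
    with conn:  # one transaction: copy, then delete
        conn.execute(
            "INSERT INTO orders_archive SELECT * FROM orders WHERE created_at < ?",
            (cutoff,))
        conn.execute("DELETE FROM orders WHERE created_at < ?", (cutoff,))

# One row well outside the window, one recent row; then archive.
old = (datetime.now(timezone.utc) - timedelta(days=365)).isoformat()
new = datetime.now(timezone.utc).isoformat()
conn.execute("INSERT INTO orders VALUES (1, ?, 10.0)", (old,))
conn.execute("INSERT INTO orders VALUES (2, ?, 20.0)", (new,))
archive_old_rows(conn)

hot = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
cold = conn.execute("SELECT COUNT(*) FROM orders_archive").fetchone()[0]
```

The hot table stays small enough to index and query cheaply, while full history remains queryable (more slowly) in the archive.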

Outline

Agenda:

Database Scaling
-- Access patterns (lookup, range scan, aggregation, joins)
-- Dimensions of scaling: Size, Concurrent Queries, Latency
-- Indexes
-- Dictionary
-- DB Drivers, Prepared Statement
-- Master-Slave (staleness tolerance)
-- Caching and Cache Coherency
-- Distributed Cache and consistent hashing (memory is expensive)
-- Replication and Replication Lag
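
The distributed-cache bullet above rests on consistent hashing. A hedged sketch (node names and virtual-node count are illustrative): keys map to the nearest node clockwise on a hash ring, so adding or removing a cache node only remaps a small fraction of keys instead of invalidating the whole cache.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring for routing keys to cache nodes."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):  # virtual nodes smooth the key distribution
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        # A stable hash (not Python's randomized hash()) so routing
        # agrees across processes and restarts.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._hashes, h) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
owner = ring.node_for("user:42")
```

Routing is deterministic: the same key always lands on the same node until the node set changes.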

--- When all the above tricks fail: archive data and keep a sliding time window in the DB
-- Hadoop, Map Reduce/Hive

--- When the sliding window itself becomes too small to be useful - sharding
-- Query decomposition (to shards)
-- Sharded query execution
-- Cross Shard result assimilation (some limitations)
-- Eliminate partitions to scale
-- Not all hashes are equal here (think of range scan and disk block reads)
-- Fat rows problem, Fat keys problem, surrogate keys
-- NoSQL mostly key-value stores when it is better (if time permits some depth here)
-- Beyond this point: buffered writes, queues implemented outside the DB, etc. We will share some strategies point blank but will not have time to discuss them in depth
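
The sharding bullets above - query decomposition, sharded execution, and cross-shard result assimilation - can be sketched with in-memory "shards" (the shard count, key scheme, and query names are hypothetical):

```python
import hashlib

# Hypothetical in-memory "shards": each shard maps user_id -> amount.
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(user_id):
    """Decompose a query: route a key to exactly one shard via a stable hash."""
    digest = hashlib.sha1(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(user_id, amount):
    shards[shard_for(user_id)][user_id] = amount

def get(user_id):
    # Point lookup: partitions are eliminated -- only one shard is touched.
    return shards[shard_for(user_id)].get(user_id)

def total_spend():
    # Cross-shard aggregation: scatter the query to every shard,
    # then assimilate (sum) the partial results.
    return sum(sum(shard.values()) for shard in shards)

for uid, amt in [(1, 10.0), (2, 20.0), (3, 30.0)]:
    put(uid, amt)

lookup = get(2)              # touches exactly one shard
grand_total = total_spend()  # scatter-gather across all shards
```

Note the hash choice matters, as the outline says: a uniform hash like this spreads load but destroys key locality, so range scans must hit every shard; range-based partitioning keeps scans local at the cost of hotspots.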

Requirements

You should have used a database before and run into some scaling issues. If you haven't hit data scaling issues at any point in your career, this talk may be a bit hard to relate to.

Speaker bio

Vivek YS leads the kernel layer for the Data Platform team at Flipkart, where he has led the overall caching and distributed caching layers. He is also the resident expert for all Linux and kernel issues.

Shashwat Agarwal leads the Data sub-team of the Data Platform effort at Flipkart and has built custom, fit-for-purpose large-scale message systems, notification platforms, etc.

Ashok Banerjee is VP of Data Platform and Supply Chain Engineering at Flipkart. Prior to this Ashok has worked at Twitter in San Francisco and Google in Mountain View.

Experience Summary (reverse chronologically)

- At Flipkart, Ashok leads the technology team for the Data Platform and the largest online supply chain infrastructure in India.
- At Google, he led a large-scale data warehouse infrastructure which converts (approximately) SQL into execution on a platform built on MapReduce, GFS, and columnar compressed data using block-oriented computing, at the scale of many billions of rows added per day (cannot disclose how many billions).
- Also at Google, Ashok led the payment processing infrastructure which processes payments for AdWords, AdSense, Checkout and Google Apps.
- At BEA, he worked on WebLogic Server and led infrastructure teams on the EJB container, web container, classloading, application deployment within a server, etc.
- At Oracle, Ashok led the Oracle Application Server clustering infrastructure and also worked on the EJB container and RMI-IIOP protocols.

Ashok takes an interest in large data systems (databases and alternative databases - NoSQL, message systems), parallel computing, distributed systems, fault-tolerant computing, recommendation systems, supply chain, mathematical models and investments.

On the non-work side, Ashok enjoys sailing, windsurfing, horse riding, German Shepherd dogs and soccer.

