Scaling Data (DB -> Caching -> Archiving -> Sharding and NoSQL)
Submitted by Shashwat Agarwal (@shashwatag) on Saturday, 16 June 2012
Big Data Infrastructure & Processing
In this talk I will go over the stages of scaling for OLTP data processing. Each level of scaling is a step-function increase in effort and should, within reason, be deferred in a startup ecosystem.
We will walk through standard database best practices, scaling a database, and keeping a sliding time window in the database while archiving older data. When the data within the sliding window itself becomes too big, we need a sharding strategy for the database. We will also talk about caching and distributed caches if time permits.
Database Scaling
-- Access Pattern (lookup, range scan, aggregation, joins)
-- Dimensions of scaling: Size, Concurrent Queries, Latency
-- Indexes
-- Dictionary
-- DB Drivers, Prepared Statement
-- Master-Slave (staleness tolerance)
-- Caching and Cache Coherency
-- Distributed Cache and consistent hashing (memory is expensive)
-- Replication and Replication Lag
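To make the "Distributed Cache and consistent hashing" item concrete, here is a minimal sketch of a consistent hash ring. It is not the implementation used at Flipkart; the class name `ConsistentHashRing`, the node names and the choice of 100 virtual points per node are all assumptions made for illustration.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Maps cache keys to nodes; removing a node only remaps that node's keys."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas  # virtual points per node (illustrative value)
        self._points = []         # sorted hash positions on the ring
        self._owner = {}          # hash position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        # Stable hash so every client computes the same placement.
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            del self._owner[point]
            self._points.remove(point)

    def node_for(self, key):
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


# Hypothetical cache nodes and keys.
ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}
ring.remove("cache-c")
after = {k: ring.node_for(k) for k in keys}
```

With plain `hash(key) % number_of_nodes`, removing a node would remap most keys and empty most of the cache. With the ring, only the keys that lived on the removed node move.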
--- When all the above tricks fail you need to archive data and keep a sliding time window in the DB
-- Hadoop, MapReduce/Hive
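As a rough illustration of the sliding time window, the sketch below moves rows older than a cutoff out of the hot table. The table names `orders` and `orders_archive`, the columns and the sample rows are hypothetical, and a second SQLite table stands in for the archive store (the outline names Hadoop and Hive for that role).

```python
import sqlite3


def archive_older_than(conn, cutoff_day):
    """Move rows older than cutoff_day from the hot table to the archive table."""
    with conn:  # one transaction: the copy and the delete commit together
        conn.execute(
            "INSERT INTO orders_archive SELECT * FROM orders WHERE created_day < ?",
            (cutoff_day,),
        )
        moved = conn.execute(
            "DELETE FROM orders WHERE created_day < ?", (cutoff_day,)
        ).rowcount
    return moved


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, created_day TEXT, amount REAL)"
)
conn.execute(
    "CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, created_day TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2012-01-15", 10.0), (2, "2012-05-20", 25.0), (3, "2012-06-10", 40.0)],
)

# ISO-8601 day strings compare correctly as plain text.
moved = archive_older_than(conn, "2012-05-01")
```

Run periodically with an advancing cutoff, a job like this keeps the hot table bounded to the window while older rows remain available for offline processing.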
--- When even the data within the sliding window is too big: sharding
-- Query decomposition (to shards)
-- Sharded query execution
-- Cross-shard result assimilation (some limitations)
-- Eliminate partitions to scale
-- Not all hashes are equal here (think of range scans and disk block reads)
-- Fat rows problem, fat keys problem, surrogate keys
-- NoSQL, mostly key-value stores, and when it is better (if time permits, some depth here)
-- Beyond this point, think of buffered writes, queues implemented outside the DB, etc. (we will share some strategies point-blank but will not have time to discuss them in depth)
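The sharding items above (query decomposition, partition elimination, cross-shard result assimilation) can be sketched in a few lines. This is a toy model under stated assumptions: each shard is an in-memory dict standing in for a separate database, and `NUM_SHARDS`, `shard_for`, `insert_order`, `orders_for` and `total_revenue` are names invented for the example.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4  # illustrative shard count


def shard_for(customer_id):
    """Hash-based routing with a stable digest so every process agrees."""
    digest = hashlib.md5(str(customer_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# Each "shard" is a plain dict standing in for a separate database.
shards = [defaultdict(list) for _ in range(NUM_SHARDS)]


def insert_order(customer_id, amount):
    shards[shard_for(customer_id)][customer_id].append(amount)


def orders_for(customer_id):
    """Query with the shard key: only one shard is touched."""
    return shards[shard_for(customer_id)].get(customer_id, [])


def total_revenue():
    """Query without the shard key: ask every shard, then merge the partials."""
    partials = [sum(sum(v) for v in shard.values()) for shard in shards]
    return sum(partials)


for cid, amt in [(1, 10.0), (2, 20.0), (1, 5.0), (3, 7.5)]:
    insert_order(cid, amt)
```

Sums merge cleanly across shards. An average does not: each shard would have to return a (sum, count) pair for the coordinator to combine, which is one small example of the assimilation limits. Hashing on the key also scatters adjacent keys across shards, so a range scan has to visit all of them.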
Attendees should have used a database before and seen some issues with scaling. If in your entire career you haven't hit data scaling issues, this may be a bit hard to relate to.
Vivek YS leads the kernel layer for the Data Platform team at Flipkart. He has led the overall caching and distributed caching layers at Flipkart. He is also the resident expert for all Linux and kernel issues.
Shashwat Agarwal leads the Data sub-team for the Data Platform effort at Flipkart and has written custom, fit-for-purpose large message systems, notification platforms, etc.
Ashok Banerjee is VP of Data Platform and Supply Chain Engineering at Flipkart. Prior to this, Ashok worked at Twitter in San Francisco and Google in Mountain View.
Experience Summary (reverse chronologically)
- Ashok today leads the technology team for Data Platform and the largest online Supply Chain infrastructure in India (Flipkart).
- At Google he led a large-scale data warehouse infrastructure which converts SQL (approximately) into execution on a platform built on MapReduce, GFS and columnar compressed data, using block-oriented computing. This was at the scale of many billion rows added per day (cannot disclose how many billions).
- At Google Ashok also led the payment processing infrastructure which processes payments for AdWords, AdSense, Checkout and Google Apps.
- At BEA he worked on WebLogic Server and led infrastructure teams on EJB Container, Web Container, Classloading, Application Deployment within a Server, etc.
- At Oracle Ashok led the Oracle Application Server Clustering infrastructure and also worked on the EJB container and RMI-IIOP protocols.
Ashok is interested in large data systems (databases and alternative databases such as NoSQL, message systems), parallel computing, distributed systems, fault-tolerant computing, recommendation systems, supply chain, mathematical models and investments.
On the non-work side Ashok enjoys sailing, windsurfing, horse riding, German Shepherd dogs and soccer.