arrow_back Qubole Sparklens: understanding the scalability limits of Spark applications
Incremental transform of transactional data models to analytical data models in near real time
Submitted by Govind Pandey (@govind-pandey) on Monday, 26 March 2018
Transactional systems are designed with data models to maximize write throughput across multiple parallel business flows. They evolve iteratively with business and need to react quickly to the changing business landscape to minimize time to market.
Analytical systems, on the other hand, require data models to maximize query throughput over broad, deep and large data volumes.
The need for a platform which transforms from the transactional data model to an analytical data model is well established in the industry. This is currently achieved through two different paradigms. Stream processing at lower latencies and batch processing at higher latencies.
We have solved the same problem through a third paradigm of incremental processing for intermediate latencies (5 minutes to 1 hour).
We considered and dropped implementations of the streaming paradigm either because of a lack of completeness guarantees or the absence of complex join capabilities across a large number of entities.
Our incremental processing platform transforms transactional data models to analytical data models. It provides expressability for complex joins across multiple entities (live with 30) through a Transformation Definition Language (TDL). These complex joins are evaluated incrementally as transactional data changes to periodically update the analytical data model. For near real time use cases, this is done every 5-10 minutes.
Changes to transactional data models are handled through version support in the TDL. These changes are absorbed with a pause and resume of the transformations.
The Flipkart Fulfillment Services Group serves over a million shipments in a day at its peak. Customer delight through a reliable and fast delivery of orders is our primary goal. To succeed in this endeavour, our ground operations depends on live and accurate visibility into the journey of all shipments pan India. Overall data volumes range in 10s of TBs with a change frequency of over 25k QPS at peak. All our transactional systems combined generate mutations with volumes close to 200 GB every second. Our incremental platform is built to handle this scale.
With this platform, we have achieved analytics at low latencies with high completeness without compromising on business agility.
In this talk, I will cover the specifics of our evaluations and our learnings from the journey of building the platform.
- Business and technical need
a. 100% Completeness b. 5 minutes to 1 hour latencies c. Business Agility
- Evaluation and results of existing solutions
a. Existing stream processing implementations b. Existing incremental processing implementations
- Our approach to solving the problem
a. Incremental Transforms at scale for lower latencies b. Metadata c. Processing d. Learnings
a. Live use-cases and Impact
Govind is focussing on Supply Chain Automation, Predictive Optimizations and Actionable Insights by leveraging Artificial Intelligence and in house Big Data platforms at Flipkart. His engineering experience is primarily in building platforms. In the past, he has worked on the inhouse stream processing platform at 247.ai and the Business Activity Monitoring, Business Rules and Process Orchestration platforms at Oracle. He is an Advanced Communicator Silver and Advanced Leader Silver as certified by Toastmasters International.