Scaling up our distributed query workloads using Kafka Streams + RocksDB
The Analytics platform powers the Business iQ product at AppDynamics (now part of Cisco). Business iQ provides real-time, actionable correlations between application performance, user experience, and business outcomes. Business health baselines, anomaly detection, and alerts are all automated and immediately actionable through the use of business metrics and events. The platform itself supports large-scale data processing (tens of TB per day), distributed storage (petabyte scale), and interactive querying (multi-million) of these business events.
In this talk, we will first provide context on the event query patterns in our system, based on the business use cases, and on how we have used Elasticsearch to hold a variety of data documents and support our query workloads. We will then go into the challenges we faced with our existing solution, using real customer use cases, specifically around query scaling, data discrepancies during processing delays, and shifting patterns of data and query workloads. Finally, we will cover the different approaches we discussed to address these challenges and why we ended up using a hybrid store, with Kafka Streams and an embedded local store (RocksDB) as an additional storage layer for more recent data.
The aim of this talk is to walk through the journey of our evolving architecture using actual customer use cases, the lessons we have learned, and best practices for running a large-scale, data-driven application in the cloud (AWS). This talk will be relevant for anyone who is passionate about large-scale distributed processing.
- Quick update on AppDynamics and its business
- Platform requirements for supporting business use cases
- Context on query patterns in the system
- The use of Elasticsearch for holding a variety of data and supporting query workloads
- Challenges faced with the existing solution, using real customer use cases
- Different approaches to address the challenges, and our solution using a hybrid store
- Lessons learnt from running the solution at scale in production environments
- Further improvements and future work
I have more than 13 years of experience working as a software professional with organisations including HP, Adobe Systems, BloomReach, and currently AppDynamics. During these years, I have worked on various technology stacks and different products, ranging from desktop publishing tools, 2D vector graphics rendering, and the Flash platform to complex distributed systems. In the recent past, I worked with a very small team to help build and scale a personalised, multi-tenant e-commerce search platform while supporting 99.99% uptime and 100 ms average latency. I have also written a lightweight API gateway (BloomGateway: https://github.com/bloomreach/bloomgateway) from scratch to support multi-region fallbacks, API-level bucketing, and security with low (zero) deployment/management overhead. Currently, I am working as a Principal Engineer at AppDynamics, helping to build the Business iQ platform.