Previous proposalPersonalized Recommendations for Computational Advertising
Next proposalExpressing complex ETL pipelines using Cascading
Building Streaming platform using Kafka Streams
At Walmart TB’s of data gets generated per day via interactions, transactions by our users on walmart.com and other properties(in-store, jet.com etc). As part of our Customer data strategy we strive to increase Reach, Depth, Freshness to know about more customers, more about customers, and in as real-time as possible. Towards this goal, we need to ingest data as when it is generated and process it to gain insights about our customers.
Our current streaming platform is built using Kafka, Storm & Couchbase. As we plan to ingest more data we observed that lookups in Couchbase from our Storm processes over network is a bottleneck. We have evaluated few technologies like Samza, Flink and Kafka Streams which persist state on the machine that is processing the messages, so that network calls are not necessary.
We have chosen Kakfa Streams as a technology over Samza & Flink. Kafka Streams has borrowed few good ideas from Samza.
And has added few important features like Standby Replica (increases availability) & Interactive Queries (makes state queryable).
We have implemented few useful features on top of Kafka Streams like ::
(1) Storage Policies (Archival / TTL / Compaction) which prevents not-so-recent data occupying disk space. (2) Ability to query state even when task is RESTORATION state (during rebalancing). This increases availability. (3) Currently, in Kafka Streams two-hops are required for asnwering interactive queries, we have developed mechanism so that only one hop is enough. (4) Currently, in Kafka Streams Changelog Kafka Topics have infinite retention time to support restoration of state in case of failures. But if state is huge it is not feasible to have infinite retention time. We have developed mechanism so that it is possible to build state even with finite retention time setting for Changelog Kafka Topics. (5) Added rack-awareness feature that ensures active and standby replica tasks are not scheduled to run on the same rack.
Problem Statement - Build platform that helps in developing stateful applications that ingest events in real-time.
Discuss previous version of Streaming Platform that used Kafka, Storm, Couchbase and discuss its shortcomings.
Discuss few alternatives like Samza, Flink and reasons for our choice of Kafka Streams
Discuss few features we have built on top of Kafka Streams which improves efficiency, availability.
Discuss how we productionalized our streaming platform.
I am Giridhar Addepalli with over 9 years of experience as Software Developer. Currently working as Staff Engineer at WalmartLabs.