Troubleshooting Kafka's socket server
Submitted by Ayyappadas Ravindran Nair (@ayyappa) on Sunday, 31 January 2016
- Understanding the life cycle of a Kafka request
- Understanding how a trivial change (a metrics addition) caused a Kafka cluster to crumble under high load, causing frontend user impact (KAFKA-2664)
The talk is about a Kafka outage that caused frontend user impact. It is very rare at LinkedIn for an outage in a backend messaging system to affect the frontend. The presentation will touch on the Kafka request cycle: we will dissect a fetch request, profile various API calls, and talk about how we fixed the issue.
A good understanding of Kafka and the Kafka ecosystem. We won’t be able to cover Kafka basics.
I lead the “Data Infra Streaming” SRE team at LinkedIn Bangalore. My team takes care of the Kafka, Samza and Zookeeper platforms at LinkedIn. Before joining LinkedIn, I worked for Intuit & Yahoo!. At Yahoo!, I led an SE (Service Engineering) team that took care of the Hadoop platform. A detailed profile can be found here: https://in.linkedin.com/in/ayyappa