Rootconf 2018

On scaling infrastructure and operations

Production Report - Using Apache Flink as a microservice for stateful asynchronous processing

Submitted by Jagadish Bihani on Saturday, 3 March 2018

Section: Crisp Talk
Technical level: Advanced
Status: Waitlisted


This talk explains why we chose Apache Flink as a microservice for stateful asynchronous event processing, the challenges we faced in production and how we solved them, and our recommendations for productionizing applications built on Flink.

Key takeaways:
- Architecture pattern of using Flink or a similar platform as a microservice for stateful async event processing
- In-depth understanding of Flink fault tolerance concepts
- Production issues and challenges faced, with insights on how to solve (and prevent) them

A basic understanding of stream processing will be an advantage.


  • Brief summary of what Flink is and its important terminology
  • Flink as a microservice for asynchronous stateful event stream processing
    • Challenges in doing it in a conventional way
  • Prerequisite concepts
    • Fault tolerance and checkpointing
    • Scalable partitioned state
    • State backend: RocksDB
    • Asynchronous checkpointing details
  • Production Experiences
    • Flink TaskManager failover time tuning
    • Failure detection mechanism
    • Tuning Akka DeathWatch
    • How state leaks happen and how to prevent and monitor them
    • How to clear old state (the result of a state leak) from a running system without downtime
    • How state size and checkpointing can cause processing delays, and how to tune them
  • Recommendations & Summary

Speaker bio

Software architect at Helpshift. Have worked on stream processing, various backend architectures, and end-to-end data pipelines. Also have a good understanding of the systems side of software. More details can be found on



