Rootconf 2018

On scaling infrastructure and operations

Jagadish Bihani

Production Report - Using Apache Flink as a microservice for stateful asynchronous processing

Submitted Mar 3, 2018

This talk explains why we chose Apache Flink as a microservice for stateful asynchronous event processing, the challenges we faced in production and how we solved them, and our recommendations for taking Flink applications to production.

Key takeaways:
- An architecture pattern for using Flink (or a similar platform) as a microservice for stateful asynchronous event processing
- An in-depth understanding of Flink's fault tolerance concepts
- Production issues and challenges we faced, with insights on how to solve (and prevent) them

A basic understanding of stream processing will be an advantage.


  • Brief summary of what Flink is and its important terminology
  • Flink as a microservice for asynchronous stateful event stream processing
    • Challenges in doing it in a conventional way
  • Prerequisite concepts
    • Fault tolerance and checkpointing
    • Scalable partitioned state
    • State backend: RocksDB
    • Asynchronous checkpointing details
  • Production Experiences
    • Flink TaskManager failover time tuning
    • Failure detection mechanism
    • Tuning Akka Deathwatch
    • How state leaks happen and how to prevent and monitor them
    • How to clear old state (the result of a state leak) from a running system without downtime
    • How state size and checkpointing can cause processing delays, and how to tune them
  • Recommendations & Summary

Speaker bio

Software architect at Helpshift. Have worked on stream processing, various backend architectures, and end-to-end data pipelines. Also have a good understanding of the systems side of software. More details can be found on


