Call for round the year submissions for Rootconf in 2020

Submit a proposal at any time in the year on DevOps, infrastructure security, cloud, and distributed systems. We will find you a suitable opportunity to share your work.

Automated hyper-scalable Big data pipelines

Submitted by Prakshi Yadav (@prakshi24) on Oct 18, 2018

Section: Crisp talk Technical level: Intermediate Status: Waitlisted


Every organization wants to process big data as quickly as possible. Cluster computing frameworks play a major role in handling data in such use cases, and to serve long-term goals they need data pipelines built around them. This session walks through solutions for deploying cloud-native, scalable, event-driven, and fault-tolerant data pipelines.


Episource is building a scalable NLP engine that processes thousands of medical charts every day, extracting medical coding opportunities and their dependencies to recommend the best possible ICD-10 codes. The architecture supporting this engine processes high volumes of data daily. This session covers event-driven architectures that reduce infrastructure provisioning costs, scale to meet data volumes that change from day to day, and monitor every data movement in the pipeline. The talk will answer these questions:

Why is there a shift to Event-Driven Architectures? Where do I start with Event-Driven Architecture? What are the building blocks of Event-Driven Architectures?

Our architecture is a fully cloud-based solution built around AWS EMR, but it is far more advanced than the way companies currently use EMR. The session will concentrate on building an architecture that requires no human intervention once it is deployed. Because the data pipeline is fully scripted, it is easy to replicate the architecture for different projects in parallel, and it can handle large shifts in requirements with a few lines of code changes.

Why is automation a key feature of Event-Driven Architectures? Continuous integration of the organization's source code repository with the pipeline saves developers the trouble of updating code in multiple locations. A smart feature of this pipeline is that it is automated end to end: the pipeline triggers automatically when the data is in place, starts processing, notifies end users with the results, and releases its resources at the end. This means you pay for resources only while your data is being processed, not for idle time.
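The trigger-process-release flow described above can be sketched in Python. This is a minimal illustration, not the pipeline presented in the talk: it assumes the trigger is an S3 "object created" notification and that the handler starts a transient EMR cluster. The function name `build_cluster_request`, the bucket and script paths, the instance types, and the release label are all hypothetical placeholders. The function only builds the request; a real handler would pass the result to boto3's `run_job_flow`.

```python
from urllib.parse import unquote_plus


def build_cluster_request(event, script_uri="s3://example-bucket/jobs/process.py"):
    """Build keyword arguments for a transient EMR cluster from an S3 event.

    All concrete values (script_uri, instance types, release label, role
    names) are illustrative assumptions, not taken from the talk.
    """
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Object keys in S3 notifications arrive URL-encoded.
    key = unquote_plus(record["s3"]["object"]["key"])
    input_uri = f"s3://{bucket}/{key}"

    return {
        "Name": "event-driven-pipeline",
        "ReleaseLabel": "emr-5.20.0",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # The cluster shuts down once its steps finish, so nothing
            # is billed for idle time.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "process-input",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", script_uri, input_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# In a Lambda-style handler (assumed, not shown in the proposal):
#     import boto3
#     boto3.client("emr").run_job_flow(**build_cluster_request(event))
```

Keeping the request builder separate from the AWS call makes the event-to-job mapping easy to test without cloud credentials. Notifying end users is left out of this sketch.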

Logging, monitoring, and notification systems are the pillars of any software architecture, so the session will also discuss them. Logging and monitoring at a granular level in a Spark-based pipeline is a big challenge, and the great news is that we solved it.
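The proposal does not say how granular logging was implemented. As a generic illustration only, one common approach is to emit one structured (JSON) log line per record and pipeline stage, so a log aggregator can trace each piece of data. The names `JsonFormatter`, `make_logger`, and `log_stage`, and the `stage` and `record_id` fields, are assumptions made for this sketch.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # 'stage' and 'record_id' are hypothetical fields supplied
            # through the 'extra' argument of the logging call.
            "stage": getattr(record, "stage", None),
            "record_id": getattr(record, "record_id", None),
        }
        return json.dumps(payload)


def make_logger(name="pipeline"):
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def log_stage(logger, stage, record_id, message):
    """Log one event for a single record at a single pipeline stage."""
    logger.info(message, extra={"stage": stage, "record_id": record_id})
```

For example, `log_stage(make_logger(), "parse", "chart-001", "parsed chart")` writes a single JSON line that can be filtered by stage or record identifier.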

Episource’s technical architectural backends are lean and fast. The company can process roughly 10K charts per hour at a cost of a few cents per chart (compared to a human, who can process no more than three charts per hour).

The session will sum up real-world experience with a technology stack composed of AWS services, Apache Spark, and monitoring and logging tools, along with the essentials of building an end-to-end pipeline. No prior technology experience is required to attend this session; just walk in with your data, and the session will hand you the ideas to build event-driven architectures around it.


Speaker bio

My technical background involves designing application architectures that bridge business requirements and technical requirements at Episource, especially architectures that handle Big Data processing gracefully. I design architectures to optimize common quality attributes such as flexibility, scalability, security, and manageability. Specialties: AWS Cloud, Big Data tools, serverless computing, Linux, C and Python, Docker, Ansible, flexibility, and multi-tasking.




  • Zainab Bawa (@zainabbawa) a year ago

    Share your draft slides and preview video by 18 February to close evaluation on your proposal.

  • Zainab Bawa (@zainabbawa) a year ago

    Prakshi, apologies for the gap in communicating regarding the proposal. Thank you for the revised slides. Here is the feedback on the structure and scope of the presentation:

    1. The problem statement “building a scalable NLP pipeline for information extraction from medical discharge summaries” is a problem statement for Episource. It is not a problem statement which resonates with participants at Rootconf because the problem is centred on Episource as a company. Participants don’t want to know how a company solved its problems. They want to know what companies learned in the process of solving problems which can help them in their practice as DevOps, systems engineers, SREs, etc.
    2. What we want is a problem statement, based on the experience of building event-driven architecture, that can be generalized for participants at Rootconf.
    3. One way to rethink this talk is to explain why the decision was made to use this approach to solve the problem. In the current version, you are explaining “how” event-driven architecture is built and “how” it works, whereas you have to show the audience why you decided on this approach. What other approaches could have solved the problem? How did you compare the approaches (in terms of criteria, metrics and benchmarks)? And therefore, why did you choose this approach over the others?
    4. Apart from showing the limitations of event-driven architecture in general, you have to explain how this approach has impacted Episource’s situation – what was the situation before and what is it after implementation? Show metrics and concrete examples.

    In the current form, the presentation comes across as a pitch for Episource’s solution. This is uninteresting for participants at Rootconf. Hence, this proposal needs more work before we can consider it for Rootconf. If you continue to refine the idea and come up with revised slides, we’d be happy to help mentor this for Pune/Delhi/Hyderabad editions of Rootconf. Let us know.
