Automated hyper-scalable Big data pipelines
Submitted by Prakshi Yadav (@prakshi24) on Wednesday, 17 October 2018
Section: Crisp talk Technical level: Intermediate
Processing big data quickly and efficiently is what every organization strives for today. Cluster-computing frameworks play a major role in handling data in such use cases, and to serve long-term goals they require data pipelines built around them. This session walks through solutions for deploying cloud-native, scalable, event-driven, and fault-tolerant data pipelines.
Episource is building a scalable NLP engine to process thousands of medical charts every day and extract medical coding opportunities and their dependencies in order to recommend the best possible ICD10 codes. The architecture supporting this engine is responsible for processing high-volume data on a daily basis. This session will cover event-driven architectures that save the company money on infrastructure provisioning, scale to meet data volumes that change day by day, and are monitored for every bit of data movement through the pipeline. These are a few of the questions this talk will answer:
Why is there a shift to event-driven architectures? Where do I start with an event-driven architecture? What are the building blocks of event-driven architectures?
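As a hedged illustration of the most basic building block — reacting to data arrival rather than polling for it — the sketch below shows a minimal AWS Lambda handler that parses an S3 "object created" notification. The bucket and key names are hypothetical; this is not Episource's actual code, just a minimal example of the event-driven entry point the talk describes.

```python
# Minimal sketch of an event-driven entry point: an AWS Lambda handler
# invoked by an S3 "ObjectCreated" notification. All names illustrative.

def parse_s3_event(event):
    """Extract (bucket, key) pairs from an S3 notification event."""
    records = event.get("Records", [])
    return [
        (r["s3"]["bucket"]["name"], r["s3"]["object"]["key"])
        for r in records
        if r.get("eventSource") == "aws:s3"
    ]

def handler(event, context):
    # In a real pipeline this would kick off processing (e.g. submit an
    # EMR step or enqueue a job); here we just report what arrived.
    objects = parse_s3_event(event)
    for bucket, key in objects:
        print(f"new data: s3://{bucket}/{key}")
    return {"objects": len(objects)}
```

Because the handler only runs when data actually lands, no compute sits idle waiting for input.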
Our architecture is a fully cloud-based solution built around AWS EMR, but it goes well beyond how companies typically use EMR. The session will concentrate on building an architecture that requires no human intervention once it is up and deployed. Being a well-scripted data pipeline, the architecture is easy to replicate across projects in parallel, and it can absorb large shifts in requirements with a few lines of code changes.
Why is automation a key feature of event-driven architectures? Continuous integration of the organization's source-code repository with the pipeline saves developers the trouble of updating code in multiple locations. A smart feature of this pipeline is that it is automated end to end: the pipeline triggers automatically when the data is in place, starts processing, notifies end users with results, and releases resources at the end. This means you pay for resources only while your data is being processed, not for idle time.
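One common way to realise this pay-only-while-processing behaviour on EMR is a transient (self-terminating) cluster: set `KeepJobFlowAliveWhenNoSteps` to `False` so the cluster tears itself down after its last step. The sketch below builds such a `run_job_flow` request; the instance types, release label, roles, and S3 paths are assumptions for illustration, not the talk's exact configuration.

```python
# Sketch: launching a transient EMR cluster that terminates itself once
# its steps finish, so you pay only while data is being processed.
# All names, sizes, and paths here are illustrative assumptions.

def build_transient_emr_request(job_name, script_s3_path, log_s3_uri):
    """Build a boto3 EMR run_job_flow request for a self-terminating cluster."""
    return {
        "Name": job_name,
        "ReleaseLabel": "emr-5.17.0",
        "LogUri": log_s3_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m4.large", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m4.xlarge", "InstanceCount": 2},
            ],
            # Key setting: tear the cluster down when no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "process-charts",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", script_s3_path],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

# The event-driven trigger (e.g. a Lambda) would then submit it:
# import boto3
# boto3.client("emr").run_job_flow(
#     **build_transient_emr_request("nlp-charts", "s3://bucket/job.py", "s3://bucket/logs/"))
```

Because the request is built by code rather than clicked together in a console, replicating the pipeline for another project is a matter of changing a few parameters.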
Logging, monitoring, and notification systems are the pillars of any software architecture, so a discussion of these will be an interesting turn in the session. Logging and monitoring at a granular level in a Spark-based pipeline is a big challenge, and the great news is that we solved it.
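As a hedged sketch of what granular monitoring can look like, one common pattern is to emit a custom CloudWatch metric per pipeline stage, so each bit of data movement is visible on a dashboard and can drive alarms. The namespace, metric, and dimension names below are illustrative assumptions, not the actual scheme the talk describes.

```python
# Sketch: recording per-stage progress as custom CloudWatch metrics, so
# every bit of data movement through the pipeline can be monitored.
# Namespace, metric, and dimension names are illustrative assumptions.

import datetime

def build_stage_metric(pipeline, stage, charts_processed):
    """Build one CloudWatch put_metric_data payload for a pipeline stage."""
    return {
        "Namespace": "NlpPipeline",
        "MetricData": [
            {
                "MetricName": "ChartsProcessed",
                "Dimensions": [
                    {"Name": "Pipeline", "Value": pipeline},
                    {"Name": "Stage", "Value": stage},
                ],
                "Timestamp": datetime.datetime.utcnow(),
                "Value": float(charts_processed),
                "Unit": "Count",
            }
        ],
    }

# A monitoring hook in the Spark driver might then publish it:
# import boto3
# boto3.client("cloudwatch").put_metric_data(
#     **build_stage_metric("icd10", "extract", 500))
```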
Episource’s technical backends are lean and fast. The company can process roughly 10K charts per hour, at a cost of a few cents per chart (compared to a human coder, who can process no more than three charts per hour).
The session will sum up real-world experience with a technology stack comprising AWS services, Apache Spark, and monitoring and logging tools, along with the essentials of building an end-to-end pipeline. No prior technology experience is required to attend this session; just walk in with your data and the session will hand you the ideas to build event-driven architectures around it.
My technical background involves designing application architectures at Episource that bridge business requirements and technical requirements, especially architectures that handle big-data processing gracefully. I design architectures to optimize common quality attributes such as flexibility, scalability, security, and manageability. Specialties: AWS Cloud, big-data tools, serverless computing, Linux, C and Python, Docker, Ansible, flexibility, and multi-tasking.