Rootconf 2019

On infrastructure security, cloud architecture, cloud optimization and distributed systems

Automated hyper-scalable Big data pipelines

Submitted by Prakshi Yadav (@prakshi24) on Wednesday, 17 October 2018

Technical level

Intermediate

Section

Crisp talk of 20 mins duration

Status

Submitted

Abstract

Every organization now wants to process big data as quickly as possible. Cluster computing frameworks play a major role in such use cases, and to serve long-term goals they need data pipelines built around them. This session walks through solutions for deploying cloud-native, scalable, event-driven and fault-tolerant data pipelines.

Outline

Episource is building a scalable NLP engine that processes thousands of medical charts every day, extracting medical coding opportunities and their dependencies to recommend the best possible ICD-10 codes. The architecture supporting this engine processes a high volume of data daily. This session covers event-driven architectures that cut infrastructure provisioning costs, scale with day-to-day changes in data volume, and monitor every data movement through the pipeline.

Our architecture is a fully cloud-based solution built around AWS EMR, but it goes well beyond how companies typically use EMR. The session concentrates on building an architecture that requires no human intervention once it is deployed. Because the pipeline is fully scripted, the architecture is easy to replicate for different projects in parallel, and it can absorb large changes in requirements with only a few lines of code changes.

Continuous integration between the organization's source code repository and the pipeline saves developers the trouble of updating code in multiple locations. A smart feature of this pipeline is that it is automated end to end: it triggers automatically when the data is in place, starts processing, notifies end users of the results and releases its resources at the end. This means you pay for resources only while your data is being processed, not for idle time.
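The proposal does not include its implementation, so the following is only a minimal sketch of one way such a trigger could look. It assumes an AWS Lambda function subscribed to S3 upload events that starts a transient EMR cluster through boto3's `run_job_flow`. The bucket name, script path, release label, instance types and IAM role names are illustrative assumptions, not details from the talk. The notification step is omitted.

```python
"""Sketch: an S3 upload event starts a transient EMR cluster for one Spark step."""
from urllib.parse import unquote_plus

# Hypothetical location of the Spark job, assumed to be published there by CI.
DEFAULT_CODE_URI = "s3://example-code-bucket/jobs/process_charts.py"


def build_job_flow(bucket, key, code_uri=DEFAULT_CODE_URI):
    """Build the run_job_flow request for a cluster that processes one object."""
    input_uri = f"s3://{bucket}/{key}"
    return {
        "Name": f"pipeline-{key.replace('/', '-')}",
        "ReleaseLabel": "emr-5.20.0",  # assumed release
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [  # assumed sizing
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            # The cluster shuts down once its steps finish, so nothing idles.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            {
                "Name": "process-input",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", code_uri, input_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }


def handler(event, context, emr_client=None):
    """Lambda entry point: start one transient cluster per uploaded object."""
    if emr_client is None:
        import boto3  # imported lazily so the builder can be tested without AWS

        emr_client = boto3.client("emr")
    cluster_ids = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = emr_client.run_job_flow(**build_job_flow(bucket, key))
        cluster_ids.append(response["JobFlowId"])
    return cluster_ids
```

Setting `KeepJobFlowAliveWhenNoSteps` to `False` is what ties cost to processing time in this sketch: the cluster exists only while the step runs.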

Logging, monitoring and notification systems are pillars of any software architecture, so the session also discusses them. Logging and monitoring helped us optimize our architecture design.
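The proposal does not name the logging tools it uses. As a rough illustration of stage-level logging, the sketch below uses only Python's standard `logging` module to emit one JSON line per pipeline stage with its status and duration. The `stage` helper and the field names are hypothetical.

```python
"""Sketch: one structured log line per pipeline stage, with status and timing."""
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("pipeline")


@contextmanager
def stage(name, **fields):
    """Log a JSON record when the wrapped stage finishes or fails."""
    start = time.monotonic()
    status = "ok"
    try:
        yield
    except Exception:
        status = "failed"
        raise  # the failure still propagates to the caller
    finally:
        record = {
            "stage": name,
            "status": status,
            "seconds": round(time.monotonic() - start, 3),
            **fields,
        }
        logger.info(json.dumps(record))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    with stage("extract", batch="example-batch"):
        time.sleep(0.1)  # stand-in for real work
```

Records in this shape can be filtered by `stage` or `status` in whatever log aggregation tool is in place, which is one way to find the slow or failing steps worth optimizing.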

Episource’s technical architecture backends are lean and fast. The company can process roughly 10K charts per hour at a cost of a few cents per chart (compared to a human, who can process no more than three charts per hour).

The session sums up real-world experience with a technology stack composed of AWS services, Apache Spark, and monitoring and logging tools, along with the essentials of building an end-to-end pipeline.

Requirements

No prior technology experience is required to attend this session. Just walk in with your data, and the session will hand you the ideas to build event-driven architectures around it.

Speaker bio

My technical background involves designing application architectures at Episource that bridge business requirements and technical requirements, especially architectures that handle Big Data processing gracefully. I design architectures to optimize common quality attributes such as flexibility, scalability, security and manageability. Specialties: AWS Cloud, Big Data tools, serverless computing, Linux, C and Python, Docker, Ansible, flexibility and multi-tasking.
