Automated hyper-scalable Big data pipelines

## About Rootconf:

Rootconf is HasGeek’s annual conference -- and now a growing community -- around DevOps, systems engineering, DevSecOps, security and cloud. The annual Rootconf conference takes place in May each year, with the exception of 2019 when the conference will be held in June.

Besides the annual conference, we also run meetups, one-off public lectures, debates and open houses on DevOps, systems engineering, distributed systems, legacy infrastructure, and topics related to Rootconf.

This is the place to submit proposals for your work, and get them peer reviewed by practitioners from the community.

## Topics for submission:

We seek proposals -- for short and long talks, as well as workshops and tutorials -- on the following topics:

Case studies of shift from batch processing to stream processing
Real-life examples of service discovery
Case studies on move from monolith to service-oriented architecture
Micro-services
Network security
Monitoring, logging and alerting -- running small-scale and large-scale systems
Cloud architecture -- implementations and lessons learned
Optimizing infrastructure
SRE
Immutable infrastructure
Aligning people and teams with infrastructure at scale
Security for infrastructure

## Contact us:

If you have questions, write to us at rootconf.editorial@hasgeek.com

Automated hyper-scalable Big data pipelines

Submitted Oct 18, 2018

Section: Crisp talk
Technical level: Intermediate

Every organization that handles big data wants to process it as quickly as possible. Cluster computing frameworks play a major role in such workloads, and to serve long-term goals they need data pipelines built around them. This session walks through solutions for deploying cloud-native, scalable, event-driven, and fault-tolerant data pipelines.

Outline

Episource is building a scalable NLP engine that processes thousands of medical charts every day, extracting medical coding opportunities and their dependencies to recommend the best possible ICD-10 codes. The architecture supporting this engine processes a high volume of data daily. This session covers event-driven architectures that reduce infrastructure provisioning costs, scale to meet data volumes that change from day to day, and monitor every data movement through the pipeline. The talk will answer the following questions:

Why is there a shift to Event-Driven Architectures?
Where do I start with Event-Driven Architecture?
What are the building blocks of Event-Driven Architectures?
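
As a rough illustration of the last question, an event-driven system can be reduced to three pieces: producers that emit events, a router that matches events to subscribers, and consumers that react. The sketch below is an in-process toy, not the Episource design; `EventBus`, the `chart.uploaded` event name, and the handler are hypothetical names chosen for the example.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Handler = Callable[[Dict[str, Any]], None]


class EventBus:
    """Toy in-process event router (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        # Consumers register interest in an event type.
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: Dict[str, Any]) -> int:
        # Producers emit events; the bus fans them out to every subscriber
        # and reports how many handlers were invoked.
        handlers = self._handlers.get(event_type, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


bus = EventBus()
processed: List[str] = []

# Consumer: reacts when a chart arrives instead of polling for work.
bus.subscribe("chart.uploaded", lambda event: processed.append(event["chart_id"]))

# Producer: announces that new data is in place.
delivered = bus.publish("chart.uploaded", {"chart_id": "chart-001"})
```

In a cloud deployment the router would typically be a managed service rather than a Python object, but the division of roles stays the same.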

Our architecture is a fully cloud-based solution built around AWS EMR, but it goes well beyond how companies typically use EMR. The session will concentrate on building an architecture that requires no human intervention once it is deployed. Because the data pipeline is fully scripted, it is easy to replicate the architecture for different projects in parallel, and it can handle large shifts in requirements with only a few lines of code changes.
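
One way to make a pipeline replicable across projects is to keep everything project-specific in a small config object and derive the cluster request from it. The sketch below builds keyword arguments in the shape accepted by boto3's EMR `run_job_flow` call, without calling AWS. The `PipelineConfig` class, bucket paths, instance type, release label, and the use of the default EMR role names are assumptions for illustration, not Episource's actual settings.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class PipelineConfig:
    """Per-project settings (all values used here are hypothetical)."""
    project: str
    script_uri: str                    # S3 location of the Spark job
    input_uri: str                     # S3 prefix holding the data to process
    worker_count: int = 2
    instance_type: str = "m5.xlarge"
    release_label: str = "emr-5.36.0"


def build_job_flow_request(cfg: PipelineConfig) -> Dict[str, Any]:
    """Return kwargs in the shape expected by boto3 emr.run_job_flow."""
    return {
        "Name": f"{cfg.project}-pipeline",
        "ReleaseLabel": cfg.release_label,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": cfg.instance_type,
            "SlaveInstanceType": cfg.instance_type,
            "InstanceCount": cfg.worker_count + 1,  # workers plus one master
            # Transient cluster: shut down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "process-input",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         cfg.script_uri, cfg.input_uri],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_job_flow_request(PipelineConfig(
    project="charts",
    script_uri="s3://example-bucket/jobs/process.py",
    input_uri="s3://example-bucket/incoming/",
))
```

Under these assumptions, starting a second project means constructing another `PipelineConfig`; passing the result to `boto3.client("emr").run_job_flow(**request)` would launch the cluster.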

Why is automation a key feature of Event-Driven Architectures?
Continuous integration between the organization’s source code repository and the pipeline saves developers the trouble of updating code in multiple locations. The pipeline is automated end to end: it triggers automatically when the data is in place, processes it, notifies end users of the results, and releases its resources at the end. This means you pay for resources only while your data is being processed, not for idle time.
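
The trigger, process, notify, release cycle can be sketched as a single function whose `finally` clause guarantees that resources are released even when processing fails. The callables are injected so the sketch runs without any cloud account; `run_pipeline` and its parameters are hypothetical names, and in a real deployment they would wrap cluster provisioning, a Spark job, and a notification service.

```python
from typing import Callable, List


def run_pipeline(
    data_uri: str,
    provision: Callable[[], str],
    process: Callable[[str, str], str],
    notify: Callable[[str], None],
    release: Callable[[str], None],
) -> None:
    """Run one pipeline cycle for data that has just arrived."""
    cluster_id = provision()              # resources exist only from here on
    try:
        result = process(cluster_id, data_uri)
        notify(f"SUCCESS {data_uri}: {result}")
    except Exception as exc:
        notify(f"FAILED {data_uri}: {exc}")
        raise
    finally:
        release(cluster_id)               # always stop paying once work ends


# Stand-in implementations that only record what happened.
log: List[str] = []


def fake_provision() -> str:
    log.append("provision")
    return "cluster-1"


def fake_process(cluster_id: str, data_uri: str) -> str:
    log.append(f"process on {cluster_id}")
    return "42 charts"


run_pipeline(
    "s3://example-bucket/incoming/batch-1/",
    provision=fake_provision,
    process=fake_process,
    notify=lambda message: log.append(f"notify {message}"),
    release=lambda cluster_id: log.append(f"release {cluster_id}"),
)
```

Putting the release step in `finally` is the design choice that matters here: a failed job still triggers a notification and still frees the cluster, so errors do not turn into idle-time cost.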

Logging, monitoring, and notification systems are pillars of any software architecture, so the session will also cover them. Logging and monitoring at a granular level in a Spark-based pipeline is a big challenge, and the great news is that we have solved it.
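
The proposal does not say how granular logging was achieved. One common approach, shown here only as an assumption, is to emit one JSON object per log line so that each pipeline stage can be filtered and aggregated later. The sketch uses Python's standard `logging` module; the field names `stage`, `batch_id`, and `records` are hypothetical.

```python
import json
import logging
from typing import Any, Dict


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry: Dict[str, Any] = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("stage", "batch_id", "records"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry, sort_keys=True)


def make_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:             # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


logger = make_logger("pipeline")
logger.info("stage finished",
            extra={"stage": "extract", "batch_id": "batch-1", "records": 1200})
```

In a PySpark job the same logger could be used on the driver around each stage; executor-side logging needs separate configuration and is outside this sketch.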

Episource’s technical architecture backends are lean and fast. The company can process roughly 10K charts per hour at a cost of a few cents per chart (compared to a human, who can process no more than three charts per hour).
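
To put those figures in perspective, the short calculation below compares the two throughput numbers quoted above. The per-chart prices in the loop are hypothetical, since the proposal only says "a few cents".

```python
PIPELINE_CHARTS_PER_HOUR = 10_000   # figure quoted in the proposal
HUMAN_CHARTS_PER_HOUR = 3           # upper bound quoted in the proposal


def human_equivalents(pipeline_rate: float, human_rate: float) -> float:
    """Number of people needed to match the pipeline's hourly throughput."""
    return pipeline_rate / human_rate


def hourly_cost_usd(charts_per_hour: int, cents_per_chart: float) -> float:
    """Pipeline cost for one hour at a given per-chart price."""
    return charts_per_hour * cents_per_chart / 100


speedup = human_equivalents(PIPELINE_CHARTS_PER_HOUR, HUMAN_CHARTS_PER_HOUR)
print(f"~{speedup:,.0f} human coders per pipeline-hour")

# Hypothetical price points, not figures from the proposal.
for cents in (1, 3, 5):
    cost = hourly_cost_usd(PIPELINE_CHARTS_PER_HOUR, cents)
    print(f"{cents} cents/chart -> ${cost:,.2f} per hour")
```

At the quoted rates, one hour of pipeline time corresponds to the output of roughly 3,333 people working at the human maximum.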

The session will sum up real-world experience with a technology stack comprising AWS services, Apache Spark, and monitoring and logging tools, along with the essentials of building an end-to-end pipeline.

Requirements

No prior technology experience is required to attend this session. Just walk in with your data, and the session will hand you the ideas to build event-driven architectures around it.

Speaker bio

My technical background involves designing application architectures that bridge business requirements and technical requirements at Episource, especially architectures that handle Big Data processing gracefully. I design architectures to optimize common quality attributes such as flexibility, scalability, security, and manageability. Specialties: AWS Cloud, Big Data tools, serverless computing, Linux, C and Python, Docker, Ansible, flexibility, and multi-tasking.

Slides

https://docs.google.com/presentation/d/14Z_ecmjgd4bOOLL2XlOHqYGDMAdOYaWPZkSiXjRs50Y/edit?usp=sharing

Accepting submissions till 31 Dec 2020, 12:00 PM

Call for round the year submissions for Rootconf in 2020
