Rootconf 2017

On service reliability

Imran Basha

@syedimranbasha

AWS Simple Workflow Service as an architectural solution for building Distributed Scalable Background Scheduled Jobs

Submitted Feb 6, 2017

This talk presents Amazon Simple Workflow Service (SWF) as the infrastructure for building distributed, scalable, scheduled background jobs. We will show how to replace a Cron-based workflow with SWF.

Key takeaways
What is a Cron-based workflow?
Issues with Cron-based workflows
What is SWF?
How to replace a Cron-based workflow with SWF
How to monitor SWF-based workflows using AWS CloudWatch

Outline

Problem statement

We need background jobs that run on multiple clusters on a daily schedule. These jobs process millions of records every day. To balance the load, we divide the data across all the clusters and then trigger the jobs so that each one runs on its own cluster's portion of the data. This is explained in Slide #3 of the video. Previously, we spawned worker processes on the cluster machines from a master machine where the crontab files were set up. This is a typical requirement for running background jobs in a distributed setup.
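The data split described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `partition_records`, the integer record IDs, and the cluster names are hypothetical, and the talk does not say how the real split is computed.

```python
from typing import Dict, List, Sequence


def partition_records(record_ids: Sequence[int],
                      clusters: Sequence[str]) -> Dict[str, List[int]]:
    """Split record IDs into near-equal contiguous slices, one per cluster."""
    if not clusters:
        raise ValueError("at least one cluster is required")

    # Every cluster gets `base` records; the first `extra` clusters get one more.
    base, extra = divmod(len(record_ids), len(clusters))

    assignment: Dict[str, List[int]] = {}
    start = 0
    for index, cluster in enumerate(clusters):
        size = base + (1 if index < extra else 0)
        assignment[cluster] = list(record_ids[start:start + size])
        start += size
    return assignment


if __name__ == "__main__":
    # Hypothetical cluster names, used only for illustration.
    print(partition_records(range(10), ["cluster-a", "cluster-b", "cluster-c"]))
```

Each slice could then be handed to the job running on that cluster as its input range.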

Issues with Cron based scheduling

  1. Lack of failure handling
  2. Lost tasks
  3. Scale
  4. Not an option on a shared hosting setup
  5. Single point of failure

How SWF helped in our particular use case

Cron is good for running a job on a single machine on a schedule, but distributed execution in a cluster setup needs coordination, failure handling, and scalability, none of which comes out of the box with Cron-based solutions. SWF helps create a distributed workflow that we can run at scheduled intervals. The workflow submits tasks to worker processes running on individual machines, which pick up the tasks and execute them.
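A worker process that picks up tasks might look roughly like the Python sketch below. It assumes a client object that behaves like `boto3.client("swf")` and uses that client's `poll_for_activity_task`, `respond_activity_task_completed`, and `respond_activity_task_failed` operations. The function name `process_one_task`, the JSON task input, and the domain and task list names are hypothetical; the talk does not show the actual worker code.

```python
import json
from typing import Any, Callable


def process_one_task(swf: Any, domain: str, task_list: str,
                     handler: Callable[[dict], dict]) -> bool:
    """Poll once for an activity task and report the outcome back to SWF.

    `swf` is assumed to behave like boto3.client("swf").
    Returns False when the long poll ends without a task, True otherwise.
    """
    task = swf.poll_for_activity_task(domain=domain,
                                      taskList={"name": task_list})
    token = task.get("taskToken")
    if not token:
        # The long poll timed out with nothing to do.
        return False

    try:
        # Assumption: the workflow passes the task input as a JSON object.
        payload = json.loads(task.get("input") or "{}")
        result = handler(payload)
    except Exception as exc:
        # Report the failure so the workflow can decide whether to retry.
        swf.respond_activity_task_failed(taskToken=token,
                                         reason=type(exc).__name__,
                                         details=str(exc))
    else:
        swf.respond_activity_task_completed(taskToken=token,
                                            result=json.dumps(result))
    return True
```

A real worker would call this in a loop, for example `while True: process_one_task(boto3.client("swf"), "jobs-domain", "daily-jobs", handler)`, where both names are placeholders. Passing the client in as a parameter keeps the function testable without AWS credentials.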

Benefits of using SWF

  1. SWF takes care of coordination between the workflow and the worker processes
  2. The architecture becomes scalable because state management is owned by SWF and the workflow and worker processes are loosely coupled
  3. SWF is a better way of handling distributed execution because it provides the Flow Framework for managing issues in distributed applications, such as failures and retries
  4. The solution has a clear separation of concerns
  5. If any machine goes down, its load is automatically transferred to the other machines
  6. SWF provides an end-to-end solution, including monitoring metrics
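The failover point in the list above follows from workers pulling tasks rather than having tasks pushed to them. The Python sketch below is an in-memory simulation of that idea, not SWF itself: the function name `drain_task_list`, the worker names, and the round-robin polling order are all assumptions made for illustration.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def drain_task_list(tasks: Iterable[str], workers: List[str],
                    down: Set[str]) -> Dict[str, List[str]]:
    """Simulate workers pulling tasks from one shared task list.

    Workers listed in `down` never poll, so they never take a task and
    the healthy workers absorb the whole load.
    """
    healthy = [w for w in workers if w not in down]
    if not healthy:
        raise RuntimeError("no healthy workers are polling the task list")

    pending = deque(tasks)
    done: Dict[str, List[str]] = {w: [] for w in workers}
    turn = 0
    while pending:
        # Healthy workers take turns polling; round-robin is a simplification.
        worker = healthy[turn % len(healthy)]
        done[worker].append(pending.popleft())
        turn += 1
    return done


if __name__ == "__main__":
    print(drain_task_list(["a", "b", "c", "d"], ["m1", "m2", "m3"], {"m2"}))
```

With `m2` marked as down, its share of the tasks ends up on `m1` and `m3` without any change to how the tasks were submitted.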

This way we end up with a loosely coupled, highly scalable, distributed solution in which coordination and state management are taken care of by Amazon SWF.

SWF should not be described as a job scheduler. A better description is that SWF provides the framework and services needed to create a distributed workflow that can be executed on multiple machines in a loosely coupled and highly scalable manner.

I have attempted to explain the above in the submitted video and would be happy to answer any follow-up questions.

Speaker bio

I am Imran. I have been working at Intuit for 6 years and have around 13 years of experience overall. I have been fortunate to explore and contribute across both the breadth and the depth of technologies, primarily in full-stack web application development: .NET using Web API, Java based on Jersey, and SPA development with Backbone, Marionette, and React + Relay + GraphQL. I am a technology enthusiast. I was the architect involved in migrating a Cron-based workflow to AWS SWF. I learned a lot on the journey from an unreliable Cron-based infrastructure to a reliable, distributed, and highly scalable architecture based on AWS SWF, and I want to share those learnings so that they can benefit others.

Slides

https://github.com/syedimranbasha/simpleworkflowservice/blob/master/Architecture - RootConf.pptx

