The Fifth Elephant 2016

India's most renowned data science conference

Shubhadit Sharma

@shubhadit

Data pipelines - Cakewalk with Docker and Luigi

Submitted Apr 30, 2016

Modern data-driven products are powered by pipelines of data processing tasks. Building this infrastructure requires a lot of boilerplate code. Moreover, deploying these tasks consistently across environments, from development to production, while maintaining resource isolation can lengthen development cycles. Maintaining different versions of datasets and tracking how your model improves across those versions can become tedious very quickly.

Enter Luigi and Docker

Luigi acts as an orchestration layer, defining dependencies between tasks. Pipelines are containerized to make them portable, isolated, and easy to monitor.
Anyone who wants to build a data-driven product at scale, without limiting their team to a single programming language, will have something to take away.

Outline

  • Problems with the current way of building data pipelines
  • Introduction to Luigi and Docker
  • General architecture and flow of data in the system
  • Intricacies of machine learning in fintech
    • Handling sensitive customer data

Speaker bio

I got my first computer in 1998 and have been tinkering with code ever since: from writing my first program in C 17 years ago to now building highly scalable systems at Finomena, a Bangalore-based, data-driven credit-underwriting fintech startup.

