The Fifth Elephant 2020 edition

On data governance, engineering for data privacy and data science

Srimathi H

@shrimats

Taming the Data Elephant (aka) Productionizing Data Science!

Submitted May 31, 2020

Productionizing data science and getting actionable insights from raw data requires organizations to efficiently build, operate, and manage complex, large-scale data platforms. To productionize ML models and achieve business value, it is important to develop models iteratively and to test and deploy them on top of a robust platform infrastructure.

A fully automated platform lets us manage the life cycle of ML models through to production with greater reliability and predictability. The platform should ensure that every commit is deployable in an automated fashion, persist the results of versioned models in an auditable, readable way, and handle scheduling and any dependent workflows.
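As a rough illustration of what "persisting the results of versioned models in an auditable way" could look like, here is a minimal sketch. It is not the platform described in the talk. The function names (`version_id`, `record_run`), the record fields, and the in-memory list standing in for an audit log are all hypothetical, and it assumes a model version can be derived from a code commit plus a fingerprint of the training data.

```python
import hashlib
import json


def version_id(commit_sha: str, data_fingerprint: str) -> str:
    """Derive a deterministic model version from code and data identifiers.

    Assumption: the same commit trained on the same data is the same version.
    """
    payload = f"{commit_sha}:{data_fingerprint}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def record_run(audit_log: list, commit_sha: str, data_fingerprint: str, metrics: dict) -> dict:
    """Append one human-readable JSON line per model run to the audit log."""
    entry = {
        "version": version_id(commit_sha, data_fingerprint),
        "commit": commit_sha,
        "data": data_fingerprint,
        "metrics": metrics,
    }
    # sort_keys keeps the serialized lines stable and easy to diff
    audit_log.append(json.dumps(entry, sort_keys=True))
    return entry


# Hypothetical usage: two runs of the same commit on different weekly data
audit_log = []
first = record_run(audit_log, "abc123", "week-01", {"rmse": 0.42})
second = record_run(audit_log, "abc123", "week-02", {"rmse": 0.40})
```

Because the version is derived from both inputs, a re-run of the same commit on the same data maps to the same version, while new data produces a new, traceable one.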

Training models in production on terabytes of data is expensive: training runs are long, failures are more likely, and re-runs add further cost. Incremental training is a way to mitigate some of these costs while keeping an efficient model in production. In this talk, I will break down how we built a terabyte-scale, extensible, and programmable data platform to enable continuous data-driven insights, and how we ‘tamed the beast’ to run data science at scale. I will also cover an example of incremental training: how we migrated from a model trained over a large time series dataset to an incremental model trained on weekly data.
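To make the idea of incremental training concrete, here is a small sketch. It is not the model from the talk. It assumes a one-feature least-squares fit whose state can be kept as running sums, so each weekly batch is folded in without revisiting older data. The class `IncrementalLinearModel`, the helper `weekly_batches`, and the synthetic data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IncrementalLinearModel:
    """One-feature least-squares fit kept as running sums (illustrative only)."""
    n: int = 0
    sum_x: float = 0.0
    sum_y: float = 0.0
    sum_xx: float = 0.0
    sum_xy: float = 0.0

    def partial_fit(self, xs, ys):
        """Fold one batch (for example, one week of data) into the running sums."""
        for x, y in zip(xs, ys):
            self.n += 1
            self.sum_x += x
            self.sum_y += y
            self.sum_xx += x * x
            self.sum_xy += x * y
        return self

    def coefficients(self):
        """Return (slope, intercept) of the least-squares line over all data seen."""
        denom = self.n * self.sum_xx - self.sum_x ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom
        intercept = (self.sum_y - slope * self.sum_x) / self.n
        return slope, intercept


def weekly_batches(xs, ys, size=7):
    """Yield consecutive chunks of `size` points, standing in for weekly data."""
    for start in range(0, len(xs), size):
        yield xs[start:start + size], ys[start:start + size]


# Synthetic daily series: y = 2x + 1 over four weeks
xs = [float(day) for day in range(28)]
ys = [2.0 * x + 1.0 for x in xs]

# Full retrain over the whole history
full = IncrementalLinearModel().partial_fit(xs, ys)

# Incremental training: one update per week, no access to earlier weeks needed
incremental = IncrementalLinearModel()
for week_x, week_y in weekly_batches(xs, ys):
    incremental.partial_fit(week_x, week_y)
```

In this toy case the incremental model matches the full retrain, because the running sums capture everything the fit needs. Real models rarely have such compact state, so incremental updates usually trade some accuracy for the savings in time and cost.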

Outline

The talk will cover the following topics:

  1. Lifecycle/Stages in productionizing an ML Model
  2. Incrementally training the models to save on cost and time taken to run the models in production
  3. Underlying platform infrastructure for deploying models to production.
    • Model persistence - Versioning of model and data
    • Data Lineage
    • Orchestrating and Monitoring workflows
  4. Impact of data volume/variety/veracity on the models
  5. Continuous monitoring of the model outputs and its predicted business metrics for accuracy over time
  6. Ease of business use of the model outputs - Reusability and Adaptiveness of generating insights and enabling business decisions
  7. Data Engineering and Data Science collaboration
    • Data Engineering to enable data scientists to deploy models in production
    • Focus on business value, iterative development and automation

Speaker bio

Srimathi is a software engineer with over 13 years of experience in building products that deliver measurable customer value. At Sahaj, she is currently part of a team building a cost-effective, terabyte-scale, extensible, and programmable data platform in the advertising space. She has previously worked with Thoughtworks, Oracle, and Dell.
