The Fifth Elephant 2020 edition
On data governance, engineering for data privacy and data science
Srimathi H
Productionizing data science and getting actionable insights from raw data requires organizations to efficiently build, operate, and manage complex, large-scale data platforms. To productionize ML models and achieve business value, it is important to develop models iteratively and to test and deploy them on top of a robust platform infrastructure.
A fully automated platform enables us to manage the life cycle of ML models through to production with greater reliability and predictability. The platform should ensure that every commit is deployable in an automated fashion, persist the results of versioned models in an auditable and readable way, and handle scheduling and any dependent workflows.
Training models in production on terabytes of data is costly in terms of training time, cost, the risk of failure, and the cost of re-runs. Incremental training is a way to mitigate some of these costs while keeping an efficient model in production. In this talk, I will break down how we built a terabyte-scale, extensible, and programmable data platform to enable continuous data-driven insights, and how we ‘tamed the beast’ to run data science at scale. I will also cover an example of incremental training: how we migrated from a model trained over a large time series dataset to an incremental model trained on weekly data.
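To make the idea of incremental training concrete, here is a minimal sketch in Python. It is not the model or platform described in the talk: the class `IncrementalLinearModel`, its method names, and the toy weekly batches are all hypothetical, and a simple one-variable least-squares fit stands in for whatever model is actually used. The point it illustrates is that a model can keep a small running state and fold in each new week of data, instead of re-reading the full history on every run.

```python
from dataclasses import dataclass


@dataclass
class IncrementalLinearModel:
    """Hypothetical example: least-squares line fit from running sums.

    Only five numbers are stored, so each weekly update costs time
    proportional to that week's data, not to the full history.
    """

    n: int = 0
    sum_x: float = 0.0
    sum_y: float = 0.0
    sum_xx: float = 0.0
    sum_xy: float = 0.0

    def partial_fit(self, xs, ys):
        # Fold one batch (for example, one week of data) into the state.
        for x, y in zip(xs, ys):
            self.n += 1
            self.sum_x += x
            self.sum_y += y
            self.sum_xx += x * x
            self.sum_xy += x * y
        return self

    def coefficients(self):
        # Closed-form slope and intercept from the accumulated sums.
        denom = self.n * self.sum_xx - self.sum_x ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sum_xy - self.sum_x * self.sum_y) / denom
        intercept = (self.sum_y - slope * self.sum_x) / self.n
        return slope, intercept


# Toy data, invented for illustration: three "weeks" of (x, y) points.
weeks = [
    ([0, 1, 2], [1.0, 3.1, 4.9]),
    ([3, 4, 5], [7.2, 9.0, 10.8]),
    ([6, 7, 8], [13.1, 15.0, 17.2]),
]

# Incremental path: update the same model once per week.
incremental = IncrementalLinearModel()
for week_xs, week_ys in weeks:
    incremental.partial_fit(week_xs, week_ys)

# Full-retrain path: fit a fresh model on the whole history at once.
all_xs = [x for week_xs, _ in weeks for x in week_xs]
all_ys = [y for _, week_ys in weeks for y in week_ys]
full = IncrementalLinearModel().partial_fit(all_xs, all_ys)

inc_slope, inc_intercept = incremental.coefficients()
full_slope, full_intercept = full.coefficients()
```

For this particular model the incremental and full-retrain fits agree, because the running sums capture everything the fit needs. Most production models do not have that property, so an incremental version usually trades some accuracy or adds a periodic full retrain; the sketch only shows the shape of the update loop.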
The talk will cover the following topics:
Srimathi is a software engineer with over 13 years of experience in building products that deliver measurable customer value. At Sahaj, she is currently part of a team building a cost-effective, terabyte-scale, extensible, and programmable data platform in the advertising space. She has previously worked with Thoughtworks, Oracle, and Dell.