Submissions for MLOps November edition

On ML workflows, tools, automation and running ML in production

This project is accepting submissions for the November edition of the MLOps conference.

The first edition of the MLOps conference was held on 23, 24 and 27 July. Details about the conference, including videos and blog posts, are published at https://hasgeek.com/fifthelephant/mlops-conference/

Contact information: For inquiries, contact The Fifth Elephant at fifthelephant.editorial@hasgeek.com or call 7676332020.

Hosted by

The Fifth Elephant - known as one of the best data science and Machine Learning conferences in Asia - has transitioned into a year-round forum for conversations about data and ML engineering, data science in production, and data security and privacy practices.

Aaditya Talwai

@talwai

ML Governance from the Bottom-Up: Deriving Data Access Policy from Code through Ethical Monkey-Patching

Submitted Jul 8, 2021

Practical implementations of Data Governance tend to enforce access control at the datastore level - think ACLs for S3, Snowflake or HDFS. But top-down enforcement of an infrastructure policy can be painful for the engineers working day-to-day with the data, especially in an ETL or Feature Engineering context. For example, critical data needed for extracting features can become obscured or even absent. Or an intermediate materialization of data can suddenly become non-compliant, requiring a pipeline rewrite. To avoid these situations, policy authors often undertake a “survey” prior to enforcement, where query logs and static analysis help establish the footprint of a datastore before access controls are enabled. While this approach sufficiently captures the who and what of data access, it fails to capture the how: crucial information about how the data asset is used and how downstream products depend on it.

We propose a framework for informing Data Governance “bottom-up” from data engineering code, using only open-source tools. By instrumenting code to log data access and transformations at runtime, the “survey” phase of implementing Data Governance can be almost completely automated. By leveraging frameworks for “ethical monkey-patching”, i.e. changing the definition of symbols at program initialization, one can extract this metadata without imposing a new hygiene requirement on engineers. We’ll go through a practical example of instrumenting a production data pipeline that depends on pandas, S3 and Snowflake, without changing a single line of application code. The result of this instrumentation is a metadata graph, and we will show how a collection of such graphs can be aggregated and queried to inform a granular, developer-friendly Data and ML Governance policy.
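As a rough illustration of the idea (not the talk's actual framework, which is not described here), the sketch below monkey-patches the I/O functions of a small stand-in library so that every read and write is logged at runtime, then derives read-to-write edges for a metadata graph. Everything in it is a hypothetical assumption: the `fake_io` namespace stands in for a library such as pandas, and `instrument`, `ACCESS_LOG`, `pipeline` and `lineage_edges` are names invented for this example. It also assumes the URI is always passed positionally.

```python
import functools
import types

# Hypothetical stand-in for a data library such as pandas; not a real API.
fake_io = types.SimpleNamespace(
    read_table=lambda uri: [{"user_id": 1, "spend": 10.0}],
    write_table=lambda rows, uri: len(rows),
)

ACCESS_LOG = []  # (operation, uri) tuples recorded at runtime


def instrument(namespace, name, operation, uri_arg):
    """Replace namespace.<name> with a wrapper that logs the URI it touches.

    Assumes the URI is passed positionally at index `uri_arg`.
    """
    original = getattr(namespace, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        ACCESS_LOG.append((operation, args[uri_arg]))
        return original(*args, **kwargs)

    setattr(namespace, name, wrapper)
    return original  # returned so the caller can restore it later


def pipeline():
    """Unmodified 'application code': it only calls the library functions."""
    rows = fake_io.read_table("s3://raw/events.csv")
    features = [
        {"user_id": r["user_id"], "spend_x2": r["spend"] * 2} for r in rows
    ]
    return fake_io.write_table(features, "snowflake://analytics/features")


def lineage_edges(log):
    """Turn an access log into (source, destination) edges of a metadata graph."""
    reads = [uri for op, uri in log if op == "read"]
    writes = [uri for op, uri in log if op == "write"]
    return [(src, dst) for src in reads for dst in writes]


# Patch at "program initialization", then run the pipeline untouched.
instrument(fake_io, "read_table", "read", uri_arg=0)
instrument(fake_io, "write_table", "write", uri_arg=1)
pipeline()
edges = lineage_edges(ACCESS_LOG)
```

The pipeline body never mentions the instrumentation; the patch is applied once before it runs, which is the property the abstract describes. Linking every read to every write is a deliberately coarse simplification; a real metadata graph would likely track which inputs actually flow into which outputs.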
