ML Governance from the Bottom-Up: Deriving Data Access Policy from Code through Ethical Monkey-Patching
Practical implementations of Data Governance tend to enforce access control at the datastore level: think ACLs for S3, Snowflake or HDFS. But top-down enforcement of an infrastructure policy can be painful for the engineers working day-to-day with the data, especially in an ETL or Feature Engineering context. For example, critical data needed for extracting features can become obscured or even absent. Or an intermediate materialization of data can suddenly become non-compliant, requiring a pipeline rewrite. To avoid these situations, policy authors often undertake a “survey” prior to enforcement, where query logs and static analysis help establish the footprint of a datastore before access controls are enabled. While this approach adequately captures the who and what of data access, it fails to capture the how: crucial information about how the data asset is used and how downstream products depend on it.
We propose a framework for informing Data Governance “bottom-up” from data engineering code, using only open-source tools. By instrumenting code to log data access and transformations at runtime, the “survey” phase of implementing Data Governance can be almost completely automated. By leveraging frameworks for “ethical monkey-patching”, i.e. changing the definition of symbols at program initialization, one can extract this metadata without imposing a new hygiene requirement on engineers. We’ll go through a practical example of instrumenting a production data pipeline that depends on pandas, S3 and Snowflake, without changing a single line of application code. The result of this instrumentation is a metadata graph, and we will show how a collection of such graphs can be aggregated and queried to inform a granular, developer-friendly Data and ML Governance policy.
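To make the idea concrete, here is a minimal sketch of runtime instrumentation by monkey-patching, using only the Python standard library. It is not the framework presented in the talk. The names `instrument`, `EVENTS`, `fake_io` and `datasets_by_pipeline` are hypothetical, and `fake_io` is a stand-in module so the example runs anywhere; in a real pipeline the patched symbols would more plausibly be reader and writer functions of a library such as pandas, patched before application code is imported. The sketch also assumes that the first positional argument of each patched function names the dataset.

```python
import functools
import types
from collections import defaultdict

# Hypothetical in-memory sink; a real setup would ship events to a metadata store.
EVENTS = []


def instrument(module, name, kind, pipeline):
    """Replace module.<name> with a wrapper that logs each call, then delegates."""
    original = getattr(module, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        # Assumption: the first positional argument identifies the dataset.
        dataset = args[0] if args else None
        EVENTS.append({
            "pipeline": pipeline,
            "kind": kind,
            "dataset": dataset,
            "op": f"{module.__name__}.{name}",
        })
        return original(*args, **kwargs)

    setattr(module, name, wrapper)
    return original  # returned so the patch can be undone later


# Stand-in for a data library (hypothetical; not a real package).
fake_io = types.ModuleType("fake_io")
fake_io.read_table = lambda uri: [{"uri": uri}]
fake_io.write_table = lambda uri, rows: len(rows)

# Patch once, at program initialization.
instrument(fake_io, "read_table", "read", pipeline="feature_job")
instrument(fake_io, "write_table", "write", pipeline="feature_job")

# "Application code" stays unchanged: it calls the library as usual.
rows = fake_io.read_table("s3://example-bucket/events")
fake_io.write_table("snowflake://example_db/features", rows)


def datasets_by_pipeline(events):
    """Aggregate logged events into a pipeline -> {read, write} dataset mapping."""
    graph = defaultdict(lambda: {"read": set(), "write": set()})
    for event in events:
        graph[event["pipeline"]][event["kind"]].add(event["dataset"])
    return graph


graph = datasets_by_pipeline(EVENTS)
```

Querying `graph["feature_job"]` then lists which datasets that pipeline reads and writes, which is the kind of usage information a policy author could consult before enabling access controls. A fuller metadata graph would also record the transformations applied between a read and a write.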