Data compliance, privacy and security can be a long journey if one doesn’t have any systems in place. A good starting point is three simple questions:
- Where is the data?
- Who has access to data?
- How is the data used?
To understand where data is, one needs a data catalog, the ability to recognize sensitive data, and data lineage to tag every dataset within the catalog. Getting a list of who has access to which data is fairly simple, since most data warehouses and databases maintain a table with this information. Data usage can be analyzed through usage logs and query history across all databases, or through the workloads running on the databases and data lakes.
Data governance can be understood by defining the outcomes expected from the data governance function, such as:
- Compliance: Data lifecycle and usage in accordance with laws and regulations
- Privacy: Protect data as per regulations and user expectations
- Security: Data and data infrastructure are adequately protected
Implementing data governance measures can be hard for the following reasons:
- There is too much data: When one has a lot of data and the ability to join different datasets, it is hard to know what insights can be derived from them. The volume of data being generated keeps growing, which leads to more data being shared, not just among private citizens but also with companies. So one cannot be sure what kind of insights are being gained from all this data, or whether they are harmful.
- There is too much complexity: The trend in the industry is to build a product for every niche, so a data platform is now assembled from many infrastructure pieces, each solving one niche. A typical stack has 8-10 components, and each component needs to be protected to the same level. This becomes hard because different open-source projects and commercial products have different capabilities when it comes to compliance, privacy and security.
- There is no context for data usage: Analytics, data science and AI have objectives that compete with compliance, privacy and security. The former require broad access to data, whereas the latter are about restricting access to data. Since neither extreme is practical, one needs a much more nuanced approach to granting access, as well as to monitoring and auditing the usage of data. One also needs information on who is using the data, the purpose of use, etc., to make intelligent decisions about whether the data is being used properly.
To get started with the data governance process, one needs to answer the following questions:
- Where is the data?
- Who has access to data?
- How is the data used?
The answers to these questions can be narrowed down to focus on sensitive data instead of all data; the definition of sensitive data differs between companies and sectors. Understanding where sensitive data is requires three capabilities. The first is a data catalog, which stores information, or metadata, about datasets. The second is the ability to scan databases and recognize sensitive data. The third is the ability to capture data lineage so that every single dataset in the catalog can be tagged. One can build a lineage graph from query history and use graph algorithms to track and tag columns with sensitive data in the data catalog; this graph can be visualized using Python libraries.
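As a minimal sketch of the recognition step, the snippet below scans sampled column values against regex patterns for common PII types and tags matching columns in a catalog. The catalog structure and the `scan_table()` input format are illustrative assumptions, not the API of any particular product.

```python
import re

# Regexes for a few common PII types; a real scanner would use a
# richer library of patterns and validators.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s()-]{8,}\d"),
    "ssn":   re.compile(r"\d{3}-\d{2}-\d{4}"),
}

def classify_column(sample_values, threshold=0.8):
    """Return a PII type if most sampled values match its pattern."""
    for pii_type, pattern in PII_PATTERNS.items():
        matches = sum(bool(pattern.fullmatch(str(v))) for v in sample_values)
        if sample_values and matches / len(sample_values) >= threshold:
            return pii_type
    return None

def scan_table(table_name, column_samples, catalog):
    """Tag sensitive columns of one table in a (hypothetical) catalog dict."""
    for column, samples in column_samples.items():
        pii_type = classify_column(samples)
        if pii_type:
            catalog.setdefault(table_name, {})[column] = pii_type

catalog = {}
scan_table("users", {"email": ["a@x.com", "b@y.org"], "age": [31, 42]}, catalog)
print(catalog)  # {'users': {'email': 'email'}}
```

In practice one would sample values from the live database and use a more robust classifier, but the shape of the step is the same: sample, match, tag.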
These three capabilities work together as follows: one scans the base datasets to find out where the sensitive data is, then uses data lineage to track how that data moves through derived datasets and data infrastructure, from the production databases and data warehouses to the data lakes and S3 buckets. With data lineage, tagging of sensitive data in derived datasets can be automated.
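Below is a minimal sketch of that propagation step, assuming lineage edges (source table to derived table) have already been extracted from query history; the edge list and table names are made up for illustration. It uses networkx, one of the Python libraries that can also visualize the graph.

```python
import networkx as nx

# (source, derived) pairs, e.g. parsed from INSERT ... SELECT statements
# in the query history. These edges are illustrative.
lineage_edges = [
    ("prod.users", "warehouse.users_clean"),
    ("warehouse.users_clean", "lake.marketing_extract"),
]

graph = nx.DiGraph(lineage_edges)

# Tables already tagged as sensitive by the base-dataset scan
tagged = {"prod.users"}

# Every table downstream of a tagged table inherits the tag, so a tag
# applied once to a base dataset flows to all derived datasets.
for table in list(tagged):
    tagged |= nx.descendants(graph, table)

print(sorted(tagged))
# ['lake.marketing_extract', 'prod.users', 'warehouse.users_clean']
```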
Once one has a data catalog in which all the PII data is tagged, the next step is to find out who has access to it. Most data warehouses and databases have a table that lists the privileges, or access controls, of all the users of the database. For example, one can use AWS Glue to get a list of which users have access to which tables or columns. This list can be audited regularly and analyzed to improve access controls.
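As a sketch of that audit, the query below reads the standard information_schema privilege table, assuming a Postgres-compatible warehouse; the connection details and table name are placeholders.

```python
import psycopg2

# Placeholder connection; point this at the warehouse being audited.
conn = psycopg2.connect("dbname=warehouse user=auditor")
with conn.cursor() as cur:
    # information_schema.table_privileges lists (grantee, privilege)
    # pairs in Postgres-compatible databases.
    cur.execute(
        """
        SELECT grantee, privilege_type
        FROM information_schema.table_privileges
        WHERE table_name = %s
        """,
        ("users",),
    )
    for grantee, privilege in cur.fetchall():
        print(grantee, privilege)
conn.close()
```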
To find out how sensitive data is being used, one first needs to log usage across all databases and capture query history, or the workloads that are running on the databases and data lakes. Most databases and data warehouses store query history and an information schema. Data technologies like Presto and Spark have hooks through which one can capture data usage, and log and store the usage history. Using this, one can start looking for patterns to see if there is any misuse of data. In the case of production databases, one can use proxies to give the operations team access, and capture usage through them.
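A minimal sketch of such pattern analysis is shown below. It assumes the query history is available as (user, query text) pairs and uses crude substring matching against a list of sensitive tables; a real system would parse the SQL properly, but the auditing idea is the same.

```python
from collections import Counter

# Tables tagged as sensitive in the catalog (illustrative names)
SENSITIVE_TABLES = {"users", "payments"}

def audit(query_history):
    """Count how often each user touches a sensitive table."""
    hits = Counter()
    for user, sql in query_history:
        text = sql.lower()
        for table in SENSITIVE_TABLES:
            if table in text:  # crude match; use a SQL parser in practice
                hits[(user, table)] += 1
    return hits

# Illustrative query history
history = [
    ("alice", "SELECT email FROM users WHERE id = 1"),
    ("bob",   "SELECT count(*) FROM orders"),
    ("alice", "SELECT * FROM payments"),
]
for (user, table), count in audit(history).items():
    print(f"{user} queried {table} {count} time(s)")
```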
Some resources are:
Tech stack/Tech solutions:
- AWS Glue
- Apache Spark