Make a submission

Accepting submissions till 28 Feb 2022, 11:00 AM

What are lean data practices and how can you adopt them for compliance? How do you handle user data deletion requests at an exabyte scale? How can you anonymize PII while also sharing data with third-party tools and services? What data governance strategies do the best organizations in India follow?

The Privacy Mode Best Practices Guide (BPG) is a compendium of answers to these and other questions around privacy and data security. Compiled from talks, interviews, and focus group discussions, the BPG is a practitioner’s view of implementing better privacy from the design stage and ensuring compliance with national and international laws.

Each submission is a chapter of the BPG, and will cover one or more of the following topics:

  • Data asset enumeration
  • Data flow enumeration
  • Data classification
  • Access control based on classification

Hosted by

Deep dives into privacy and security, and understanding the needs of the Indian tech ecosystem through guides, research, collaboration, events and conferences.

Supported by

Omidyar Network India invests in bold entrepreneurs who help create a meaningful life for every Indian, especially the hundreds of millions of Indians in low-income and lower-middle-income populations, ranging from the poorest among us to the existing middle class.
We’re the world’s most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. As a hyperscale cloud service provider, AWS provides access to highly advanced computing tools on rent for startups and SMEs at affordable prices.

Anwesha Sen

@anwesha25

Best Practices Guide: Data Governance 101

Submitted Feb 27, 2022

Name of Organization: LinkedIn

Domain: Social Networking Platform

Talk by Rajat Venkatesh

Summary

Achieving data compliance, privacy and security is a long journey if one doesn’t have any systems in place. The starting point is three simple questions:

  • Where is the data?
  • Who has access to data?
  • How is the data used?

To understand where data is, one needs a data catalog, the ability to recognize sensitive data, and the ability to capture data lineage so that every dataset in the catalog can be tagged. Getting a list of who has access to which data is quite simple, as most data warehouses and databases maintain a table with this information. Data usage can be analyzed from usage logs and query history across all databases, or from the workloads running on the databases and data lakes.

Detailed study

Data governance can be understood by defining the outcomes expected from the data governance function, such as:

  • Compliance: Data lifecycle and usage in accordance with laws and regulations
  • Privacy: Protect data as per regulations and user expectations
  • Security: Data and data infrastructure are adequately protected

Implementing data governance measures can be hard for the following reasons:

  • There is too much data: With large volumes of data and the ability to join different data sets, one cannot know in advance what insights the data yields. The volume of data generated keeps growing, and more of it is being shared, not just among private citizens but also with companies. So one cannot be sure what insights are being derived from all this data, or whether they are harmful.
  • There is too much complexity: The industry trend is to build a product for every niche, so a data platform is now assembled from many infrastructure pieces, each solving one niche. A typical stack has 8-10 components, and each must be protected to the same level. This is hard because different projects and commercial products have different capabilities when it comes to compliance, privacy and security.
  • There is no context for data usage: Analytics, data science and AI have objectives that compete with compliance, privacy and security. The former require broad access to data, whereas the latter are about restricting access. Since neither extreme is practical, one needs a much more nuanced approach to granting access, along with monitoring and auditing of data usage. One also needs to know who is using the data, the purpose of use, and so on, to make intelligent decisions about whether the data is being used properly.

To get started with the data governance process, one needs to answer the following questions:

  • Where is the data?
  • Who has access to data?
  • How is the data used?

The answers to these questions can be narrowed down to focus on sensitive data instead of all data; the definition of sensitive data differs between companies and sectors. To understand where sensitive data is, one needs three capabilities. The first is a data catalog, where one can store information or metadata about datasets. The second is the ability to scan databases and recognize sensitive data. The third is the ability to capture data lineage so that every dataset in the catalog can be tagged. One can build a lineage graph from query history and use graph algorithms to track and tag columns holding sensitive data in the data catalog. The graph can be visualized using Python libraries.
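The scanning capability can be sketched as a sampler that matches column values against PII patterns. This is a minimal illustration, not a production classifier: the patterns, tag names and threshold below are assumptions.

```python
import re

# Illustrative PII patterns; real scanners use many more rules
# (and often ML models) per data type and jurisdiction.
PII_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "phone": re.compile(r"^\+?\d[\d\s-]{8,}\d$"),
}

def classify_column(sampled_values, threshold=0.8):
    """Return the PII tag whose pattern matches most sampled values,
    or None if no pattern clears the threshold."""
    for tag, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in sampled_values if pattern.match(str(v)))
        if sampled_values and hits / len(sampled_values) >= threshold:
            return tag
    return None

# Usage: sample a column, classify it, record the tag in the catalog.
tag = classify_column(["a@example.com", "b@example.org", "c@example.net"])
```

Sampling keeps the scan cheap; the resulting tag is what gets stored as metadata against the column in the catalog.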

The three capabilities work together as follows: one scans the base datasets to find out where the sensitive data is, then uses data lineage to track how that data moves through derived datasets and the data infrastructure, from production databases and data warehouses to data lakes and S3 buckets. With data lineage, tagging of sensitive data in derived datasets can be automated.
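The automated tagging step amounts to a graph traversal: any dataset derived from a tagged dataset inherits its tag. A minimal sketch, with an assumed lineage mapping and illustrative dataset names:

```python
from collections import deque

def propagate_tags(lineage, tagged):
    """Breadth-first walk over the lineage graph, tagging every
    dataset derived (directly or transitively) from a tagged one."""
    result = dict(tagged)
    queue = deque(tagged)
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in result:
                result[child] = result[node]
                queue.append(child)
    return result

# `lineage` maps a dataset to the datasets derived from it, as
# reconstructed from query history; names here are hypothetical.
lineage = {
    "users_raw": ["users_clean"],
    "users_clean": ["marketing_emails", "analytics_daily"],
}
tags = propagate_tags(lineage, {"users_raw": "PII"})
# Every derived dataset now carries the PII tag in the catalog.
```

In practice the lineage graph comes from parsing query history, but the propagation logic is the same.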

Once the data catalog has all the PII tagged, the next step is to find out who has access to it. Most data warehouses and databases have a table listing the privileges, or access controls, of every user. For example, one can use AWS Glue to get a list of which users have access to which tables or columns. This list can be audited regularly and analyzed to improve access controls.
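The audit itself is a cross-reference between the grants listing (however it is exported from the warehouse or Glue) and the catalog tags. A sketch, where the record field names and sample data are assumptions:

```python
def sensitive_access(grants, catalog_tags):
    """Return (user, table) pairs where the granted table carries a
    sensitivity tag in the catalog."""
    return [
        (g["user"], g["table"])
        for g in grants
        if catalog_tags.get(g["table"])  # None/absent means not sensitive
    ]

# Hypothetical export of the warehouse's privileges table.
grants = [
    {"user": "alice", "table": "users_clean", "privilege": "SELECT"},
    {"user": "bob", "table": "analytics_daily", "privilege": "SELECT"},
]
catalog_tags = {"users_clean": "PII", "analytics_daily": None}
flagged = sensitive_access(grants, catalog_tags)
```

Running this regularly yields the review list the talk describes: who can currently read sensitive tables, so that stale or overly broad grants can be revoked.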

To find out how sensitive data is being used, one first needs to log usage across all databases: the query history, or the workloads running on the databases and data lakes. Most databases and data warehouses store query history and an information schema. Data technologies like Presto and Spark have hooks through which one can capture data usage and log the usage history. Using this, one can look for patterns that indicate misuse of data. In the case of production databases, one can route the operations team’s access through proxies so that this usage is logged as well.
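Once query history is captured, pattern-hunting can start with something as simple as flagging queries that touch tagged columns. A sketch, assuming a log of `{user, sql}` records and an illustrative set of sensitive column names:

```python
import re

# Column names would come from the data catalog's tags in practice;
# this set is illustrative.
SENSITIVE_COLUMNS = {"email", "phone"}

def flag_queries(query_log):
    """Return log entries whose SQL references any sensitive column.
    A naive token match, not a SQL parser, for illustration only."""
    flagged = []
    for entry in query_log:
        tokens = set(re.findall(r"\w+", entry["sql"].lower()))
        if tokens & SENSITIVE_COLUMNS:
            flagged.append(entry)
    return flagged

# Hypothetical captured query history.
log = [
    {"user": "alice", "sql": "SELECT email FROM users_clean"},
    {"user": "bob", "sql": "SELECT count(*) FROM analytics_daily"},
]
```

A real pipeline would parse the SQL properly and join against the catalog, but even this crude filter surfaces who is querying sensitive fields and how often.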

Resources

Tech stack/tech solutions:

  • AWS Glue
  • Apache Spark

