Make a submission

Accepting submissions till 28 Feb 2022, 11:00 AM

What are lean data practices and how can you adopt them for compliance? How do you handle user data deletion requests at an exabyte scale? How can you anonymize PII while also sharing data with third-party tools and services? What data governance strategies do the best organizations in India follow?

The Privacy Mode Best Practices Guide (BPG) is a compendium of answers to these and other questions around privacy and data security. Compiled from talks, interviews, and focus group discussions, the BPG is a practitioner’s view of implementing better privacy from the design stage and ensuring compliance with national and international laws.

Each submission is a chapter of the BPG, and will cover one or more of the following topics:

  • Data asset enumeration
  • Data flow enumeration
  • Data classification
  • Access control based on classification

Hosted by

Deep dives into privacy and security, and understanding the needs of the Indian tech ecosystem through guides, research, collaboration, events and conferences.

Supported by

Omidyar Network India invests in bold entrepreneurs who help create a meaningful life for every Indian, especially the hundreds of millions of Indians in low-income and lower-middle-income populations, ranging from the poorest among us to the existing middle class.
We’re the world’s most comprehensive and broadly adopted cloud platform, offering over 200 fully featured services from data centers globally. As a hyperscale cloud service provider, AWS provides access to highly advanced computing tools on rent for startups and SMEs at affordable prices.

Anwesha Sen

@anwesha25

Best Practices Guide: Data Governance 101

Submitted Feb 27, 2022

Name of Organization: LinkedIn

Domain: Social Networking Platform

Talk by Rajat Venkatesh

Summary

Achieving data compliance, privacy and security is a long journey if one doesn’t have any systems in place. The starting point is three simple questions:

  • Where is the data?
  • Who has access to data?
  • How is the data used?

To understand where data is, one needs a data catalog, the ability to recognize sensitive data, and the ability to capture data lineage so that every dataset in the catalog can be tagged. Getting a list of who has access to which data is quite simple, as most data warehouses and databases maintain a table with this information. Data usage can be analyzed from usage logs and query history across all databases, or from the workloads running on the databases and data lakes.

Detailed study

Data governance can be understood by defining the outcomes expected from the data governance function, such as:

  • Compliance: Data lifecycle and usage in accordance with laws and regulations
  • Privacy: Protect data as per regulations and user expectations
  • Security: Data and data infrastructure are adequately protected

Implementing data governance measures can be hard for the following reasons:

  • There is too much data: With large volumes of data and the ability to join different data sets, one cannot know in advance what insights the data yields. The volume of data generated keeps growing, and more of it is being shared, not just among private citizens but also with companies. So one cannot be sure what insights are being derived from all this data, or whether they are harmful.
  • There is too much complexity: The industry trend is to build a product for every niche, so a data platform is now assembled from many infrastructure pieces, each solving one niche. A typical stack has 8-10 components, and each must be protected to the same level. This is hard because different projects and commercial products have different capabilities when it comes to compliance, privacy and security.
  • There is no context for data usage: Analytics, data science and AI have objectives that compete with compliance, privacy and security. The former require broad access to data, whereas the latter are about restricting access. Since neither extreme is practical, one needs a much more nuanced approach to granting access, along with monitoring and auditing of data usage. One also needs to know who is using the data, the purpose of use, and so on, to make intelligent decisions about whether the data is being used properly.

To get started with the data governance process, one needs to answer the following questions:

  • Where is the data?
  • Who has access to data?
  • How is the data used?

The answers to these questions can be narrowed down to focus on sensitive data instead of all data; the definition of sensitive data differs between companies and sectors. To understand where sensitive data is, one needs three capabilities. The first is a data catalog, where one can store information or metadata about datasets. The second is the ability to scan databases and recognize sensitive data. The third is the ability to capture data lineage so that every dataset in the catalog can be tagged. One can build a lineage graph from query history and use graph algorithms to track and tag columns holding sensitive data in the data catalog. The graph can be visualized using Python libraries.
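The scanning capability can be sketched as a sampler that matches column values against PII patterns. This is a minimal illustration, not a production classifier: the patterns, tag names and threshold below are assumptions.

```python
import re

# Illustrative PII patterns; real scanners use many more rules
# (and often ML models) per data type and jurisdiction.
PII_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "phone": re.compile(r"^\+?\d[\d\s-]{8,}\d$"),
}

def classify_column(sampled_values, threshold=0.8):
    """Return the PII tag whose pattern matches most sampled values,
    or None if no pattern clears the threshold."""
    for tag, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in sampled_values if pattern.match(str(v)))
        if sampled_values and hits / len(sampled_values) >= threshold:
            return tag
    return None

# Usage: sample a column, classify it, record the tag in the catalog.
tag = classify_column(["a@example.com", "b@example.org", "c@example.net"])
```

Sampling keeps the scan cheap; the resulting tag is what gets stored as metadata against the column in the catalog.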

The three capabilities work together as follows: one scans the base datasets to find out where the sensitive data is, then uses data lineage to track how that data moves through derived datasets and the data infrastructure, from production databases and data warehouses to data lakes and S3 buckets. With data lineage, tagging of sensitive data in derived datasets can be automated.
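The automated tagging step amounts to a graph traversal: any dataset derived from a tagged dataset inherits its tag. A minimal sketch, with an assumed lineage mapping and illustrative dataset names:

```python
from collections import deque

def propagate_tags(lineage, tagged):
    """Breadth-first walk over the lineage graph, tagging every
    dataset derived (directly or transitively) from a tagged one."""
    result = dict(tagged)
    queue = deque(tagged)
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in result:
                result[child] = result[node]
                queue.append(child)
    return result

# `lineage` maps a dataset to the datasets derived from it, as
# reconstructed from query history; names here are hypothetical.
lineage = {
    "users_raw": ["users_clean"],
    "users_clean": ["marketing_emails", "analytics_daily"],
}
tags = propagate_tags(lineage, {"users_raw": "PII"})
# Every derived dataset now carries the PII tag in the catalog.
```

In practice the lineage graph comes from parsing query history, but the propagation logic is the same.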

Once the data catalog has all the PII tagged, the next step is to find out who has access to it. Most data warehouses and databases have a table listing the privileges, or access controls, of every user. For example, one can use AWS Glue to get a list of which users have access to which tables or columns. This list can be audited regularly and analyzed to improve access controls.
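The audit itself is a cross-reference between the grants listing (however it is exported from the warehouse or Glue) and the catalog tags. A sketch, where the record field names and sample data are assumptions:

```python
def sensitive_access(grants, catalog_tags):
    """Return (user, table) pairs where the granted table carries a
    sensitivity tag in the catalog."""
    return [
        (g["user"], g["table"])
        for g in grants
        if catalog_tags.get(g["table"])  # None/absent means not sensitive
    ]

# Hypothetical export of the warehouse's privileges table.
grants = [
    {"user": "alice", "table": "users_clean", "privilege": "SELECT"},
    {"user": "bob", "table": "analytics_daily", "privilege": "SELECT"},
]
catalog_tags = {"users_clean": "PII", "analytics_daily": None}
flagged = sensitive_access(grants, catalog_tags)
```

Running this regularly yields the review list the talk describes: who can currently read sensitive tables, so that stale or overly broad grants can be revoked.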

To find out how sensitive data is being used, one first needs to log usage across all databases: the query history, or the workloads running on the databases and data lakes. Most databases and data warehouses store query history and an information schema. Data technologies like Presto and Spark have hooks through which one can capture data usage and log the usage history. Using this, one can look for patterns that indicate misuse of data. In the case of production databases, one can route the operations team’s access through proxies so that this usage is logged as well.
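Once query history is captured, pattern-hunting can start with something as simple as flagging queries that touch tagged columns. A sketch, assuming a log of `{user, sql}` records and an illustrative set of sensitive column names:

```python
import re

# Column names would come from the data catalog's tags in practice;
# this set is illustrative.
SENSITIVE_COLUMNS = {"email", "phone"}

def flag_queries(query_log):
    """Return log entries whose SQL references any sensitive column.
    A naive token match, not a SQL parser, for illustration only."""
    flagged = []
    for entry in query_log:
        tokens = set(re.findall(r"\w+", entry["sql"].lower()))
        if tokens & SENSITIVE_COLUMNS:
            flagged.append(entry)
    return flagged

# Hypothetical captured query history.
log = [
    {"user": "alice", "sql": "SELECT email FROM users_clean"},
    {"user": "bob", "sql": "SELECT count(*) FROM analytics_daily"},
]
```

A real pipeline would parse the SQL properly and join against the catalog, but even this crude filter surfaces who is querying sensitive fields and how often.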

Resources

Tech stack/tech solutions:

  • AWS Glue
  • Apache Spark

