Best Practices Guide: User Data Anonymization and Deletion at LinkedIn

Submitted Jan 18, 2022

Name of Organization: LinkedIn

Talk by Bhupendra Kumar Jain, Pratap Kudupudi

Key Takeaways

LinkedIn maintains large data lakes for its services and applications. The practice described here combines proprietary technologies, tools, and processes to ensure compliance with GDPR’s user data deletion requirements.

LinkedIn uses an open-sourced data unification service/tool called DALI (Data Access Layer at LinkedIn) and an open-sourced technology called Gobblin to read and process user data. Access to the data is also governed by a layer called DataHub, which is a repository of all meta-data.

Services are performed on request and involve DALI, Gobblin, and DataHub. LinkedIn uses DALI and Gobblin for:

ReadTime Anonymization of PII
WriteTime Anonymization of PII
Data Transformation of user data
Data Deletion/Purging of user data

Terms/Glossary

DALI: Data Access Layer at LinkedIn, a layer that sits on top of HDFS and unifies all data access to the lake
Gobblin: A data reading and transformation tool that can extract data from any source, transform it, and write the transformed data to any destination
DataHub: Cache of all schema meta-data about all datasets that exist in the data lake

A view of the data-access and deletion layer at LinkedIn

Detailed study

LinkedIn captures about 50 categories of Personally Identifiable Information (PII) from each user. These fall into three groups: data the user generates or enters, data collected automatically, and data derived from the first two. Examples include name, age, addresses, job history, IP addresses, device information, and product/marketing information. The data is then used to perform a range of tasks, including suggesting networking leads and job leads.

The data is accessed by three major ‘clients’:

The User
The LinkedIn employee
Third Party services/tools

For LinkedIn, data exists in three stages: online data, which is currently in use and being accessed; offline data, which sits in the data lake; and data in motion in nearlines/pipelines. Each client may access or read this data in different ways, and semantics vary across use cases.

LinkedIn has to solve two problems. The first: PII must be anonymized and access to it limited, yet user data needs to remain accessible to community/third party services that provide other benefits and services on the LinkedIn platform.
The second: when a User Data Deletion request comes in, user data in offline and online stores, in nearlines and pipelines, as well as derived data and data accessed by third parties, must be deleted simultaneously.

Both tasks need to be optimized for scale: LinkedIn currently operates at an exabyte scale of data. Creating offline clusters of anonymized user data for third parties would more than double the data footprint at LinkedIn. It would also slow down LinkedIn’s own platform.

To solve this, LinkedIn has developed in-house technologies (DALI, Gobblin, and DataHub) and uses them along with the open-source technologies Apache Kafka (also developed at LinkedIn and open-sourced), HDFS, and Hive LLAP to optimize these processes.

ReadTime and WriteTime anonymization is one process by which LinkedIn optimizes PII anonymization and makes the data available to internal and external clients.

LinkedIn creates a Lookup Table (LUT) on the DALI layer containing the user name, and adds a ‘Compliance Line’ detailing which categories of data need to be transformed. The LUT is then ingested into the data lake using the in-house tool Gobblin, which performs the required transformation: in this case, anonymizing the PII.
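
The lookup-driven transformation described above can be illustrated with a minimal Python sketch. This is not LinkedIn's, DALI's, or Gobblin's actual code or API: the `ComplianceLine` class, the `FIELD_CATEGORY` mapping, the category names, and the salted-hash step are all assumptions made for illustration. A salted hash is strictly pseudonymization; a production system may use a different technique.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical mapping from record fields to PII categories. In the talk,
# this kind of schema meta-data is said to live in DataHub.
FIELD_CATEGORY = {"name": "identity", "ip_address": "network", "job_title": "career"}


@dataclass(frozen=True)
class ComplianceLine:
    """One hypothetical LUT entry: which PII categories to transform for a member."""
    member_id: int
    categories: frozenset


def pseudonymize(value: str, salt: str = "demo-salt") -> str:
    """Replace a value with a truncated salted SHA-256 digest (illustrative choice)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def apply_lookup_table(records, lookup_table):
    """Return copies of the records with the listed categories pseudonymized."""
    by_member = {line.member_id: line.categories for line in lookup_table}
    transformed_records = []
    for record in records:
        categories = by_member.get(record["member_id"], frozenset())
        transformed = dict(record)  # copy, so the input rows stay untouched
        for field, category in FIELD_CATEGORY.items():
            if category in categories and field in transformed:
                transformed[field] = pseudonymize(str(transformed[field]))
        transformed_records.append(transformed)
    return transformed_records


records = [
    {"member_id": 1, "name": "Asha", "ip_address": "10.0.0.1", "job_title": "Engineer"},
    {"member_id": 2, "name": "Ravi", "ip_address": "10.0.0.2", "job_title": "Analyst"},
]
lut = [ComplianceLine(member_id=1, categories=frozenset({"identity", "network"}))]
anonymized = apply_lookup_table(records, lut)
```

In this sketch only member 1's `name` and `ip_address` are transformed; the `job_title` field and every row for member 2 pass through unchanged, because the lookup table does not list them.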

A similar transformation process is followed when a user data deletion request comes in. The Lookup Table is created and ingested into the data lake, and Gobblin performs the required transformation, which here is purging the row/record.
Once the row (the user data) has been deleted, the LUT is cached and the data purged.
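
As a rough illustration of the deletion case, the sketch below drops every row belonging to a member listed in a deletion lookup table. The function name `purge_records` and the row layout are hypothetical; the real purge runs against datasets in the data lake rather than in-memory lists.

```python
def purge_records(records, deletion_lut):
    """Drop every record whose member_id appears in the deletion lookup table.

    Returns the surviving records and the number of rows purged.
    """
    to_delete = set(deletion_lut)
    kept = [row for row in records if row["member_id"] not in to_delete]
    return kept, len(records) - len(kept)


rows = [
    {"member_id": 1, "event": "profile_view"},
    {"member_id": 2, "event": "job_search"},
    {"member_id": 2, "event": "message_sent"},
    {"member_id": 3, "event": "profile_view"},
]
kept_rows, purged_count = purge_records(rows, deletion_lut={2})
```

Returning the purged row count alongside the surviving rows is a design choice made here so that a caller could log or audit each purge run.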

A similar process is also performed when a user requests specific actions, such as not having their data visible to hirers, advertisers, and third party services on the platform.
The LUT is then created, and the ‘Compliance Line’ ‘purges’ the data only for the identified ‘Clients’, i.e. third party services. This transformation happens in the nearlines/pipelines and in online data.

The DALI reader thus becomes an abstraction layer to filter the data according to requests.
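
One way to picture a reader acting as a filtering abstraction layer is the sketch below, in which every client reads through a single function that hides rows the client type is not allowed to see. The `OPT_OUTS` table, the client type names, and `read_dataset` are assumptions for illustration and do not reflect DALI's real interface.

```python
# Hypothetical per-member opt-outs: member_id -> client types that must not
# see that member's data.
OPT_OUTS = {2: {"third_party", "advertiser"}}


def read_dataset(rows, client_type, opt_outs=OPT_OUTS):
    """Yield only the rows that the given client type is allowed to see."""
    for row in rows:
        if client_type in opt_outs.get(row["member_id"], set()):
            continue  # this member opted out of visibility for this client type
        yield row


dataset = [
    {"member_id": 1, "headline": "Data engineer"},
    {"member_id": 2, "headline": "Product analyst"},
]
third_party_view = list(read_dataset(dataset, "third_party"))
employee_view = list(read_dataset(dataset, "employee"))
```

Because the filter is applied at read time, the stored rows are not duplicated per client, which matches the stated goal of avoiding a separate anonymized copy of the data.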

“If you take any DBMS system, know how the partitioning is done, how the data is read on your disk, how the statistics are collected, where the stats are stored, how the data is laid out on what data structures are used to access your data is sort of abstracted if you look at it, right. All that the end user is cognizant about is their existing database. And there exists a table and all the low level details are abstracted from the end user. So, the end user always gets a view of a DB and the table as a row and the column format. And that’s exactly what DALI is trying to abstract or unify for us.”

Tech stack/Tech solutions:
Hadoop/HDFS
Apache Hive / Hive LLAP
Apache Kafka
RDD/Spark

Hosted by

Privacy Mode

Deep dives into privacy and security, and understanding needs of the Indian tech ecosystem through guides, research, collaboration, events and conferences.

Supported by

Omidyar Network India

Omidyar Network India invests in bold entrepreneurs who help create a meaningful life for every Indian, especially the hundreds of millions of Indians in low-income and lower-middle-income populations, ranging from the poorest among us to the existing middle class.

Privacy Best Practices Guide

Domain: Social/Networking/Platform