On 28 April, the Data Privacy Product and Engineering Conference held a Birds of a Feather (BOF) session about handling data deletion requests from users under privacy laws, and how Indian companies service these requests. The session was moderated by Venkata Pingali, co-founder at Scribble Data. Sreenath Kamath of Hotstar and Sheik Idris of Zeta participated in this session.
This session was organized under the Chatham House Rule. The following summary therefore does not attribute quotes to speakers or participants.
A focussed conference on handling data deletion will be held under the scalable privacy engineering conference on 30 July. Details at https://hasgeek.com/rootconf/scalable-data-privacy-engineering-conf/
The requirement across privacy laws - whether it is CCPA, GDPR or even the proposed PDP Bill - is that users have the right to first discover what a company knows about them, and can then ask the company to delete all of their information.
However, deleting user data is not as easy as it appears on the surface. When deleting a record of an individual, you can do it in two ways: you can go to the ultimate database or the disk and actually remove that record - truncate, update and delete from sources. The other approach is called a soft delete, which is that as the data is flowing through the system you virtualize it in some way - either anonymize on the fly or drop the record as the policy defines.
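The two approaches can be sketched as follows. This is a minimal illustration, not a method presented in the session: the `users` table, the `hard_delete` and `soft_delete_view` helpers, and the `"drop"` / `"anonymize"` policy names are all hypothetical.

```python
import sqlite3


def hard_delete(conn, user_id):
    """Hard delete: physically remove the user's row from the source table."""
    conn.execute("DELETE FROM users WHERE id = ?", (user_id,))
    conn.commit()


def soft_delete_view(rows, forgotten_ids, policy="drop"):
    """Soft delete: virtualize the deletion while records flow through.

    policy="drop" omits forgotten users; policy="anonymize" blanks their PII.
    The stored rows are left untouched in both cases.
    """
    for row in rows:
        if row["id"] not in forgotten_ids:
            yield row
        elif policy == "anonymize":
            yield {**row, "name": "REDACTED", "email": None}
        # policy == "drop": emit nothing for this record


# Hypothetical in-memory source with two users.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Asha", "asha@example.com"), (2, "Ravi", "ravi@example.com")],
)
conn.commit()

rows = [dict(r) for r in conn.execute("SELECT * FROM users ORDER BY id")]
dropped = list(soft_delete_view(rows, {2}, policy="drop"))
masked = list(soft_delete_view(rows, {2}, policy="anonymize"))

hard_delete(conn, 1)
remaining = [r["id"] for r in conn.execute("SELECT id FROM users")]
```

Note that the soft delete leaves the record on disk; whether that is acceptable for a given request is a policy question, not a technical one.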
The complexity lies in interpreting what can be deleted, or what the appropriate action is, depending upon the source and the objective. Only lawyers can define this. The software system inside the organization has to be flexible enough to say that for this class of data, for this kind of usage, I’m going to apply this kind of soft delete, hard delete or whatever approach is suitable.
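One way to keep the system flexible is to hold the legal interpretation in a lookup table that the code consults, rather than hard-coding it. The sketch below assumes a hypothetical policy table keyed by data class and usage; the class names, usages and the `RETAIN` action are invented for illustration and would in practice come from legal review.

```python
from enum import Enum


class Action(Enum):
    HARD_DELETE = "hard_delete"
    SOFT_DELETE = "soft_delete"
    RETAIN = "retain"  # hypothetical: data that must be kept despite a request


# Hypothetical policy table: (data class, usage) -> action.
# The entries are placeholders, not legal guidance.
POLICY = {
    ("contact_details", "marketing"): Action.HARD_DELETE,
    ("contact_details", "analytics"): Action.SOFT_DELETE,
    ("transaction_history", "tax_audit"): Action.RETAIN,
}


def resolve_action(data_class, usage):
    """Look up the deletion action for a class of data and a kind of usage.

    Unknown combinations raise instead of guessing, so that gaps in the
    policy are surfaced for review rather than silently defaulted.
    """
    try:
        return POLICY[(data_class, usage)]
    except KeyError:
        raise LookupError(
            f"No deletion policy defined for {data_class!r} used for {usage!r}"
        ) from None
```

Failing loudly on an unmapped combination is a design choice in this sketch: it avoids applying a default that nobody with legal authority has approved.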
A big part of the challenge is managing the complexity of all of this, and the evidence you can show the user that you have actually deleted their data. Evidence is important because an end user can go to the regulatory authority and say that XYZ company hasn’t forgotten them: the user received an email from XYZ company even after telling the company to forget their data. Whatever you do inside your organization, guarding the outgoing filter is highly critical.
- Proliferation of the data.
- Mis-naming of data, which adds a lot of complexity to discovering what even needs to be deleted.
- Lack of a disciplined data end-map.
- If an organization allows anybody to access data and make any number of copies, then data discovery becomes a very expensive process. Technology can only provide part of the solution in such cases.
- When organizations operate in multiple geographies and have accumulated petabytes of data, it becomes very hard to know where Personally Identifiable Information (PII) resides in these petabytes of data.
Some of the foremost challenges in data discovery are identifying the pipes and identifying the data sources. Identifying data sources can give an organization immense power to innovate, but this identification cannot be done without getting into the user data model itself, which violates user privacy.
Streamlining the data model is another big challenge because data duplication takes place very often. The concept of Master Data Management (MDM), which was developed by enterprise architects, needs to be brought back inside organizations as a mainstream practice. MDM is the key for privacy by design.
Make sure you have a single catalog of your data sources, your users, your customers. Never duplicate the data. Duplication means that some business unit decided to clone the database and try to manage it on their own. This is another big challenge.
Challenges in the discovery phase include federation of PII across different databases. A common problem is that PII is treated as the primary key in most databases, which might not be right. When you implement the right to erasure or the right to access, handling such requests becomes very difficult because cascade deletes, including cascade deletes of derived datasets, can be challenging in your data models. Do not use phone numbers or email ids as primary keys. While you can treat this information logically as a primary key, introduce your own primary key.
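A minimal sketch of this advice, using SQLite purely for illustration: the `users` and `orders` tables and the `erase_user` helper are hypothetical. The email stays unique (a logical key), but other tables reference a surrogate `user_id`, so the PII lives in exactly one place and an erasure cascades to dependent rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,   -- surrogate key, carries no PII
        email   TEXT UNIQUE NOT NULL,  -- logical key, stored in one table only
        phone   TEXT UNIQUE
    );
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        user_id  INTEGER NOT NULL
                 REFERENCES users(user_id) ON DELETE CASCADE,
        item     TEXT NOT NULL
    );
    """
)
conn.execute(
    "INSERT INTO users (user_id, email, phone) "
    "VALUES (1, 'asha@example.com', '+910000000001')"
)
conn.execute("INSERT INTO orders (user_id, item) VALUES (1, 'book')")
conn.execute("INSERT INTO orders (user_id, item) VALUES (1, 'pen')")
conn.commit()


def erase_user(conn, email):
    """Erase a user by their logical key; dependent rows cascade."""
    conn.execute("DELETE FROM users WHERE email = ?", (email,))
    conn.commit()


erase_user(conn, "asha@example.com")
orders_left = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
users_left = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Had `orders` used the email as its foreign key, the address would have been copied into every order row and into anything derived from it.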
Abusing column names is another common challenge which goes against the principles of privacy by design. Somebody names a column ‘column_1’, and later it is discovered that ‘column_1’ contains biometric data. Engineers and teams tend to store critical data inside randomly named columns, which leads to retrospectively fixing the database design when PII is already linked via those columns.
One practice that organizations must follow is that their analytics teams should only have access to obfuscated data. This is helpful especially when teams are uncoordinated and you cannot take a risk of exposing teams to sensitive data. If the data is sitting in a SQL database and you know that it is being accessed, you can secure PII data if you have already anonymized the data. In an enterprise, there are uncontrolled sources and uncontrolled flows through the entire system.
The most important thing any startup or any company can start with is to guard its communication channels. Your data is spread across so many places - rest data sources, databases, etc. If you guard your email, telephone, IVR and SMS communication channels, and you maintain a blacklist of users who have exercised the right to be forgotten or the right to access, you’re good because there’s a final check. You’ve done everything. You have a fancy model which says, “send this marketing SMS or an email.” But the guardrails and processes will make the machines and humans in the loop stop and say, “don’t send it.”
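The final check on outgoing communication can be sketched as a small guard that every send path must pass through. Everything here is an assumption made for illustration: the `OutboundGuard` class, the `send_message` wrapper, and the choice to store SHA-256 fingerprints of addresses rather than the raw addresses.

```python
import hashlib


def _fingerprint(address):
    """Normalize an address and hash it, so the list avoids raw addresses."""
    normalized = address.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class OutboundGuard:
    """Suppression list consulted before any email, SMS or call goes out."""

    def __init__(self):
        self._suppressed = set()

    def forget(self, address):
        """Record that the owner of this address asked to be forgotten."""
        self._suppressed.add(_fingerprint(address))

    def allows(self, address):
        return _fingerprint(address) not in self._suppressed


def send_message(guard, address, body, transport):
    """Send only if the guard allows it; return whether a send happened."""
    if not guard.allows(address):
        return False
    transport(address, body)
    return True


# Usage: a fake transport that records what would have been sent.
sent = []
guard = OutboundGuard()
guard.forget("Ravi@Example.com ")

blocked = send_message(guard, "ravi@example.com", "Sale!", lambda a, b: sent.append(a))
delivered = send_message(guard, "asha@example.com", "Sale!", lambda a, b: sent.append(a))
```

The guard does not depend on upstream systems having deleted anything, which is what makes it useful as a last line of defence.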
Anonymize data sources at the rest layer. Or do it at the query layer.
Depending on the stage at which you are implementing GDPR in your organization, you can start with the query layer, because your data is at petabyte scale. Fix this problem in the query engine. Almost all query engines and big data query engines like Hive, Presto and Spark provide integration with your data input format. Make sure you start at least with a query engine.
You can also provide query-based obfuscation: when a query touches a PII column such as email, make sure that, at the time of returning results to an analyst, all PII is either encrypted or masked, or that the column is not shown to the analyst at all. Then you can move on to anonymization in the rest layer. That’s the second part.
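As an engine-agnostic sketch of query-time obfuscation, the function below post-processes result rows before they reach an analyst. It is not the API of Hive, Presto or Spark; the `PII_POLICY` mapping, the column names and the `"mask"` / `"hide"` rule names are hypothetical, and only masking and hiding are shown, not encryption.

```python
# Hypothetical per-column rules applied when results are returned.
PII_POLICY = {
    "email": "mask",         # show a partially masked value
    "biometric_id": "hide",  # do not return the column at all
}


def mask_email(value):
    """Keep the first character and the domain, mask the rest."""
    local, _, domain = value.partition("@")
    return f"{local[:1]}***@{domain}"


def obfuscate_row(row, policy=PII_POLICY):
    """Return a copy of a result row with PII columns masked or removed."""
    out = {}
    for column, value in row.items():
        rule = policy.get(column)
        if rule == "hide":
            continue  # the column is never shown to the analyst
        if rule is None or value is None:
            out[column] = value
        elif rule == "mask":
            out[column] = mask_email(value)
        else:
            raise ValueError(f"Unknown obfuscation rule: {rule!r}")
    return out


raw = {"id": 7, "email": "asha@example.com", "biometric_id": "xyz", "city": "Pune"}
safe = obfuscate_row(raw)
```

The underlying data is unchanged, which is why this layer is a starting point and anonymization at rest still has to follow.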
Third, you can do cascading obfuscation where you start with your master data. Then you make sure all the copies replicated from it are refreshed, which means re-running your pipelines, and a pipeline can have any depth. At each level, re-run the complete pipeline and make sure that all data at rest is anonymized.
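Working out which pipelines to re-run, and in what order, can be sketched with a lineage map and a topological sort. The dataset names and the `LINEAGE` structure below are hypothetical; `graphlib` is part of the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical lineage: each dataset maps to the datasets it is derived from.
LINEAGE = {
    "users_master": set(),
    "product_catalog": set(),
    "orders_enriched": {"users_master"},
    "marketing_segments": {"orders_enriched"},
    "weekly_report": {"orders_enriched", "marketing_segments"},
}


def refresh_order(lineage, root):
    """Return root plus everything derived from it, upstream datasets first."""
    # Collect every dataset that depends on root, at any depth.
    affected = {root}
    changed = True
    while changed:
        changed = False
        for dataset, parents in lineage.items():
            if dataset not in affected and parents & affected:
                affected.add(dataset)
                changed = True
    # Sort only the affected part of the graph so parents are refreshed first.
    subgraph = {d: lineage[d] & affected for d in affected}
    return list(TopologicalSorter(subgraph).static_order())


order = refresh_order(LINEAGE, "users_master")
```

After anonymizing `users_master`, each dataset in `order` would be rebuilt in turn; `product_catalog` is left out because nothing in it derives from the master data.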