Zeotap: Privacy in Data as a Service (DaaS) business
Summary
Zeotap, which now has more than one hundred partners, runs a data business offering managed services for data sets. It operates on Data as a Service (DaaS) and Software as a Service (SaaS) models, and provides integrations with cloud exchanges and APIs with SLA guarantees. In order to cater to its EU customers, it became compliant with the General Data Protection Regulation (GDPR) in 2018.
Zeotap uses a hybrid software architecture to implement its microservices, and persists data across data lakes, databases and fast lookup stores. The conceptual model has three layers: rules, logical and processing. Specific actions are performed on the data sets based on the defined policy rules. Zeotap has built its own data catalogue as a reusable component. All data is kept within the EU, so there are no data sovereignty concerns.
Zeotap follows established design principles for consent management, access control, data retention policies, standardisation, certification, auditing and governance to protect user data. With a reusable technology stack for multiple data pipelines and data-driven products, Zeotap has improved its overall privacy ecosystem for both consumers and data providers, and complies with the GDPR and Data Protection Bill regulations.
Introduction
Zeotap, which now has more than one hundred partners, offers managed services for data sets and provides integrations with cloud exchanges and APIs with SLA guarantees. It primarily runs a data business on a Data as a Service (DaaS) model, alongside a data platform offered as Software as a Service (SaaS). In order to cater to its EU customers, it became compliant with the General Data Protection Regulation (GDPR) in 2018.
Problem Statement
The GDPR is a law for data protection and privacy in the European Union (EU). It defines the rights and scope of a company in collecting personal data from EU member countries. It grants users rights such as data portability, data transfer and knowing what data a company stores about them. The law defines two entities: a Data Controller, who determines the purposes and means of processing personal data, and a Data Processor, who processes data on the controller's behalf. A company can act in both roles. A Data Protection Officer assesses the impact of data processing, and is aware of the business context, requirements, regulations, standards and laws.
The data can be broadly classified as people-centric and non-personal product data. The mobile identifier from a device is an example of personal identification data, while weather or traffic data are examples of non-personal data. Zeotap is in the data business of sourcing, processing and refining data. Hence, it is mandatory to be compliant with GDPR for its EU customers. These regulations affect human resources, marketing, legal, information security, technology and product teams. There are also auditing requirements that need to be implemented within the organisation.
Software Architecture
The software design can follow a shared-everything or a shared-nothing architecture. Zeotap uses a hybrid approach: the control plane is stateless, and data in flight is handled with a shared-nothing architecture. Microservices and data networks make it visible where data exchange occurs, as there can be multiple data providers. The conceptual model has three layers: the rules layer, the logical layer and the processing layer.
A policy is a rule that is used for filtering. Based on the policy evaluation, an action is taken, such as dropping a data set or nullifying data that is not used. The logical model has a processing layer and a storage layer. The policies and rules reside in the storage layer. The logical layer has data assets in various formats. These can be plain data sets, user assets, audit assets and consent assets. The SDK on the website and app acts as a data layer and pushes content to Zeotap. Consent management occurs first, and then personally identifiable information such as email, phone number and mobile identifier is managed.
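The policy evaluation described above can be illustrated with a minimal Python sketch. It is not Zeotap's implementation: the `Policy` class, the action names and the two example rules are hypothetical, chosen only to show a rule that filters a record and then triggers a drop or nullify action.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical action names; the text only mentions dropping and nullifying.
DROP = "drop"
NULLIFY = "nullify"


@dataclass
class Policy:
    name: str
    field: str                          # field the action applies to
    predicate: Callable[[dict], bool]   # returns True when the rule fires
    action: str                         # DROP or NULLIFY


def apply_policies(record: dict, policies: List[Policy]) -> Optional[dict]:
    """Return the filtered record, or None if any policy drops it."""
    result = dict(record)  # never mutate the caller's record
    for policy in policies:
        if not policy.predicate(result):
            continue
        if policy.action == DROP:
            return None
        if policy.action == NULLIFY:
            result[policy.field] = None
    return result


# Illustrative rules: drop records without consent, nullify unused emails.
policies = [
    Policy("no-consent", "consent", lambda r: not r.get("consent"), DROP),
    Policy("unused-email", "email", lambda r: r.get("email") is not None, NULLIFY),
]
```

Because consent is checked first, a record without consent is dropped before any personal field is touched, which mirrors the ordering described in the text.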
Deletion, compliance and time to live (TTL) are first-class processing entities. The user has the right to have data deleted from the system. A separate catalogue for compliance exists in order to support different regulations, so compliance is treated as a runtime parameter. The compliance catalogue includes runtime parameters, thresholds and the actions to be taken, which are a function of the policy and the parameters. Compliance is the final level of granularity in the policy.
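One way to picture compliance as a runtime parameter is a catalogue keyed by regulation, as in the sketch below. The catalogue layout, the `example_regulation` entry and all of its values are assumptions made for illustration; only the idea that the action is a function of the policy and its parameters comes from the text.

```python
from datetime import datetime, timedelta

# Hypothetical catalogue: regulation -> thresholds and the action to take.
# The values are illustrative, not Zeotap's actual settings.
COMPLIANCE_CATALOGUE = {
    "gdpr": {"ttl_days": 30, "on_expiry": "delete"},
    "example_regulation": {"ttl_days": 180, "on_expiry": "nullify"},
}


def resolve_action(compliance: str, stored_at: datetime, now: datetime) -> str:
    """Choose an action from the catalogue entry selected at runtime."""
    params = COMPLIANCE_CATALOGUE[compliance]
    expired = now - stored_at > timedelta(days=params["ttl_days"])
    return params["on_expiry"] if expired else "keep"
```

Supporting a new regulation then means adding a catalogue entry rather than changing pipeline code.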
Design Principles
Consent Management
Cookie consent should be explicit and granular. It should be simple for customers to choose their preferences for privacy protection and for data retention. It should be clear which cookies are used for analytics and which for marketing and personalisation. It is mandatory to collect consent and log it for legal audits. Consent can also be obtained by phone call or IVR, or through the website and email. From an orchestration perspective, consent must also be made available to all downstream systems.
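A granular, auditable consent record could look like the following sketch. The `ConsentRecord` and `ConsentLog` names, the purpose keys and the "latest decision wins" rule are assumptions, not details from the source.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ConsentRecord:
    user_id: str
    channel: str                # e.g. "website", "email", "ivr"
    purposes: Dict[str, bool]   # one explicit choice per purpose
    recorded_at: str            # ISO 8601 timestamp for audits


class ConsentLog:
    """Append-only log so that every consent decision can be audited."""

    def __init__(self) -> None:
        self._entries: List[ConsentRecord] = []

    def record(self, user_id: str, channel: str, purposes: Dict[str, bool],
               now: Optional[datetime] = None) -> ConsentRecord:
        now = now or datetime.now(timezone.utc)
        entry = ConsentRecord(user_id, channel, dict(purposes), now.isoformat())
        self._entries.append(entry)
        return entry

    def allows(self, user_id: str, purpose: str) -> bool:
        """Latest decision wins; no record at all means no consent."""
        for entry in reversed(self._entries):
            if entry.user_id == user_id and purpose in entry.purposes:
                return entry.purposes[purpose]
        return False
```

Downstream systems would call something like `allows(user_id, "marketing")` before using the data, so that consent travels with the pipeline rather than living only in the collection layer.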
Security and Privacy
Symmetric encryption should be used for data transfers across regions and boundaries. SHA-256 is the minimum cryptographic hash function that is accepted. Network security management techniques need to be deployed to protect data with a minimal blast radius. Using quasi-identifiers instead of personal identifiers is highly recommended. Security mitigation controls and automatic mitigation processes should be defined, and modelling physical threats is good practice. Privacy should be formalised in terms of re-identifiability. Good infrastructure, security within your systems and data sovereignty are essential for compliance.
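As an illustration of the SHA-256 requirement, the sketch below hashes an identifier with Python's standard `hashlib`. The normalisation step and the keyed (HMAC) variant are assumptions added here; the source only names SHA-256 as the minimum accepted hash function.

```python
import hashlib
import hmac


def hash_identifier(value: str) -> str:
    """Return the SHA-256 hex digest of a normalised identifier."""
    normalised = value.strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def keyed_hash_identifier(value: str, secret: bytes) -> str:
    """HMAC-SHA256 variant: without the secret, digests cannot be
    recomputed by guessing likely emails or phone numbers."""
    normalised = value.strip().lower()
    return hmac.new(secret, normalised.encode("utf-8"), hashlib.sha256).hexdigest()
```

A plain hash of an email or phone number can still be reversed by hashing candidate values, which is why a keyed variant is often preferred when the identifier space is small.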
Access Control
Role-based access control provides data protection. Permissions and roles need to be defined even for employee data. Privacy-enhancing techniques, including commercial solutions, are available today. Further, attribute-based access control, using attributes such as location, can also be applied, and access can be limited in time. The customer is the real owner of the data. Nevertheless, we also need to understand ownership within an organisation (sales, marketing, product teams etc.), and who consumes the data. This is also essential for auditing.
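The combination of role-based and attribute-based checks can be sketched as below. The role names, permissions, allowed locations and the eight-hour validity window are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical roles and permissions, for illustration only.
ROLE_PERMISSIONS = {
    "analyst": {"read:profile"},
    "dpo": {"read:profile", "read:audit", "delete:user"},
}


@dataclass
class AccessRequest:
    role: str
    permission: str
    location: str          # attribute used for the attribute-based check
    granted_at: datetime   # when the access grant was issued


def is_allowed(req: AccessRequest, now: datetime,
               allowed_locations=frozenset({"EU"}),
               max_age: timedelta = timedelta(hours=8)) -> bool:
    """Allow only if the role, location and time-window checks all pass."""
    has_permission = req.permission in ROLE_PERMISSIONS.get(req.role, set())
    in_location = req.location in allowed_locations
    still_valid = timedelta(0) <= now - req.granted_at <= max_age
    return has_permission and in_location and still_valid
```

Logging each `AccessRequest` together with the decision would also give the audit trail of who consumed which data.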
Data Catalogue
The data catalogue can be a single data model, or have a mesh kind of architecture. There are two semantics when managing data pipelines: push or pull. Based on input data, calculated and derived attributes can further be computed for downstream consumption. Although ML driven semantics can be deployed for data profiling use cases, it is useful to invest in a data catalogue. For a SaaS business model, it is recommended to create your own catalogue as a reusable component.
Data Retention Policy
Data can be obtained in either plain text or binary formats. The data retention policy can be worked out based on the TTL. Identity data, for example, can have a TTL of 30 days, whereas profile data can exist for a year. We must provide user data to the consumer within 24 hours if requested, and also have the means to support data portability. Users can also choose not to have their data retained for more than 90 days. The implementation should be flexible enough to support policies that can be easily extended or modified.
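The retention figures above can be turned into a small, extensible lookup, as in this sketch. The 30-day, one-year and 90-day values come from the text; the function names and the dictionary layout are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

# Default TTLs per data type, taken from the examples in the text.
DEFAULT_TTL_DAYS = {"identity": 30, "profile": 365}


def effective_ttl(data_type: str, user_max_days: Optional[int] = None) -> timedelta:
    """A user-chosen cap (e.g. 90 days) can shorten, never extend, the TTL."""
    days = DEFAULT_TTL_DAYS[data_type]
    if user_max_days is not None:
        days = min(days, user_max_days)
    return timedelta(days=days)


def is_expired(data_type: str, stored_at: datetime, now: datetime,
               user_max_days: Optional[int] = None) -> bool:
    return now - stored_at > effective_ttl(data_type, user_max_days)
```

Adding a new data type or changing a TTL is then a one-line change to `DEFAULT_TTL_DAYS`, which is the kind of flexibility the policy calls for.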
Standards and Certification
Certifications are useful for an organisation. The ISO 27000 series certifications, or BSI CSA* from the British Standards Institution (BSI), benefit companies of all sizes. Depending on the business vertical, there can be specific regulations. For example, the advertising industry follows the Transparency and Consent Framework (TCF) defined by the Interactive Advertising Bureau (IAB) consortium. The California Consumer Privacy Act (CCPA) is applicable in the state of California in the US. Compliance with the Health Insurance Portability and Accountability Act (HIPAA) is required of organisations working in the healthcare industry. There are also other practical frameworks and directives, such as LINDDUN and the Network and Information Security Directive (NISD).
Auditing and Governance
A company's compliance needs to be reviewed quarterly, bi-annually or annually, especially when data assets change. Good governance is required for compliance management of data, security and privacy. The privacy implementation assists the Data Protection Officer in preparing data impact assessment reports. An independent, external auditor can also provide recommendations for better scrutiny and compliance. It is important to train the organisation's staff in data security and privacy.
Reference Implementation
Zeotap used a bottom-up approach for its privacy implementation. The complete solution is built on an RDBMS and Elasticsearch using microservices. The backend services are implemented in Java and Golang, while the data pipeline is implemented in Scala. A Bloom filter is used for filtering. The solution has capabilities around registration, onboarding and updates during processing, quality metrics and verification semantics.
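The source does not say what is filtered or how the filter is configured. Assuming it is a Bloom filter over identifiers (for example, opted-out IDs), a minimal sketch of the technique looks like this; the class, its sizes and the hashing scheme are illustrative, not Zeotap's code.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive each position by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

A negative answer from `might_contain` is definitive, so most records can skip the expensive lookup; a positive answer still needs confirmation against the authoritative store.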
The format of the consent data is standardised. A new consent object is created, stored and archived to make deletion easier, and tag IDs are added by the consent processing layer before the data is sent for downstream consumption. Log grammar examples have been implemented to detect violation types. A proper grammar is used to run any SQL query.
Zeotap provides three ways to opt out of consent. First, globally, through the iOS or Android app from the app store. Second, through data partners. Third, consumers can opt out directly through their Facebook or Google logins. The privacy app and website are backed by an API architecture, and for GDPR compliance, Zeotap is able to delete data across systems within 72 hours.
Data is stored in data lakes and the HBase data system, and a truncated copy is available in a fast lookup store. It is thus essential to know where data is stored, and for what purpose. We also need to know which data partner contributed a particular piece of knowledge. Hence, lineage is important. Lineage also serves a second purpose: conflict resolution.
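Lineage and conflict resolution can be sketched together: each value carries the partner that supplied it, and a rule picks the winner when partners disagree. The source does not state Zeotap's resolution rule, so "most recent observation wins" is an assumption here, as are the `Observation` fields.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Observation:
    attribute: str
    value: str
    partner: str       # lineage: which data partner supplied the value
    observed_at: int   # epoch seconds


def resolve(observations: List[Observation]) -> Dict[str, Observation]:
    """Keep the most recent value per attribute, with its lineage intact."""
    latest: Dict[str, Observation] = {}
    for obs in observations:
        current = latest.get(obs.attribute)
        if current is None or obs.observed_at > current.observed_at:
            latest[obs.attribute] = obs
    return latest
```

Because the winning `Observation` keeps its `partner` field, a later deletion or opt-out request from one partner can be traced to exactly the values that partner contributed.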
For governance, Zeotap maintains a data catalogue, a policy store and a compliance catalogue. The path catalogue is a repository of registered paths where the data comes from, and processing is triggered on these paths based on events for downstream consumption. For each data set in the catalogue, the schema-level policy is applied and the necessary action is performed. Schema-level granularity applies to the entire data set. Audit logs are also generated in the process. In the end, we get a compliant data set. This is implemented as a Spark pipeline whose output is consumed by other downstream applications.
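The schema-level step can be sketched in plain Python rather than Spark, to keep the example self-contained. The policy dictionary, column names and action names are hypothetical; the point is that the same action applies to a column across the whole data set, and that an audit entry is emitted for each action.

```python
from typing import Dict, List, Tuple

# Hypothetical schema-level policy: column name -> action.
SCHEMA_POLICY = {"email": "nullify", "raw_ip": "drop_column"}


def apply_schema_policy(rows: List[dict],
                        policy: Dict[str, str]) -> Tuple[List[dict], List[dict]]:
    """Apply one action per column to every row and return an audit log."""
    compliant = []
    for row in rows:
        out = {}
        for column, value in row.items():
            action = policy.get(column, "keep")
            if action == "drop_column":
                continue  # the column is removed from the whole data set
            out[column] = None if action == "nullify" else value
        compliant.append(out)
    # One audit entry per policy action, recording how many rows it covered.
    audit_log = [{"column": column, "action": action, "rows": len(rows)}
                 for column, action in policy.items()]
    return compliant, audit_log
```

In a Spark pipeline the same idea would typically be expressed as column-level transformations on a DataFrame, with the audit entries written to a separate sink.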
Zeotap provides an API and also hosts an SFTP server for cloud exchanges with partners, where data is received daily. It adheres to the TCF created by the IAB because AdTech is its primary business. All data stays within the EU, and there are no cross-border data transfers in Zeotap. The implementation, from both a product and a technology perspective, has helped Zeotap achieve compliance with the GDPR.
Conclusion
Zeotap runs a data business for third parties using a DaaS model, and a SaaS business for users. Its implementation of a reusable technology stack for multiple data pipelines and data-driven products has helped it improve its overall privacy ecosystem for both consumers and data providers, and comply with GDPR and Data Protection Bill requirements. Zeotap is also actively researching differential privacy techniques, which add statistical noise to make identifying individuals in the data nearly impossible.
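To make the idea of adding statistical noise concrete, here is a minimal sketch of the Laplace mechanism applied to a count. This is a textbook technique and not necessarily what Zeotap is researching; the function names and the choice of epsilon are assumptions.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A smaller epsilon adds more noise and gives stronger privacy, at the cost of less accurate aggregates; the noise averages out over many queries, which is why a privacy budget across queries also has to be tracked.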