Meeting Report: July Round Table

Submitted Aug 1, 2024

AI and Data Privacy Discussion

Note: This discussion was held under the Chatham House Rule. All personal information has been anonymized.

1. Challenges with Current AI Models

1.1 Data Transparency

Most AI models are trained on undisclosed datasets, making it difficult for users to understand potential biases or limitations.
Even “open-source” models often don’t reveal full information about training data. This lack of transparency means these models are better classified as “open-weights” rather than truly open-source.
Participants expressed concern that companies like OpenAI, Meta, and others don’t provide detailed information about their training data sources.
This lack of transparency poses risks for users and organizations, as they can’t fully assess the potential legal or ethical implications of using these models.

1.2 Licensing Issues

Many datasets used for training carry restrictive or copyleft licenses (e.g., Creative Commons NonCommercial, GPL).
Participants discussed the risk of inadvertently infringing on these licenses when using AI-generated content, especially in commercial settings.
The group noted that the legal landscape is unclear regarding the use of AI-generated content derived from copyrighted or restrictively licensed material.
There was a discussion about how companies like Microsoft have banned or discouraged the use of certain AI tools for writing code due to these licensing concerns.
The group highlighted the challenge of tracing the origins of AI-generated content and ensuring compliance with various licensing requirements.

2. Data Privacy Concerns

2.1 Personal Data Protection

Discussion of the Digital Personal Data Protection (DPDP) Act and its implications:
- Participants noted that the DPDPA, enacted in 2023, is not yet fully in force in India because its implementing rules are still being developed.
- There was debate about whether the law applies to individuals using AI for personal or domestic purposes.
- The group discussed how the law might classify different entities as data fiduciaries and the varying levels of responsibility this entails.
- Concerns were raised about the competence of authorities to enforce such laws effectively.
Concerns about how personal data is handled during corporate acquisitions or mergers:
- A participant raised a scenario of a bank acquiring an insurance company and how customer data would be treated in such cases.
- The group discussed the need for updated terms and conditions and the option for customers to opt out or delete their data in such scenarios.
- It was noted that critical sectors like health and BFSI might be covered by additional sectoral laws on top of DPDPA.
Challenges in enforcing data protection laws and regulations:
- Participants expressed skepticism about the current enforcement of existing data protection laws.
- There was discussion about the need for clearer guidelines and stronger enforcement mechanisms.

2.2 Data Deletion and Inferences

While it may be possible to delete raw data, removing analytics and inferences is challenging:
- The group discussed how, even if personally identifiable information (PII) is deleted, inferences and analytics derived from that data often remain.
- There was concern that the influence of data cannot be removed once it has been used to train AI models.
AI models retain knowledge derived from training data, making complete removal of personal information difficult:
- Participants noted instances where AI models have reproduced verbatim content from training data, raising privacy concerns.
- The group discussed the need for better understanding of how AI models store and use information to address these privacy issues.

2.3 Anonymization Challenges

Importance of proper data anonymization before using it to train AI models:
- The group discussed various anonymization techniques, such as removing names, using age ranges instead of exact ages, and generalizing location data.
- Participants stressed the need for anonymization in sectors like healthcare, where patient privacy is crucial.
Risk of de-anonymization through secondary or tertiary characteristics:
- Concerns were raised about the ability of AI to link seemingly unrelated pieces of information to identify individuals.
- The group discussed how even anonymized data could potentially be de-anonymized using advanced AI techniques.
Need for advanced anonymization techniques, drawing on findings from the differential privacy domain:
- Participants with technical backgrounds mentioned the importance of differential privacy concepts in data anonymization.
- The group agreed on the need for continuous research and development in anonymization techniques to keep pace with AI advancements.
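
One differential privacy idea mentioned above can be sketched in a few lines. The example below is a minimal illustration of the Laplace mechanism applied to a counting query, not something presented at the round table. The function names (`laplace_noise`, `private_count`), the sample ages, and the choice of epsilon are all assumptions made for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two independent Exponential(1) draws is Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the items that satisfy predicate.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only to make the demo repeatable
    ages = [23, 35, 41, 29, 52, 38, 47, 31]  # made-up sample data
    noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
    print(f"Noisy count of people aged 40 or over: {noisy:.2f}")
```

A smaller epsilon adds more noise and gives stronger privacy at the cost of accuracy. A real deployment would also need to track the total privacy budget across queries, which this sketch does not attempt.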

3. AI in Specific Sectors

3.1 Healthcare

Discussion on pseudonymization of health records:
- Participants noted that health records are often pseudonymized rather than fully anonymized to retain necessary medical information.
- The group discussed the balance between protecting patient privacy and maintaining data utility for research and diagnosis.
Storing data in ranges (e.g., age groups) instead of exact values:
- A participant shared that many healthcare AI tools store data in ranges (e.g., 20-30 years old) rather than exact ages to protect privacy.
- The group discussed how this approach could be applied to other types of data to enhance privacy while maintaining usefulness.
Balancing privacy with the need for accurate diagnosis and research:
- Participants acknowledged the tension between protecting individual privacy and advancing medical research and diagnosis capabilities.
- The group discussed the need for clear guidelines and ethical frameworks for using AI in healthcare settings.
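
The pseudonymization and range-based storage described above could look roughly like the sketch below. It assumes a keyed hash (HMAC-SHA256) as the pseudonymization method and non-overlapping ten-year bands such as 20-29; the report does not say which method or band boundaries participants' tools actually use. The record fields and function names are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Map an identifier to a stable pseudonym using keyed HMAC-SHA256."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def age_band(age: int, width: int = 10) -> str:
    """Return a range such as '20-29' instead of the exact age."""
    if age < 0:
        raise ValueError("age must be non-negative")
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def pseudonymize_record(record: dict, secret_key: bytes) -> dict:
    """Drop the name, replace the ID with a pseudonym, and band the age."""
    return {
        "pseudonym": pseudonymize(record["patient_id"], secret_key),
        "age_band": age_band(record["age"]),
        "diagnosis": record["diagnosis"],  # clinical field kept for utility
    }


if __name__ == "__main__":
    key = b"replace-with-a-securely-stored-key"  # placeholder key for the demo
    record = {"patient_id": "P-1001", "name": "Example Patient", "age": 27,
              "diagnosis": "asthma"}  # made-up record
    print(pseudonymize_record(record, key))
```

Because the same key always yields the same pseudonym, records for one patient can still be linked for research, while anyone without the key cannot easily reverse the mapping. This is pseudonymization rather than anonymization: whoever holds the key can re-identify patients, so the key needs the same protection as the original identifiers.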

3.2 Education

Challenges in using AI for academic purposes:
- The group discussed how universities in the US were initially excited about using AI tools like ChatGPT for academic purposes but soon faced legal and ethical challenges.
- Participants noted the difficulty in balancing the benefits of AI in education with concerns about academic integrity and data privacy.
Legal hurdles in using student-submitted work to train AI models:
- A participant shared that in many cases, student-submitted work legally belongs to the student, limiting institutions’ ability to use this data for AI training.
- The group discussed the implications of this for developing AI tools specifically for educational contexts.
Need for clear guidelines on AI use in educational settings:
- Participants agreed that educational institutions need to develop clear policies on AI use, both for students and for institutional purposes.
- The group discussed the importance of educating students about AI capabilities and limitations in academic contexts.

4. Technical Considerations

4.1 GPU Dependency and Geopolitical Risks

Most AI models rely on GPUs owned by entities outside India:
- Participants discussed the heavy reliance on US-based companies for GPU hardware and cloud services necessary for running large AI models.
- Concerns were raised about the potential for service disruptions due to geopolitical tensions, citing the example of restrictions placed on Russia during the Ukraine conflict.
Potential risks if geopolitical tensions lead to service embargoes:
- The group discussed scenarios where access to crucial AI infrastructure could be cut off due to international conflicts or sanctions.
- Participants emphasized the need for countries to develop local AI capabilities to mitigate these risks.
Importance of developing local AI capabilities and infrastructure:
- The group discussed the need for countries like India to invest in their own AI infrastructure and GPU manufacturing capabilities.
- Participants noted ongoing efforts in various countries to reduce dependency on foreign technology for critical AI applications.

4.2 Open-Source Models as a Solution

Advantages of open-source or “open weights” models that can be run locally:
- Participants discussed how open-weights models like Meta’s Llama 3.1 offer capabilities similar to proprietary models but with more flexibility.
- The group noted that these models can be run on local infrastructure, reducing dependency on foreign cloud services.
Examples like Meta’s Llama 3.1 offering capabilities similar to proprietary models:
- A participant shared their experience comparing Llama 3.1 to GPT-4, noting similar levels of capability in many areas.
- The group discussed the rapid progress in open-source AI models and their potential to democratize AI technology.
Trend towards more accessible and efficient AI models:
- Participants noted the development of smaller, more efficient models that can run in browsers or on edge devices.
- The group discussed how this trend could lead to more privacy-preserving AI applications, as data processing could happen locally rather than on remote servers.

5. Practical Tips for Protecting Data

5.1 When Using AI Services

Avoid posting full queries or code snippets:
- Participants shared strategies for protecting sensitive information when using AI tools, such as breaking up queries into smaller, less identifiable parts.
- The group discussed the importance of being cautious about what information is shared with AI services, even if they claim to be secure.
Remove or anonymize proprietary information:
- A participant shared their practice of removing database names, method names, and other identifiable information before querying AI platforms.
- The group discussed tools and techniques for automating this process to make it more efficient and reliable.
Use placeholder names instead of actual identifiers:
- Participants suggested using generic names like “Fubar” instead of actual project or company names when interacting with AI tools.
- The group discussed the balance between providing enough context for useful AI responses and protecting sensitive information.
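
The placeholder approach above can be partly automated. The following is a minimal sketch, assuming the user maintains their own list of sensitive terms; the function name `scrub_identifiers`, the `NAME_n` placeholder format, and the sample query are hypothetical and were not discussed at the meeting.

```python
import re


def scrub_identifiers(text: str, sensitive_terms: list[str]) -> str:
    """Replace each sensitive term with a generic placeholder such as NAME_1."""
    scrubbed = text
    # Replace longer terms first so a short term does not break up a longer one.
    ordered = sorted(sensitive_terms, key=len, reverse=True)
    for index, term in enumerate(ordered, start=1):
        # re.escape treats the term as literal text, not as a regex pattern.
        pattern = re.compile(re.escape(term), flags=re.IGNORECASE)
        scrubbed = pattern.sub(f"NAME_{index}", scrubbed)
    return scrubbed


if __name__ == "__main__":
    # Made-up query containing hypothetical project, method, and database names.
    query = "Why does AcmeBilling.charge_customer() deadlock on acme_prod_db?"
    terms = ["acme_prod_db", "AcmeBilling", "charge_customer"]
    print(scrub_identifiers(query, terms))
```

The sketch only removes terms it has been told about, so it does not replace a manual review of the query before it is sent. It also assumes no sensitive term is a substring of the placeholder text itself.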

5.2 For Developers Integrating AI APIs

Investigate anonymization libraries and tools:
- Participants mentioned tools like LangChain and LlamaIndex that offer built-in anonymization features.
- The group discussed the need for developers to thoroughly evaluate these tools and understand their limitations.
Consider using PII (Personally Identifiable Information) redactor models:
- A participant shared their experience using specialized models designed to identify and redact PII from text.
- The group discussed the challenges of handling unstructured data and the need for robust PII detection algorithms.
Explore token replacement techniques to protect sensitive information:
- Participants discussed methods of replacing sensitive tokens in text with placeholders before processing with AI, then replacing them afterward.
- The group noted the complexity of implementing this approach effectively, especially for real-time applications.

5.3 Enterprise Solutions

Discussion of enterprise-grade tools like Portkey for data protection:
- Participants mentioned commercial solutions designed specifically for enterprise-level AI data protection.
- The group discussed the trade-offs between using these solutions and developing in-house tools.
Exploration of open-source alternatives for anonymization and data protection:
- Participants expressed interest in finding or developing open-source tools that could provide similar functionality to commercial solutions.
- The group discussed the potential for community-driven development of privacy-preserving AI tools.

7. Future Directions and Action Items

Maintain and update the comprehensive table comparing how different AI platforms handle user data:
- The group will collaboratively maintain this table.
- Participants will regularly review and update the information on opt-out options, licensing risks, and data usage policies for various AI platforms.
- The community will be encouraged to contribute new findings and report changes in platform policies.
Evolve best practices for using AI tools while protecting proprietary information:
- Building on the existing document, the group will continuously refine guidelines for safe AI usage in professional settings.
- Participants will share experiences and techniques for anonymizing queries, protecting sensitive data, and avoiding inadvertent disclosure of proprietary information.
- The document will be regularly updated to reflect new challenges and solutions identified by the community.
Investigate and compare anonymization tools and libraries:
- Participants will research and test various open-source and commercial anonymization tools.
- The group will create a shared repository of findings, including pros and cons, ease of use, and effectiveness of different tools.
- Regular updates and discussions will be held to share new discoveries and best practices for implementing these tools in various scenarios.
Collaborate on open-source solutions for data protection in AI applications:
- The group will identify gaps in existing open-source tools for AI data protection.
- Participants with technical expertise will be encouraged to contribute to or initiate open-source projects addressing these gaps.
- Regular progress updates and collaboration sessions will be organized to support these development efforts.
Stay informed about evolving regulations and legal frameworks:
- Participants will monitor key sources of information on AI and data privacy regulations.
- The group will establish a system for sharing important regulatory updates, possibly through a shared document or regular email digests.
- Periodic discussions will be held to analyze the implications of new regulations on AI use in various sectors.