Responsible AI

Empowering developers to build beneficial AI systems

Mayank Kumar

@munk

Anwesha Sen

@anwesha25 Editor

Meeting Report: July Round Table

Submitted Aug 1, 2024

AI and Data Privacy Discussion

Note: This discussion was held under the Chatham House Rule. All personal information has been anonymized.

1. Challenges with Current AI Models

1.1 Data Transparency

  • Most AI models are trained on undisclosed datasets, making it difficult for users to understand potential biases or limitations.
  • Even “open-source” models often don’t reveal full information about training data. This lack of transparency means these models are better classified as “open-weights” rather than truly open-source.
  • Participants expressed concern that companies like OpenAI, Meta, and others don’t provide detailed information about their training data sources.
  • This lack of transparency poses risks for users and organizations, as they can’t fully assess the potential legal or ethical implications of using these models.

1.2 Licensing Issues

  • Many datasets used for training have restrictive licenses (e.g., Creative Commons NC, GPL).
  • Participants discussed the risk of inadvertently infringing on these licenses when using AI-generated content, especially in commercial settings.
  • The group noted that the legal landscape is unclear regarding the use of AI-generated content derived from copyrighted or restrictively licensed material.
  • There was a discussion about how companies like Microsoft have banned or discouraged the use of certain AI tools for writing code due to these licensing concerns.
  • The group highlighted the challenge of tracing the origins of AI-generated content and ensuring compliance with various licensing requirements.

2. Data Privacy Concerns

2.1 Personal Data Protection

  • Discussion of the Digital Personal Data Protection Act (DPDPA) and its implications:

    • Participants noted that the DPDPA is still being developed and is not yet fully enforced in India.
    • There was debate about whether the law applies to individuals using AI for personal or domestic purposes.
    • The group discussed how the law might classify different entities as data fiduciaries and the varying levels of responsibility this entails.
    • Concerns were raised about the competence of authorities to enforce such laws effectively.
  • Concerns about how personal data is handled during corporate acquisitions or mergers:

    • A participant raised a scenario of a bank acquiring an insurance company and how customer data would be treated in such cases.
    • The group discussed the need for updated terms and conditions and the option for customers to opt out or have their data deleted in such scenarios.
    • It was noted that critical sectors like health and BFSI (banking, financial services, and insurance) might be covered by additional sectoral laws on top of the DPDPA.
  • Challenges in enforcing data protection laws and regulations:

    • Participants expressed skepticism about the current enforcement of existing data protection laws.
    • There was discussion about the need for clearer guidelines and stronger enforcement mechanisms.

2.2 Data Deletion and Inferences

  • While it may be possible to delete raw data, removing analytics and inferences is challenging:

    • The group discussed how, even if personally identifiable information (PII) is deleted, inferences and analytics derived from that data often remain.
    • There was concern that the influence of a piece of data cannot be removed once it has been used to train an AI model.
  • AI models retain knowledge derived from training data, making complete removal of personal information difficult:

    • Participants noted instances where AI models have reproduced verbatim content from training data, raising privacy concerns.
    • The group discussed the need for better understanding of how AI models store and use information to address these privacy issues.

2.3 Anonymization Challenges

  • Importance of proper data anonymization before using it to train AI models:

    • The group discussed various anonymization techniques, such as removing names, using age ranges instead of exact ages, and generalizing location data.
    • Participants stressed the need for anonymization in sectors like healthcare, where patient privacy is crucial.
  • Risk of de-anonymization through secondary or tertiary characteristics:

    • Concerns were raised about the ability of AI to link seemingly unrelated pieces of information to identify individuals.
    • The group discussed how even anonymized data could potentially be de-anonymized using advanced AI techniques.
  • Need for advanced anonymization techniques, drawing on findings from the field of differential privacy:

    • Participants with technical backgrounds mentioned the importance of differential privacy concepts in data anonymization.
    • The group agreed on the need for continuous research and development in anonymization techniques to keep pace with AI advancements.
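
The re-identification risk and the differential privacy idea mentioned above can be illustrated with a minimal Python sketch. It is not a tool that was discussed at the round table: the sample records, the field names, and the helpers `k_anonymity` and `dp_count` are hypothetical. The first helper reports the size of the smallest group of records that share the same secondary characteristics, so a result of 1 means at least one person is unique and easy to single out. The second answers a counting query with Laplace noise of scale 1/epsilon, the usual choice for a query whose result changes by at most 1 when one person is added or removed.

```python
import random
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def dp_count(records, predicate, epsilon, rng):
    """Count matching records and add Laplace noise with scale 1/epsilon.

    The difference of two independent exponential draws with rate epsilon
    follows a Laplace distribution with scale 1/epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# Hypothetical, already-generalized records (no names, age ranges, city only).
records = [
    {"age_range": "20-29", "city": "Bengaluru", "diagnosis": "asthma"},
    {"age_range": "20-29", "city": "Bengaluru", "diagnosis": "diabetes"},
    {"age_range": "30-39", "city": "Pune", "diagnosis": "asthma"},
]

# The Pune record is the only one in its group, so k is 1.
k = k_anonymity(records, ["age_range", "city"])

# A noisy answer to "how many records mention asthma?" (true answer: 2).
noisy = dp_count(records, lambda r: r["diagnosis"] == "asthma", 1.0, random.Random())
```

A smaller `epsilon` adds more noise and gives stronger privacy at the cost of accuracy. Choosing its value, and tracking how much is spent across repeated queries, is a policy decision that this sketch does not address.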

3. AI in Specific Sectors

3.1 Healthcare

  • Discussion on pseudonymization of health records:

    • Participants noted that health records are often pseudonymized rather than fully anonymized to retain necessary medical information.
    • The group discussed the balance between protecting patient privacy and maintaining data utility for research and diagnosis.
  • Storing data in ranges (e.g., age groups) instead of exact values:

    • A participant shared that many healthcare AI tools store data in ranges (e.g., 20-30 years old) rather than exact ages to protect privacy.
    • The group discussed how this approach could be applied to other types of data to enhance privacy while maintaining usefulness.
  • Balancing privacy with the need for accurate diagnosis and research:

    • Participants acknowledged the tension between protecting individual privacy and advancing medical research and diagnosis capabilities.
    • The group discussed the need for clear guidelines and ethical frameworks for using AI in healthcare settings.
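
As a rough illustration of the pseudonymization and range-based storage described above, the following Python sketch replaces a patient identifier with a keyed hash and stores an age bucket instead of the exact age. The record layout, the `SECRET_KEY` value, and the function names are assumptions made for this example, not a description of any healthcare system mentioned in the discussion.

```python
import hashlib
import hmac

# Hypothetical key. In practice it would be stored separately from the dataset.
SECRET_KEY = b"replace-with-a-key-kept-outside-the-dataset"


def pseudonymize(patient_id, key=SECRET_KEY):
    """Keyed hash: the same patient always gets the same token,
    but the token cannot be linked back to the ID without the key."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def age_range(age, width=10):
    """Map an exact age to a non-overlapping bucket such as '20-29'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def prepare_record(record):
    """Keep the medical content, drop the name, generalize the age."""
    return {
        "patient_token": pseudonymize(record["patient_id"]),
        "age_range": age_range(record["age"]),
        "diagnosis": record["diagnosis"],
    }


raw = {"patient_id": "MRN-001", "name": "A. Patient", "age": 27, "diagnosis": "asthma"}
safe = prepare_record(raw)
```

Because the token is stable, records for the same patient can still be linked for research. That is what makes this pseudonymization rather than anonymization: anyone holding the key can recompute the tokens from known identifiers.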

3.2 Education

  • Challenges in using AI for academic purposes:

    • The group discussed how universities in the US were initially excited about using AI tools like ChatGPT for academic purposes but soon faced legal and ethical challenges.
    • Participants noted the difficulty in balancing the benefits of AI in education with concerns about academic integrity and data privacy.
  • Legal hurdles in using student-submitted work to train AI models:

    • A participant shared that in many cases, student-submitted work legally belongs to the student, limiting institutions’ ability to use this data for AI training.
    • The group discussed the implications of this for developing AI tools specifically for educational contexts.
  • Need for clear guidelines on AI use in educational settings:

    • Participants agreed that educational institutions need to develop clear policies on AI use, both for students and for institutional purposes.
    • The group discussed the importance of educating students about AI capabilities and limitations in academic contexts.

4. Technical Considerations

4.1 GPU Dependency and Geopolitical Risks

  • Most AI models rely on GPUs owned by entities outside India:

    • Participants discussed the heavy reliance on US-based companies for GPU hardware and cloud services necessary for running large AI models.
    • Concerns were raised about the potential for service disruptions due to geopolitical tensions, citing the example of restrictions placed on Russia during the Ukraine conflict.
  • Potential risks if geopolitical tensions lead to service embargoes:

    • The group discussed scenarios where access to crucial AI infrastructure could be cut off due to international conflicts or sanctions.
    • Participants emphasized the need for countries to develop local AI capabilities to mitigate these risks.
  • Importance of developing local AI capabilities and infrastructure:

    • The group discussed the need for countries like India to invest in their own AI infrastructure and GPU manufacturing capabilities.
    • Participants noted ongoing efforts in various countries to reduce dependency on foreign technology for critical AI applications.

4.2 Open-Source Models as a Solution

  • Advantages of open-source or “open-weights” models that can be run locally:

    • Participants discussed how open-source models like Meta’s Llama 3.1 offer capabilities similar to proprietary models but with more flexibility.
    • The group noted that these models can be run on local infrastructure, reducing dependency on foreign cloud services.
  • Examples like Meta’s Llama 3.1 offering capabilities similar to proprietary models:

    • A participant shared their experience comparing Llama 3.1 to GPT-4, noting similar levels of capability in many areas.
    • The group discussed the rapid progress in open-source AI models and their potential to democratize AI technology.
  • Trend towards more accessible and efficient AI models:

    • Participants noted the development of smaller, more efficient models that can run in browsers or on edge devices.
    • The group discussed how this trend could lead to more privacy-preserving AI applications, as data processing could happen locally rather than on remote servers.

5. Practical Tips for Protecting Data

5.1 When Using AI Services

  • Avoid posting full queries or code snippets:

    • Participants shared strategies for protecting sensitive information when using AI tools, such as breaking up queries into smaller, less identifiable parts.
    • The group discussed the importance of being cautious about what information is shared with AI services, even if they claim to be secure.
  • Remove or anonymize proprietary information:

    • A participant shared their practice of removing database names, method names, and other identifiable information before querying AI platforms.
    • The group discussed tools and techniques for automating this process to make it more efficient and reliable.
  • Use placeholder names instead of actual identifiers:

    • Participants suggested using generic names like “Fubar” instead of actual project or company names when interacting with AI tools.
    • The group discussed the balance between providing enough context for useful AI responses and protecting sensitive information.
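
One way the manual practice above might be automated is sketched below in Python. It assumes you keep your own list of sensitive identifiers (database names, method names, project names). The function name `scrub` and the sample identifiers are hypothetical, and matching is exact and case-sensitive, so anything missing from the list passes through unchanged.

```python
import re


def scrub(text, sensitive_names):
    """Replace each known sensitive identifier with a generic placeholder."""
    if not sensitive_names:
        return text
    # Longest names first, so 'acme_payments_db' is matched before a shorter
    # name that happens to be contained in it.
    ordered = sorted(sensitive_names, key=len, reverse=True)
    placeholders = {name: f"name_{i}" for i, name in enumerate(ordered, start=1)}
    pattern = re.compile("|".join(re.escape(name) for name in ordered))
    return pattern.sub(lambda m: placeholders[m.group(0)], text)


query = "Why does acme_payments_db.fetchLedger() time out in AcmePay?"
clean = scrub(query, ["acme_payments_db", "fetchLedger", "AcmePay"])
# clean == "Why does name_1.name_2() time out in name_3?"
```

The structure of the question survives, which is usually enough context for a useful answer. The placeholders can be mapped back by hand when reading the response.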

5.2 For Developers Integrating AI APIs

  • Investigate anonymization libraries and tools:

    • Participants mentioned tools like LangChain and LlamaIndex that offer built-in anonymization features.
    • The group discussed the need for developers to thoroughly evaluate these tools and understand their limitations.
  • Consider using PII (Personally Identifiable Information) redactor models:

    • A participant shared their experience using specialized models designed to identify and redact PII from text.
    • The group discussed the challenges of handling unstructured data and the need for robust PII detection algorithms.
  • Explore token replacement techniques to protect sensitive information:

    • Participants discussed methods of replacing sensitive tokens in text with placeholders before processing with AI, then restoring the original values afterward.
    • The group noted the complexity of implementing this approach effectively, especially for real-time applications.
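
The token replacement approach can be sketched in a few lines of Python. This is a simplified illustration, not the method of any library named above: the two regular expressions only catch common email and phone formats, and the `mask` and `unmask` helpers and the placeholder format are assumptions made for this example. A real deployment would more likely rely on a dedicated PII detection model.

```python
import re

# Deliberately simple patterns. They will miss many real-world formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d -]{8,}\d")


def mask(text):
    """Replace detected values with placeholders; return the text and the mapping."""
    mapping = {}

    def replacer(kind):
        def _sub(match):
            value = match.group(0)
            # Reuse the placeholder if the same value appears again.
            for token, original in mapping.items():
                if original == value:
                    return token
            token = f"<{kind}_{len(mapping) + 1}>"
            mapping[token] = value
            return token
        return _sub

    text = EMAIL.sub(replacer("EMAIL"), text)
    text = PHONE.sub(replacer("PHONE"), text)
    return text, mapping


def unmask(text, mapping):
    """Put the original values back into a response that contains placeholders."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


prompt = "Contact asha@example.com or +91 98765 43210 about the refund."
masked, mapping = mask(prompt)
# masked is what would be sent to the AI service; the mapping stays local.
reply = "I will email <EMAIL_1> today."  # stand-in for a model response
restored = unmask(reply, mapping)
```

Restoration only works if the model returns the placeholders unchanged. Handling responses that alter or drop them is part of the complexity the group noted, particularly for streaming or real-time use.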

5.3 Enterprise Solutions

  • Discussion of enterprise-grade tools like Portkey for data protection:

    • Participants mentioned commercial solutions designed specifically for enterprise-level AI data protection.
    • The group discussed the trade-offs between using these solutions and developing in-house tools.
  • Exploration of open-source alternatives for anonymization and data protection:

    • Participants expressed interest in finding or developing open-source tools that could provide similar functionality to commercial solutions.
    • The group discussed the potential for community-driven development of privacy-preserving AI tools.

7. Future Directions and Action Items

  • Maintain and update the comprehensive table comparing how different AI platforms handle user data:

    • The group will collaboratively maintain this table.
    • Participants will regularly review and update the information on opt-out options, licensing risks, and data usage policies for various AI platforms.
    • The community will be encouraged to contribute new findings and report changes in platform policies.
  • Evolve best practices for using AI tools while protecting proprietary information:

    • Building on the existing document, the group will continuously refine guidelines for safe AI usage in professional settings.
    • Participants will share experiences and techniques for anonymizing queries, protecting sensitive data, and avoiding inadvertent disclosure of proprietary information.
    • The document will be regularly updated to reflect new challenges and solutions identified by the community.
  • Investigate and compare anonymization tools and libraries:

    • Participants will research and test various open-source and commercial anonymization tools.
    • The group will create a shared repository of findings, including pros and cons, ease of use, and effectiveness of different tools.
    • Regular updates and discussions will be held to share new discoveries and best practices for implementing these tools in various scenarios.
  • Collaborate on open-source solutions for data protection in AI applications:

    • The group will identify gaps in existing open-source tools for AI data protection.
    • Participants with technical expertise will be encouraged to contribute to or initiate open-source projects addressing these gaps.
    • Regular progress updates and collaboration sessions will be organized to support these development efforts.
  • Stay informed about evolving regulations and legal frameworks:

    • Participants will monitor key sources of information on AI and data privacy regulations.
    • The group will establish a system for sharing important regulatory updates, possibly through a shared document or regular email digests.
    • Periodic discussions will be held to analyze the implications of new regulations on AI use in various sectors.

