The Fifth Elephant 2024 Annual Conference (12th & 13th July)

Maximising the Potential of Data — Discussions around data science, machine learning & AI

Chaitanya

Thejesh GN

@thej Author

Need for new licenses in this age of Generative AI

Submitted Jun 1, 2024

Table of Contents

  1. Introduction

    • The disruptive nature of AI technology
    • Questions about current licenses
  2. The Current Landscape

    • Fifth Elephant conference discussion
    • Parallels to open source vs. proprietary software debates
    • Example of low-cost Telugu ASR model development
  3. Key Issues

    • Need for strong copyright
    • Copyleft and its effectiveness
    • Level of openness in AI models
    • Rights and duties framing
    • Data collection and ownership
    • Model training and usage
    • Open washing concerns
    • Enforcement challenges
    • License creation approaches
    • Cloud and API scenarios
  4. The Path Forward

    • Tiered openness
    • Transitive licensing
    • Use-case differentiation
    • Community standards
    • Call for transparency
    • New license development
  5. A Call to Action

    • Importance of community involvement
    • Invitation for diverse perspectives

Discussion Summary

The Need for New Licenses in the Age of AI

Is AI as a technology disruptive enough that we need to come up with new licenses? Are current licenses good enough to protect creators and content? Are licenses like the GPL good enough in a world where AI can take a GPL-licensed codebase and rewrite it in minutes, so that the result can then be released under a different license?

The Current Landscape

Recently, a group of technologists, developers, and legal experts gathered for a “Birds of a Feather” session at the Fifth Elephant conference to discuss these questions, and in particular the pressing need for new AI-specific licenses. Chaitanya opened the discussion by drawing parallels to the old open source wars between Linux and Microsoft, and argued that we are in a similar position now with AI. He also described the effort with the Swecha team, which crowdsourced data and built a Telugu ASR model from scratch, without GPUs, for under $25. If a model can be built for under $25 without GPUs, that shows the power of the data collected. The idea for a new license grew out of the wish to protect this data with the right license. Parallels to the printing press and the creation of copyright laws were also discussed.
The conversation highlighted several key issues:

  1. Strong copyright needed? Historically, open source communities have not pushed for new licenses, but this era of AI is making them ask for new licenses.
  2. Copyleft works because of strong copyright: Copyright has been a failure in India, and hence copyleft might not be an ideal solution there.
  3. Copyleft as deterrent: The role of copyleft as a deterrent for mega-corporations was discussed. Even though enforcement might be a problem, the legal teams of mega-corporations might treat a strong copyleft license as a deterrent, which offers protection for the future.
  4. Level of openness: A body is needed that can certify the level of openness of an AI model (open data, open weights, etc.).
  5. Rights and duties framing: A license can be framed as a set of freedoms/rights we get and a set of duties the stakeholders have to perform.
  6. Data Collection and Ownership: Unlike traditional software, AI models are trained on vast datasets. Who owns this data? How do we attribute it correctly?
  7. Model Training and Usage: The line between model development and usage is blurring. How do we create licenses that cover the entire lifecycle of an AI model?
  8. Open Washing: There’s a growing concern about companies claiming “openness” without truly adhering to open principles. How can we define and enforce genuine openness in AI?
  9. Enforcement Challenges: Especially in countries with weak copyright laws, enforcing AI licenses could prove difficult. How do we create licenses that are both meaningful and enforceable?
  10. Using an existing license, adapting one, or creating a new license from scratch: Each option has pros and cons. The participants favored adapting an existing license, such as the OpenStreetMap license, and extending it to cover new terms like fine-tuning.
  11. Cloud and API Scenarios: With the rise of AI-as-a-Service, how do we address the unique challenges of cloud-based AI deployments and API usage?

The Path Forward

While we don’t have all the answers yet, the discussion highlighted several potential approaches:

  • Tiered Openness: Creating a system that defines different levels of openness for AI models and datasets, similar to Creative Commons licenses.
  • Transitive Licensing: Developing licenses that account for the complex web of data sources used in model training.
  • Use-Case Differentiation: Crafting licenses that distinguish between individual, institutional, and commercial uses of AI models and datasets.
  • Community Standards: Establishing community-driven standards for what constitutes a truly “open” AI model or dataset.
  • Call for transparency: AI training datasets should be made open so that biases can be identified and fixed. Consent should be off by default: if an AI company wants to train its models on some data, it has to get explicit consent.
  • New license: A new license for AI is needed. We need to finalize the rights and duties of such a license, which can then be handed off to the legal community to draft the license itself.
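
As a rough illustration of the tiered openness idea above, the sketch below classifies a model release by which artifacts are published. The tier names, their ordering, and the `ModelRelease` fields are all assumptions made for this example; they were not defined in the session and are not an existing standard.

```python
from dataclasses import dataclass
from enum import IntEnum


class OpennessTier(IntEnum):
    """Hypothetical tiers; names and ordering are illustrative only."""
    CLOSED = 0        # weights not released
    OPEN_WEIGHTS = 1  # weights released, but not training code
    OPEN_CODE = 2     # weights and training code released, but not data
    OPEN_DATA = 3     # weights, training code, and training data released


@dataclass(frozen=True)
class ModelRelease:
    """Hypothetical description of what a model publisher has released."""
    name: str
    weights_released: bool
    training_code_released: bool
    training_data_released: bool


def openness_tier(release: ModelRelease) -> OpennessTier:
    """Return the highest tier whose requirements are all met.

    Simplifying assumption: tiers are cumulative, so a release without
    weights is treated as CLOSED even if other artifacts are published.
    """
    if not release.weights_released:
        return OpennessTier.CLOSED
    if not release.training_code_released:
        return OpennessTier.OPEN_WEIGHTS
    if not release.training_data_released:
        return OpennessTier.OPEN_CODE
    return OpennessTier.OPEN_DATA


if __name__ == "__main__":
    weights_only = ModelRelease(
        name="example-model",
        weights_released=True,
        training_code_released=False,
        training_data_released=False,
    )
    print(openness_tier(weights_only).name)  # prints OPEN_WEIGHTS
```

A real scheme would probably need more dimensions than three booleans (for example, the license attached to each artifact), which is part of what a certifying body or community standard would have to decide.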

A Call to Action

As we stand on the brink of an AI-driven future, it’s crucial that we, as a community, come together to address these challenges. We need to create licensing models that foster innovation while protecting the rights of creators and users alike.
The conversation is just beginning, and your voice matters. Whether you’re an AI researcher, a developer, a legal expert, or simply someone interested in the ethical implications of AI, we invite you to join this important discussion.
Together, we can shape the future of AI licensing and ensure that the principles of openness, fairness, and innovation continue to thrive in this new era.

Background

In this rapidly evolving digital era, data acts as the fuel powering the relentless growth of artificial intelligence. As we stand on the brink of technological revolutions, it becomes crucial to understand not just how data drives AI, but also the ethical and legal frameworks that must evolve with it. We should look at licensing as a tool to level the playing field, where incumbents currently reap most of the benefits of AI because of their access to data and compute.

In this talk, we will walk through our journey of trying to find the right license for the crowdsourced data collected for the Telugu ASR system and for the model built using that data.

Target Audience

The primary audience for this talk includes data scientists and AI researchers, legal professionals with a focus on technology, open source and community contributors, and policy makers and regulators.
Academics and students in technology and law, tech entrepreneurs, and start-up founders can also benefit from this talk.

Outline

  • Introduction to the Project

    • Overview of the Telugu ASR system
    • Importance of crowdsourced data in ASR technologies
  • Challenges with Licensing Crowdsourced Data

    • Legal complexities of using crowdsourced data
    • Ethical considerations in data collection and usage
  • Requirements for an Effective License

    • Compliance with data protection regulations (e.g., GDPR, CCPA)
    • Flexibility to accommodate contributions from a diverse crowd
    • Clarity on data usage rights and restrictions
  • Journey to Finding the Right License

    • Evaluation of existing licenses (e.g., Creative Commons, MIT, proprietary licenses)
    • Customizing license elements to suit the specific needs of AI datasets, covering new terms like fine-tuning and model weights
    • Engagement with legal experts and the community
  • Next Steps and Future Directions

    • Open house for consultations
    • Finalizing the license and release
  • Q&A

    • Open floor for questions and further discussion

Impact

Introducing a specialized license for crowdsourced data, akin to the impact of the GPL for open-source software, could fundamentally transform how data is utilized in technological innovations. It would promote a collaborative environment where data can be freely shared and enhanced, while ensuring compliance with ethical standards and data protection laws. Such a license would encourage broader participation and innovation, reduce legal barriers, and ensure the sustainability of data resources. It might also help level the playing field by making sure the benefits don't accrue only to the mega-corporations in AI. By clarifying usage rights and responsibilities, this new licensing framework could set industry standards for data handling, leading to more responsible and impactful technological advancements across various sectors.

Mindmap
# Introduction to the Project
## Introduction to the Project
### Overview of the Telugu ASR system
#### Development Goals
#### Technological Framework
#### Performance Metrics
### Importance of crowdsourced data in ASR technologies
#### Data Volume and Diversity
#### Quality Improvement
#### Community Engagement
## Challenges with Licensing Crowdsourced Data
### Legal complexities of using crowdsourced data
#### IPR
#### Data Ownership
### Ethical considerations in data collection and usage
#### Informed Consent
#### Bias
## Requirements for an Effective License
### Compliance with data protection regulations (e.g., GDPR, CCPA)
### Flexibility to accommodate contributions from a diverse crowd
### Clarity on data usage rights and restrictions
## Journey to Finding the Right License
### Evaluation of existing licenses
### Customizing license elements to suit the specific needs of AI datasets, covering new terms like fine-tuning and model weights
#### Balancing Flexibility and Control
#### Stakeholder Input
#### New Terms and Conditions
### Engagement with legal experts and the community
## Next Steps and Future Directions
### Open house for consultations
#### Community Involvement
#### Iterative Refinement
### Finalizing the license and release
## Q&A
### Open floor for questions and further discussion
