Open Source AI Hackathon 2024

GenAI makers and creators contest and showcase


Membership

The Fifth Elephant annual membership

The Fifth Elephant membership is valid for one year (12 months). Members get the following benefits:

  • Participation in all online peer review sessions.
  • Access to all recordings from online reviews.
  • Priority access to all offline meet-ups and online workshops hosted by The Fifth Elephant during the one year period.
  • Access to The Fifth Elephant’s Annual Conference on 18 and 19 July 2025 in Bangalore - in-person and virtually (via live stream).

Corporate Members-only benefits (bulk ticket purchase):

  • Transfer of memberships across individuals in the organization.

Memberships can be cancelled within 1 hour of purchase.

Membership fee: ₹5100 (sale at this price closes on December 31, 2025)

Cancellation and refund policy

Memberships can be cancelled within 1 hour of purchase

Workshop tickets can be cancelled or transferred up to 24 hours prior to the workshop.

For further queries, please write to us at support@hasgeek.com or call us at +91 7676 33 2020.

Akash Kamalesh (@asphytheghoul), Anirudh Lakhotia, Tanistha Hota (@tanisthahota)

Multilingual Mixture of Experts for Domain Adaptive Pre-training of Large Language Models

Submitted Jan 23, 2024

Baarat

Problem Statement

  • We aim to revolutionize the way classical LMs adapt to different languages and domains.
  • Our goal is to create a group of domain-adaptive pretrained models, each specializing in a unique language, through the application of Mixture of Experts (MoE) in our Domain Adaptive Pre-training (a minimal routing sketch follows this list).
  • This approach will enable us to leverage the strengths of individual models, enhancing their performance across various domains.
  • We are pushing the boundaries of current techniques, seeking to create a more efficient and versatile modeling strategy for LLMs.
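
Since the core idea is routing tokens to per-language/per-domain experts, here is a minimal, illustrative PyTorch sketch of a Switch-style top-1 MoE layer. The class name, dimensions, and feed-forward expert design are assumptions for illustration, not the project's implementation, and the load-balancing loss used in Switch Transformers is omitted for brevity.

```python
# Minimal sketch of Switch-style top-1 routing over a set of experts.
# Names and dimensions are illustrative, not the project's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwitchMoELayer(nn.Module):
    """Routes each token to exactly one expert (top-1), as in Switch Transformers."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # token -> expert logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten to individual tokens
        tokens = x.reshape(-1, x.size(-1))
        probs = F.softmax(self.router(tokens), dim=-1)   # routing distribution
        top_prob, top_idx = probs.max(dim=-1)            # top-1 expert per token
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Scale by the gate probability so the router receives gradient.
                out[mask] = expert(tokens[mask]) * top_prob[mask].unsqueeze(-1)
        return out.reshape_as(x)


# Example: 4 experts (e.g. one per language) over toy activations.
layer = SwitchMoELayer(d_model=64, d_ff=256, num_experts=4)
y = layer(torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 64])
```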

Roadmap

Project Baarat is an open-source initiative to leverage the power of LLMs for Indic-NLP tasks. We aim to build continually pre-trained, task-specific language models in a Mixture of Experts (MoE) setup. We plan to build a multilingual and cross-lingual LLM that is:

  1. Pre-trained on a large text corpus containing various sources of knowledge, including crawled Wikipedia articles, textbooks, news, social media sites, magazines, etc.

  2. Fine-tuned on different downstream tasks. We first train a 7B LLaMa-2 model on a text corpus in the target language and save it as the base model (a pre-training sketch follows this list). We have considered the following downstream tasks to incorporate in the fine-tuning process:

  • Machine Translation
  • Mathematical and Logical Reasoning
  • Question Answering
  • Instruct Fine-Tuning
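As a rough illustration of step 1 and the base-model training mentioned in step 2, here is a minimal sketch of continued pre-training of a LLaMA-2 7B model on a target-language corpus with Hugging Face Transformers. The dataset file, output names, and hyperparameters are hypothetical placeholders, not the project's actual configuration.

```python
# Illustrative sketch of continued (language-adaptive) pre-training of LLaMA-2 7B
# on a target-language corpus. Paths and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"      # gated model; requires access approval
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token    # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical monolingual corpus (e.g. cleaned Kannada text), one document per line.
raw = load_dataset("text", data_files={"train": "kannada_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="baarat-kn-base",         # illustrative output name
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,                           # assumes bf16-capable hardware
        logging_steps=50,
    ),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("baarat-kn-base")         # saved as the language-specific base model
```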

Key Features ✨

  • Tokenizers for Indian Languages: Robust tokenization tools tailored to the unique structures of regional Indian languages (a tokenizer-training sketch follows this list).
  • Fine-tuned Language Models: Large Language Models (LLMs) fine-tuned for Indian languages to understand and generate text with high accuracy.
  • Collection of Models and Data: A completely free and open-source hub of datasets and models, bundled into a single Python module with documentation for easy usage.
  • High-Quality Datasets: A suite of cleaned datasets ready for your own downstream training purposes.
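
The tokenizer work mentioned above could look roughly like the following sketch, which trains a BPE SentencePiece model on a hypothetical Kannada corpus and adds its pieces to the base LLaMA-2 tokenizer. File names, vocabulary size, and the simplified merge step are assumptions, not Baarat's exact pipeline.

```python
# Sketch: train a BPE SentencePiece tokenizer for an Indian language and extend
# the LLaMA-2 tokenizer vocabulary with it. All names and sizes are illustrative.
import sentencepiece as spm
from transformers import AutoTokenizer

# 1. Train a BPE SentencePiece model on a (hypothetical) cleaned Kannada corpus.
spm.SentencePieceTrainer.train(
    input="kannada_corpus.txt",
    model_prefix="kn_bpe",
    vocab_size=16000,
    model_type="bpe",
    character_coverage=1.0,   # keep the full Kannada script
)

# 2. Add the new language-specific pieces to the base LLaMA-2 tokenizer.
#    (A simplified merge; a production pipeline would rebuild the SentencePiece model.)
sp = spm.SentencePieceProcessor(model_file="kn_bpe.model")
new_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

base = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
added = base.add_tokens([p for p in new_pieces if p not in base.get_vocab()])
print(f"Added {added} new tokens; remember to resize the model's embeddings:")
print("model.resize_token_embeddings(len(base))")
```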

GitHub Repository

https://github.com/asphytheghoul/Baarat

Proposed Solution (subject to changes)

Architecture

Presentation and Demo Video

Here is the link to our presentation: https://docs.google.com/presentation/d/1in4MhQkY6N5SnO-PJ9OVhVIe9K6jOXLRrU45GJbNPF0/edit?usp=sharing

This is the link to our project video demo: https://drive.google.com/file/d/19YY1dBt0t29NtIZGjQsZivuKwfy9IOkC/view?usp=sharing

Comments


  • Arvind Saraf (@arvinds)

    Quick question (apologies, I haven't used Switch Transformers yet): MoE usually struggles with context across individual experts. If the text has a mix of, say, Hindi and Kannada, how will the routing be handled, since different parts of the output may get tokens from different LLMs? How are they combined?

    Posted 1 year ago
    • Akash Kamalesh (@asphytheghoul), Submitter

      Hello Arvind! This is a very interesting case and is quite probable in real inputs. There are two possible cases here (we are still exploring alternatives, but this is what we have in mind). If a user types a mix of, say, Kannada and English, the query is converted to an embedding and the router outputs a probability distribution across the experts. This will require rigorous training so that the router understands the task the user wants to perform and routes it to the appropriate expert. If the input mixes English, romanized Kannada, and Kannada, the model should handle it because the adapters are trained on such data, and initial testing on our end shows plausible results for translation among the three languages in any combination.

      The harder issue is when two primary languages, say Hindi and Kannada, are combined in a single prompt. We have developed a language identification model that identifies the language of each sentence and uses this information to route its tokens to the appropriate adapter for that language. So an input whose sentences are in a mix of languages should be handled appropriately, although we can only confirm this after we finish training and have our results! The remaining problem is an input sentence that itself contains tokens from different languages; we have not yet decided how to handle that case, and it is currently under research.

      Posted 1 year ago
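
A minimal sketch of the sentence-level routing described in the reply above; the script-based language check and adapter names are illustrative stand-ins for the trained language-identification model and the project's actual adapters.

```python
# Toy sentence-level language routing: identify each sentence's language and
# pair it with a per-language adapter. All names here are hypothetical.
import re

ADAPTERS = {"kan": "kannada_adapter", "hin": "hindi_adapter", "eng": "english_adapter"}

def identify_language(sentence: str) -> str:
    """Toy language ID via Unicode script ranges (a trained classifier in practice)."""
    if re.search(r"[\u0C80-\u0CFF]", sentence):   # Kannada block
        return "kan"
    if re.search(r"[\u0900-\u097F]", sentence):   # Devanagari block
        return "hin"
    return "eng"                                  # default: English / romanized text

def route(prompt: str) -> list[tuple[str, str]]:
    """Split the prompt into sentences and pair each with its target adapter."""
    sentences = [s for s in re.split(r"(?<=[.!?।])\s+", prompt.strip()) if s]
    return [(s, ADAPTERS[identify_language(s)]) for s in sentences]

mixed = "ನೀವು ಹೇಗಿದ್ದೀರಿ? मैं ठीक हूँ। Thanks for asking."
for sentence, adapter in route(mixed):
    print(f"{adapter}: {sentence}")
```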
  • Akshobhya (@akshobhya_j), Editor & Promoter

    @asphytheghoul, @Anirudh, thank you for your proposal submission to The Fifth Elephant Open Source AI Hackathon. The proposal addresses the need for a more efficient and versatile modeling strategy for Large Language Models (LLMs) to adapt to different languages and domains. The implementation of Mixture of Experts (MoE) to create domain-adaptive pretrained models for specific languages demonstrates an innovative approach to enhancing model performance. This submission needs to be updated based on the following considerations.

    Technical Suggestions

    1. Base Model and Tokenization
    • Utilizing the LLaMa-2 7B model as the base model for pre-training and customizing BPE SentencePiece tokenizers for Hindi and Kannada is a strategic approach.
    • However, an extensive evaluation of the effectiveness of this tokenizer extension and vocabulary modification is necessary.
    2. Pre-training Tasks
    • The selection of machine translation, context learning, question answering, reasoning, and text classification as pre-training tasks offers a diverse set of challenges for the model.
    • Ensuring a balanced distribution of resources and attention across these tasks will be crucial for comprehensive model learning.
    3. Mixture of Experts Framework
    • The incorporation of the Switch Transformer's routing algorithm for the MoE setup is commendable, offering a robust mechanism for assigning tokens to different experts based on language and domain.
    • However, the detailed methodology for aggregating outputs and ensuring coherence across languages and domains should be thoroughly outlined.
    4. Verifying the efficiency and effectiveness of the MoE architecture specifically for LLMs and multilingual applications is crucial. Robust experimentation and comparative analysis with traditional ensemble techniques could provide valuable insights.
    5. Ethical Considerations and Deployment
    • Integrating measures to prevent hateful speech generation showcases a responsible and ethical stance.
    • Communicating the specifics of these measures and ensuring comprehensive adherence to ethical guidelines will be crucial, especially in multilingual and cross-domain scenarios.
    6. Detailed deployment plans onto cloud service providers, considering scalability, accessibility, and security, should be articulated to ensure a seamless transition from research to practical implementation.

    Closing Thoughts

    The proposal presents an ambitious and innovative approach to addressing the adaptability and performance of LLMs across diverse languages and domains. Enhancing the transparency and depth of technical methodologies, thorough validation of extensions and frameworks, and a holistic approach to ethical considerations and deployment will be pivotal in realizing the potential of this groundbreaking initiative. We look forward to witnessing the outcome of this promising endeavor.

    → Utilize the available platforms such as The Fifth Elephant WhatsApp group to engage with mentors and seek guidance on technical and implementation aspects of your project.

    Posted 1 year ago