The Fifth Elephant Open Source AI Hackathon 2024

GenAI makers and creators contest and showcase

Akash Kamalesh

@asphytheghoul

Anirudh Lakhotia

Tanistha Hota

@tanisthahota

Multilingual Mixture of Experts for Domain Adaptive Pre-training of Large Language Models

Submitted Jan 23, 2024

baraat

Problem Statement

  • We aim to improve how pre-trained language models adapt to different languages and domains.
  • Our goal is to create a group of domain-adaptive pre-trained models, each specializing in one language, by applying Mixture of Experts (MoE) in our Domain Adaptive Pre-training.
  • This approach lets us combine the strengths of the individual models and improve their performance across various domains.
  • We seek a more efficient and versatile modeling strategy for LLMs than current techniques offer.
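The Mixture of Experts idea above can be illustrated with a minimal sketch. The submission does not say how routing is done, so this assumes generic top-k softmax gating: a router scores each expert, the k best are kept, and their outputs are blended by renormalized weights. The names `top_k_route` and `moe_forward`, the toy experts, and the hard-coded gate scores are all hypothetical; in a trained model the scores would come from a learned router and each expert would be a language-specific network.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(
    x: Vector,
    experts: Sequence[Callable[[Vector], Vector]],
    gate_logits: Sequence[float],
    k: int = 2,
) -> Vector:
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(gate_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy stand-ins for per-language experts (assumption: real experts are LLMs).
experts = [
    lambda v: [2.0 * t for t in v],
    lambda v: [t + 1.0 for t in v],
    lambda v: [-t for t in v],
]

gate_logits = [2.0, 1.0, -3.0]  # hypothetical router scores for one input
routes = top_k_route(gate_logits, k=2)
output = moe_forward([1.0, 2.0], experts, gate_logits, k=2)
```

Only the selected experts run for a given input, which is where the efficiency of an MoE setup comes from: capacity grows with the number of experts while per-input compute stays close to that of k experts.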

Roadmap

Project Baarat is an open-source initiative to leverage the power of LLMs on Indic-NLP tasks. We aim to build continually pre-trained, task-specific language models in a Mixture of Experts (MoE) setup. We plan to build a multilingual and cross-lingual LLM that is:

  1. Pre-trained on a large text corpus drawn from varied sources of knowledge, including crawled Wikipedia articles, textbooks, news, social media sites, and magazines.

  2. Fine-tuned on different downstream tasks. We first train a 7B Llama 2 model on a text corpus in the target language and save it as a base model. The fine-tuning process will then incorporate the following downstream tasks:

  • Machine Translation
  • Mathematical and Logical Reasoning
  • Question Answering
  • Instruct Fine-Tuning
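The two-stage roadmap above can be written down as a small training plan. This is only a sketch of the ordering described (one continually pre-trained base model per language, then one fine-tune per downstream task starting from that base). The `Run` record, the `build_plan` helper, the language codes, and the `"llama-2-7b"` label are hypothetical placeholders, not names from the project repository.

```python
from dataclasses import dataclass
from typing import List, Optional

# Downstream tasks listed in the roadmap.
TASKS = [
    "machine_translation",
    "math_logical_reasoning",
    "question_answering",
    "instruct_fine_tuning",
]


@dataclass(frozen=True)
class Run:
    name: str             # name of the checkpoint this run produces
    stage: str            # "continual_pretrain" or "fine_tune"
    language: str
    task: Optional[str]   # None for the pre-training stage
    init_from: str        # checkpoint the run starts from


def build_plan(languages: List[str], base_checkpoint: str = "llama-2-7b") -> List[Run]:
    """Order the runs so every fine-tune follows its language's base model."""
    plan: List[Run] = []
    for lang in languages:
        base_name = f"{lang}-base"
        # Stage 1: continue pre-training the 7B model on target-language text.
        plan.append(Run(base_name, "continual_pretrain", lang, None, base_checkpoint))
        # Stage 2: fine-tune the saved base model once per downstream task.
        for task in TASKS:
            plan.append(Run(f"{lang}-{task}", "fine_tune", lang, task, base_name))
    return plan


# Assumption: two example languages; the roadmap does not name specific ones.
plan = build_plan(["hi", "kn"])
for run in plan:
    print(f"{run.stage:20s} {run.name:28s} <- {run.init_from}")
```

Saving the base model once per language means the expensive pre-training step is shared by all four task-specific fine-tunes.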

Key Features ✨

  • Tokenizers for Indian Languages: Robust tokenization tools tailored for the unique structures of regional Indian languages.
  • Fine-tuned Language Models: Leveraging the power of Large Language Models (LLMs) fine-tuned for Indian languages to understand and generate text with high accuracy.
  • Collection of models and data: A completely free and open-source hub of datasets and models, packaged into a single Python module with documentation for easy usage.
  • High Quality Datasets: Take a look at our suite of cleaned datasets ready for your own downstream training purposes.
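One reason language-specific tokenizers matter can be shown with the standard library alone. Devanagari code points take three bytes each in UTF-8, so a tokenizer that falls back to raw bytes for characters missing from its vocabulary can spend up to three tokens per code point on Hindi text. The sketch below only counts code points and bytes; `utf8_cost` is a hypothetical helper and this is not the project's tokenizer.

```python
from typing import Dict


def utf8_cost(text: str) -> Dict[str, int]:
    """Count Unicode code points and UTF-8 bytes in a string."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
    }


english = utf8_cost("hello")

# "namaste" in Devanagari, written with escapes so the code points are explicit:
# NA, MA, SA, VIRAMA, TA, VOWEL SIGN E
hindi = utf8_cost("\u0928\u092e\u0938\u094d\u0924\u0947")

print(english)  # {'code_points': 5, 'utf8_bytes': 5}
print(hindi)    # {'code_points': 6, 'utf8_bytes': 18}
```

A vocabulary that covers common Indic characters and subwords avoids this blow-up, which shortens sequences and lowers training and inference cost for the same text.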

GitHub Repository

https://github.com/asphytheghoul/Baarat

Proposed Solution (subject to changes)

Architecture

Presentation and Demo Video

Presentation: https://docs.google.com/presentation/d/1in4MhQkY6N5SnO-PJ9OVhVIe9K6jOXLRrU45GJbNPF0/edit?usp=sharing

Project video demo: https://drive.google.com/file/d/19YY1dBt0t29NtIZGjQsZivuKwfy9IOkC/view?usp=sharing

Hosted by

The Fifth Elephant hackathons
