Dec 2023

19 Tue 05:30 PM – 06:30 PM IST

Jan 2024

5 Fri 05:30 PM – 07:20 PM IST

8 Mon 06:00 PM – 06:55 PM IST

10 Wed 06:00 PM – 07:00 PM IST

12 Fri 06:00 PM – 07:30 PM IST

13 Sat 03:00 PM – 06:00 PM IST

27 Sat 05:00 PM – 05:45 PM IST

Feb 2024

3 Sat 10:00 AM – 06:25 PM IST

7 Wed 08:15 PM – 09:00 PM IST

12 Mon 08:15 PM – 09:00 PM IST

13 Tue 08:15 PM – 09:00 PM IST

14 Wed 08:15 PM – 09:00 PM IST

15 Thu 08:15 PM – 09:00 PM IST

16 Fri 07:30 PM – 08:30 PM IST

17 Sat 08:15 PM – 09:00 PM IST

21 Wed 08:30 PM – 09:15 PM IST

Mar 2024

9 Sat 07:00 PM – 09:00 PM IST

10 Sun 04:00 PM – 06:00 PM IST

Apr 2024

12 Fri 12:00 PM – 06:25 PM IST

Hasura, Bangalore

Multilingual Mixture of Experts for Domain Adaptive Pre-training of Large Language Models

Submitted Jan 23, 2024

Category: AI for multilingual

baraat

Problem Statement

We aim to improve how language models adapt to different languages and domains.
Our goal is to build a group of domain-adaptive pre-trained models, each specializing in a single language, by applying Mixture of Experts (MoE) during Domain Adaptive Pre-training.
This approach lets us combine the strengths of the individual models and improve their performance across domains.
The intended outcome is a more efficient and versatile modeling strategy for LLMs.
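
To make the MoE idea concrete, here is a minimal sketch of top-k gating over a set of experts. It is an illustration only, not the project's implementation: the three toy experts (`hindi_expert`, `tamil_expert`, `bengali_expert`), the hand-written router scores, and the choice of softmax top-k routing are all assumptions. In a real model the experts would be neural networks and the scores would come from a learned router.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(scores: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Keep the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_output(x: Vector, experts: Sequence[Expert],
               router_scores: Sequence[float], k: int = 2) -> Vector:
    """Weighted sum of the outputs of the k selected experts."""
    out = [0.0] * len(x)
    for idx, weight in top_k_gate(router_scores, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Hypothetical stand-ins for per-language expert models.
def hindi_expert(x: Vector) -> Vector:
    return [2.0 * v for v in x]


def tamil_expert(x: Vector) -> Vector:
    return [v + 1.0 for v in x]


def bengali_expert(x: Vector) -> Vector:
    return [-v for v in x]


experts = [hindi_expert, tamil_expert, bengali_expert]
# Hand-written scores; a learned router would produce these per input.
mixed = moe_output([1.0, 2.0], experts, router_scores=[2.0, 1.0, 0.1], k=2)
print(mixed)
```

Only the selected experts are evaluated, which is what keeps the compute cost of an MoE layer below that of running every expert on every input.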

Roadmap

Project Baarat is an open-source initiative to leverage the power of LLMs on Indic-NLP tasks. We aim to build continually pre-trained, task-specific language models in a Mixture of Experts (MoE) setup. We plan to build a multilingual and cross-lingual LLM that is:

Pre-trained on a large text corpus drawn from varied sources of knowledge, including crawled Wikipedia articles, textbooks, news, social media sites, and magazines.
Fine-tuned on different downstream tasks. We first train a 7B Llama 2 model on a text corpus in the target language and save it as a base model. The fine-tuning process will then incorporate the following downstream tasks:

Machine Translation
Mathematical and Logical Reasoning
Question Answering
Instruct Fine-Tuning
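
The two-stage roadmap above can be sketched as a small training plan: one continual pre-training stage per language, followed by one fine-tuning stage per downstream task. This is a planning sketch under stated assumptions, not the project's code. The `Stage` class, `build_plan`, the checkpoint label `llama-2-7b`, and the dataset labels are hypothetical names, and it assumes each fine-tuning stage starts from the saved per-language base model.

```python
from dataclasses import dataclass
from typing import List

# Downstream tasks listed in the roadmap.
DOWNSTREAM_TASKS = [
    "machine_translation",
    "mathematical_and_logical_reasoning",
    "question_answering",
    "instruct_fine_tuning",
]


@dataclass(frozen=True)
class Stage:
    name: str         # label of the checkpoint this stage produces
    kind: str         # "continual_pretraining" or "fine_tuning"
    starts_from: str  # label of the checkpoint the stage is initialized from
    data: str         # hypothetical dataset label


def build_plan(language: str, base_checkpoint: str = "llama-2-7b") -> List[Stage]:
    """Return the ordered stages for one target language."""
    base_name = f"{language}-base"
    plan = [
        Stage(base_name, "continual_pretraining", base_checkpoint,
              f"{language}-text-corpus")
    ]
    # Assumption: every task-specific model is fine-tuned from the saved base model.
    for task in DOWNSTREAM_TASKS:
        plan.append(
            Stage(f"{language}-{task}", "fine_tuning", base_name,
                  f"{language}-{task}-data")
        )
    return plan


for stage in build_plan("hindi"):
    print(f"{stage.kind:22s} {stage.starts_from} -> {stage.name}")
```

Under this reading, each language yields one base model plus four task-specific models, which are natural candidates for the experts in an MoE setup.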

Key Features ✨

Tokenizers for Indian Languages: Robust tokenization tools tailored for the unique structures of regional Indian languages.
Fine-tuned Language Models: Large Language Models (LLMs) fine-tuned for Indian languages to understand and generate text with high accuracy.
Collection of models and data: A completely free and open-source hub of datasets and models, all packaged into a single Python module with documentation for easy usage.
High-Quality Datasets: Take a look at our suite of cleaned datasets, ready for your own downstream training purposes.
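
One reason Indian languages benefit from tailored tokenizers: in scripts such as Devanagari, a single written syllable can span several Unicode code points, and each of those code points takes three bytes in UTF-8. The sketch below groups code points into syllable-like clusters using a deliberately simplified rule. It is not the project's tokenizer, and the `aksharas` helper is a hypothetical name; full segmentation follows the Unicode grapheme cluster rules, which handle more cases.

```python
import unicodedata
from typing import List

VIRAMA_CLASS = 9  # canonical combining class Unicode assigns to virama signs


def aksharas(text: str) -> List[str]:
    """Group code points into syllable-like clusters (simplified rule).

    A combining mark joins the previous cluster, and a letter that directly
    follows a virama joins it too, forming a conjunct.
    """
    clusters: List[str] = []
    for ch in text:
        category = unicodedata.category(ch)
        is_mark = category in ("Mn", "Mc")
        joins_conjunct = (
            bool(clusters)
            and category == "Lo"
            and unicodedata.combining(clusters[-1][-1]) == VIRAMA_CLASS
        )
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# The Hindi word "namaste", written with explicit escapes:
# NA, MA, SA, VIRAMA, TA, VOWEL SIGN E
word = "\u0928\u092e\u0938\u094d\u0924\u0947"

print(len(word))                  # 6 code points
print(len(word.encode("utf-8")))  # 18 bytes
print(len(aksharas(word)))        # 3 clusters
```

A byte-level vocabulary trained mostly on English text tends to split such words into many small pieces, so a tokenizer whose vocabulary covers common clusters can represent the same text in fewer tokens.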