ChatGPT caught the public imagination, and for the first time non-technical people could experience generative AI. This led to a surge of interest in developing safe applications of LLMs, as well as domain-specific or open-source alternatives to ChatGPT. Notable among them is LLaMA 2, an LLM open-sourced by Meta. Its release catalyzed the development of tasks, tools, and assets ranging from datasets to new models to new applications. Phi-2, an SLM released by Microsoft, also showed that (relatively) small models can compete with large ones while being trained and served at substantially lower cost. However, there are some challenges.
- The majority of the LLMs we see today, if not all, are based on proven Transformer architectures. Transformers have quadratic complexity in the number of input tokens, and are therefore slow to train and to run inference with. As a result, new memory- and compute-efficient attention mechanisms have sprung up, along with engineering hacks. But at the end of the day, these are still Transformer-based architectures.
- The majority, with the exception of some Chinese LLMs, are English-centric; other languages have only a token representation (no pun intended).
- Often, LLMs are tied to a particular tokenizer -- which makes extending them to other languages/domains hard.
- Developing SLMs or LLMs is still a compute-heavy problem. Therefore, only big corporations with deep pockets, massive talent concentration, and GPU farms can afford to build such models.
In this hackathon, we would like to address the above challenges:
- Develop a multilingual S4-based SLM on the Samanantar dataset
- Decentralise the training and development of SLMs/LLMs via a simple federated learning framework
- Tokenizer-free: no subword tokenizers. Operate directly on bytes, as ByT5 does, so that models can be trained end-to-end. However, byte-level tokens increase the context length, which puts Transformer-based architectures at a disadvantage (see the byte-encoding sketch after this list).
- RoPE embeddings: byte-level tokens (several per Unicode character) increase the context length, so we have to handle extended contexts via RoPE embeddings (or other schemes); a minimal RoPE sketch also follows the list.
- Mamba: S4 models (structured state space sequence models) are now competing with Transformer-based models on sequence representation/classification problems. Replace the Transformer architecture with Mamba for efficient training and inference (which addresses several problems arising from the entanglement of Transformer architectures with subword tokenizers).
- Train the above model on a small multilingual Indic dataset
- Add LoRA adapters to Mamba (to fine-tune LLMs on modest resources, in both data and compute) -- a few research questions arise here!
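
To make the tokenizer-free point concrete, here is a minimal byte-level encoding sketch in the style of ByT5. The reserved special-token ids and the offset of 3 mirror ByT5's convention; the helper names are our own, not a library API.

```python
# Byte-level "tokenization": text maps directly to UTF-8 bytes, so there is
# no subword vocabulary to train, and any language is covered out of the box.
# Assumption: ids 0-2 are reserved for special tokens (as in ByT5), so raw
# byte values are shifted by 3.
OFFSET = 3  # <pad>=0, </s>=1, <unk>=2

def encode(text: str) -> list[int]:
    """Encode any Unicode string as shifted UTF-8 byte ids."""
    return [b + OFFSET for b in text.encode("utf-8")]

def decode(ids: list[int]) -> str:
    """Invert encode(); special-token ids are skipped."""
    return bytes(i - OFFSET for i in ids if i >= OFFSET).decode("utf-8", errors="ignore")

hindi = "नमस्ते"
ids = encode(hindi)
print(len(hindi), len(ids))  # 6 characters -> 18 tokens: the context-length blow-up
assert decode(ids) == hindi
```

The last line shows exactly the cost noted above: Devanagari text becomes a 3x longer sequence, which quadratic attention amplifies but a linear-time model like Mamba tolerates.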
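And a minimal sketch of rotary position embeddings in the "rotate-half" formulation used by GPT-NeoX-style implementations; the function and its shapes are illustrative, not any specific library's API.

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings to x of shape (seq_len, dim), dim even."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-2.0 * np.arange(half) / dim)   # theta_i = base^(-2i/dim)
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]                # pair dim i with dim i + half
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = rope(np.random.randn(18, 64))  # e.g. the 18 byte positions from the sketch above
```

Because positions enter only through these relative rotations, context-extension tricks (e.g. increasing `base`) operate on this one function, which is what makes RoPE attractive for the longer byte-level sequences.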
- Implement a Client-Server Architecture for Federated Learning (no emphasis on privacy at this time, as the datasets used will be public); both sides are sketched in code below
- Client side
  1. Downloads the latest pre-trained model and a small dataset, and initialises the adapter
  2. Fine-tunes the adapter on a small subset of the data
  3. Pushes the adapter back to the hub (which the server can access)
- Server side
  1. Issues a client a pre-trained model, an initialised adapter, and a small dataset, depending on the compute budget (FLOPs and time)
  2. Merges the submitted adapters with the base model
  3. Does continual pre-training of the merged base
  4. Checkpoints the pre-trained model
This simple, transactional federated learning framework, memoryless on the client side, democratises the training (and development) of LLMs/SLMs.
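
A minimal sketch of one client round under the assumptions above (LoRA via PEFT, the HF Hub as the exchange point). The repo ids, `train_adapter`, and the LoRA target module names are hypothetical placeholders.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

def client_round(client_id: str, local_dataset, train_adapter):
    # 1. Download the latest pre-trained base model issued by the server.
    base = AutoModelForCausalLM.from_pretrained("our-org/base-slm")  # hypothetical repo
    # 2. Initialise a fresh LoRA adapter; the target modules below are our
    #    guess at Mamba's projection layers -- inspect the model to confirm.
    cfg = LoraConfig(r=8, lora_alpha=16,
                     target_modules=["in_proj", "x_proj", "dt_proj", "out_proj"])
    model = get_peft_model(base, cfg)
    # 3. Fine-tune only the adapter weights on the client's small data shard.
    train_adapter(model, local_dataset)
    # 4. Push just the adapter (a few MB, not the full model) back to the hub.
    model.push_to_hub(f"our-org/adapter-{client_id}")  # hypothetical repo
```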
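And the matching server round, assuming PEFT's adapter merging; `continual_pretrain` and the repo ids are again placeholders.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

def server_round(adapter_repos: list[str], continual_pretrain):
    base = AutoModelForCausalLM.from_pretrained("our-org/base-slm")  # hypothetical repo
    # Fold each submitted LoRA adapter into the base weights.
    for repo in adapter_repos:
        base = PeftModel.from_pretrained(base, repo).merge_and_unload()
    # Continual pre-training of the merged base on fresh data.
    continual_pretrain(base)
    # Checkpoint: the pushed model becomes the base for the next round.
    base.push_to_hub("our-org/base-slm")
```

Since clients keep no state between rounds, any policy question (who gets which shard, which adapters get merged) lives entirely on the server side of this loop.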
- Mamba is supported in HF (in the Transformers library)
- SFT and model merging with the PEFT library
- Submitting adapters and sharing pre-training checkpoints via the HF Hub
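
For instance, loading Mamba through Transformers takes a few lines. The checkpoint id below is the community "state-spaces" HF-format port and is an assumption; any HF-format Mamba checkpoint would do.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "state-spaces/mamba-130m-hf"  # assumed HF-format Mamba checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generation runs in time linear in sequence length, unlike attention.
inputs = tok("Namaste", return_tensors="pt").input_ids
print(tok.decode(model.generate(inputs, max_new_tokens=20)[0]))
```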
With community participation (particularly the student community), we want to understand:
- How do S4 models compare against Transformer-based models along dimensions such as scaling laws and compute efficiency?
- How does performance change as a function of model size?
- How do training and inference time change as a function of vocabulary size?
- How is cross-lingual transfer affected as a function of each language's representation in the data?
- What federated learning policy is suitable for distributed training of LLMs with respect to data partitions, adapters, and model merging?
- Many more questions will come up in due course as we encounter new problems and challenges!