Decoding Llama3: An explainer for tinkerers
A not-so-quick 7-part guide to using the Llama3 open source AI model
On 18th April 2024, Meta introduced pre-trained and instruction-fine-tuned language models with 8B and 70B parameters for a broad range of use cases, including improved reasoning.
For more details, refer to the official launch blog post.
What sets these models apart is their performance for their size. The Llama3-8B model crushes existing 7B models like Gemma-7B and Mistral-7B by a good margin on standard benchmarks, and even outperforms much bigger models such as GPT-3.5 Turbo, Gemini Pro, and Mistral Next on generic human evaluations. On English-only human evaluations it went on to beat models roughly 200x its size, such as GPT-4 (first version, rumoured to be 8x220B), Bard, and Claude 3 Opus.
This is insane performance for models of this size, achieved by training roughly 75x past Chinchilla’s compute-optimal point (projected training). The takeaway is that models can learn much more when trained for longer, and that possibly all current models are undertrained.
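As a rough sanity check on that claim, here is a back-of-the-envelope calculation (not from the original post) using the commonly quoted ~20 tokens-per-parameter Chinchilla heuristic; the exact multiple depends on which scaling-law fit you use.

```python
# Rough back-of-the-envelope check, assuming the ~20 tokens-per-parameter
# Chinchilla rule of thumb (the exact multiple depends on the scaling-law fit).
params = 8e9            # Llama3-8B parameter count
tokens_trained = 15e12  # ~15T pre-training tokens

chinchilla_optimal_tokens = 20 * params             # ~160B tokens
ratio = tokens_trained / chinchilla_optimal_tokens  # ~90x+

print(f"Chinchilla-optimal tokens: {chinchilla_optimal_tokens:.2e}")
print(f"Llama3-8B was trained roughly {ratio:.0f}x past that point")
```

This lands in the same ballpark as the ~75x figure above; the precise multiple depends on how the compute-optimal point is estimated.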
The Llama3 architecture is largely the same as Llama2, but differs significantly from the original Transformer model (Llama architecture diagram by Umar Jamil).
Llama3’s great performance is attributed to longer training on extremely high-quality data. Although the architecture itself remains largely unchanged, there are a few key changes in the training recipe:
Pre-training was done entirely on publicly available data: a high-quality dataset of about 15 trillion tokens.
Post-training applied a combination of Supervised Fine-Tuning (SFT), Rejection Sampling, Proximal Policy Optimisation (PPO), and Direct Preference Optimisation (DPO); a minimal DPO sketch follows below.
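Meta has not yet published the exact post-training recipe, but to give a flavour of the last technique, here is a minimal sketch of the standard DPO objective (Rafailov et al.) in PyTorch. The function name and tensor values are illustrative, and this is not Meta’s implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the policy to prefer the chosen response
    over the rejected one, relative to a frozen reference model.
    Inputs are per-example summed log-probabilities of each response."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log(sigmoid(x)) shrinks as the chosen response becomes increasingly
    # preferred over the rejected one.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy usage with made-up log-probabilities for two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss)  # a single scalar loss value
```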
Meta has released the model weights along with a GitHub repo.
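If you want to poke at the released model before digging into the architecture code, one convenient route is the Hugging Face transformers API rather than the repo’s own inference scripts. The snippet below is a sketch under that assumption; the model id, gated-access requirements, and transformers usage are not part of the original post.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes you have accepted Meta's license and been granted access to the
# gated checkpoint on the Hugging Face Hub.
# device_map="auto" requires the accelerate package to be installed.
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain rotary positional embeddings in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```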
We will learn more about Meta’s training and alignment process when the paper is out, but until then the best way to learn about Llama3 is to understand the architecture by digging into the code itself.
Decoding Llama3 is a 7-part microblog series that aims to break down the Llama3 architecture released by Meta. We are going to dive deep into the internals of the architecture through the code. The details are split concept-wise into the microblogs below:
Part 1 - Intro to Llama3
Part 2 - Understanding the configuration
Part 3 - Normalisation
Part 4 - Rotary Positional Embeddings
Part 5 - Grouped Query Attention
Part 6 - Feed Forward Network
Part 7 - Transformer Block
The next microblog will cover configuration details of the Llama3 model - Decoding Llama3: Part 2 - Understanding the configuration.