The Announcement

On 18th April 2024, Meta introduced pre-trained and instruction-fine-tuned language models with 8B and 70B parameters for a broad range of use cases, including improved reasoning.

For more details, refer to the official launch blog post.

Benchmarks got crushed

What sets these models apart is their performance for their size. The Llama3-8B model crushes existing 7B models like Gemma-7B and Mistral-7B by a good margin on benchmarks, and on generic HumanEvals it even outperforms bigger models like GPT-3.5 Turbo, Gemini Pro, and Mistral Next. In English-only HumanEval, it went on to beat models approximately 200x its size, like GPT-4 (first version, rumored to be 8x220B), Bard, and Claude3 Opus.

This is insane performance for models of a much smaller size. Training went past the Chinchilla-optimal point by roughly 75X (projected training). This suggests that models can learn much more by training longer, and that possibly all of the current models are undertrained.
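To make the 75X figure concrete, here is a small back-of-the-envelope sketch. It assumes a Chinchilla-optimal budget of roughly 200B tokens for an 8B-parameter model (the approximate figure quoted in Meta's launch post) and the 15 trillion training tokens mentioned in this post. The function and constant names are made up for illustration.

```python
# Back-of-the-envelope check of the "75X past Chinchilla" claim.
# Assumption: ~200B tokens is the Chinchilla-optimal budget for an 8B model.
LLAMA3_TRAINING_TOKENS = 15 * 10**12      # 15 trillion tokens
CHINCHILLA_OPTIMAL_TOKENS_8B = 200 * 10**9  # ~200 billion tokens (assumed)
MODEL_PARAMS_8B = 8 * 10**9               # nominal parameter count


def overtraining_ratio(training_tokens: int, optimal_tokens: int) -> float:
    """How many times the training budget exceeds the compute-optimal one."""
    return training_tokens / optimal_tokens


def tokens_per_parameter(tokens: int, params: int) -> float:
    """Average number of training tokens seen per model parameter."""
    return tokens / params


ratio = overtraining_ratio(LLAMA3_TRAINING_TOKENS, CHINCHILLA_OPTIMAL_TOKENS_8B)
optimal_tpp = tokens_per_parameter(CHINCHILLA_OPTIMAL_TOKENS_8B, MODEL_PARAMS_8B)
actual_tpp = tokens_per_parameter(LLAMA3_TRAINING_TOKENS, MODEL_PARAMS_8B)

print(f"Overtraining ratio: {ratio:.0f}x")
print(f"Tokens per parameter: {optimal_tpp:.0f} (optimal) vs {actual_tpp:.0f} (actual)")
```

Under these assumptions, the 8B model sees 1875 tokens per parameter instead of about 25.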

Model architecture remains more or less the same

The Llama3 architecture is largely the same as Llama2 but differs significantly from the original transformer model (Llama architecture diagram drawn by Umar Jamil).

Llama architecture diagram

Llama is a decoder-only architecture for text generation.
The normalization blocks are placed before the attention and feed-forward blocks.
Positional embeddings are different: Llama uses Rotary Positional Embeddings.
The attention mechanism is different: Llama uses Grouped Query Attention (GQA).
The activation function in the Feed-Forward Network (FFN) is SwiGLU.
The FFN’s hidden dimension is enlarged to make up for the parameters saved by GQA.
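The last two points can be sketched with some simple arithmetic. The snippet below is an illustrative sketch, not Meta's code: the helper names are hypothetical, and the numbers (model dimension 4096, 32 query heads, 8 key/value heads, `multiple_of` 1024, `ffn_dim_multiplier` 1.3) are assumed from the published 8B configuration. It shows how query heads share key/value heads, how many attention parameters that saves per layer, and how the FFN hidden dimension is scaled up.

```python
# Illustrative values, assumed from the published Llama3-8B configuration.
DIM = 4096
N_HEADS = 32
N_KV_HEADS = 8
MULTIPLE_OF = 1024
FFN_DIM_MULTIPLIER = 1.3


def kv_head_for_query_head(q_head: int, n_heads: int = N_HEADS,
                           n_kv_heads: int = N_KV_HEADS) -> int:
    """Index of the shared key/value head used by a given query head."""
    n_rep = n_heads // n_kv_heads  # query heads per key/value head
    return q_head // n_rep


def attention_projection_params(dim: int, n_heads: int, n_kv_heads: int) -> int:
    """Parameter count of the four attention projections in one layer."""
    head_dim = dim // n_heads
    wq = dim * n_heads * head_dim
    wk = dim * n_kv_heads * head_dim
    wv = dim * n_kv_heads * head_dim
    wo = n_heads * head_dim * dim
    return wq + wk + wv + wo


def ffn_hidden_dim(dim: int, multiple_of: int, ffn_dim_multiplier=None) -> int:
    """Hidden size of the SwiGLU FFN, rounded up to a multiple of `multiple_of`."""
    hidden = int(2 * (4 * dim) / 3)  # 2/3 of the usual 4x expansion
    if ffn_dim_multiplier is not None:
        hidden = int(ffn_dim_multiplier * hidden)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


mha_params = attention_projection_params(DIM, N_HEADS, N_HEADS)      # every head has its own K/V
gqa_params = attention_projection_params(DIM, N_HEADS, N_KV_HEADS)   # K/V heads are shared
saved_per_layer = mha_params - gqa_params

base_hidden = ffn_hidden_dim(DIM, MULTIPLE_OF)
scaled_hidden = ffn_hidden_dim(DIM, MULTIPLE_OF, FFN_DIM_MULTIPLIER)

print("Query heads 0..7 use KV heads:", [kv_head_for_query_head(h) for h in range(8)])
print("Attention params saved per layer:", saved_per_layer)
print("FFN hidden dim:", base_hidden, "->", scaled_hidden)
```

With these assumed values, each group of four query heads shares one key/value head, and the FFN hidden dimension grows from 11264 to 14336.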

Key differences

Llama3’s great performance is attributed to longer training with extremely high-quality data. Although the Llama3 architecture remains the same, there are a few key changes.

Vocabulary: Llama3 uses a tokenizer with a vocabulary of 128k tokens, compared to the 32k used by Llama2. The 4X larger vocabulary helps encode language much more efficiently.
Tokenizer: Llama3 uses a tiktoken-based tokenizer, compared to the SentencePiece tokenizer used in Llama2.
Grouped Query Attention: GQA is used in both model variants, whereas Llama2 used it only for the 70B model.
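A larger vocabulary is not free: the token embedding table grows with it. The rough sketch below assumes a model dimension of 4096 for both models and exact vocabulary sizes of 32,000 (Llama2) and 128,256 (Llama3), taken from the published configurations. The helper name is hypothetical.

```python
# Rough estimate of how the vocabulary change affects the embedding table.
# Assumptions: exact vocab sizes from the published configs, model dim 4096.
LLAMA2_VOCAB = 32_000
LLAMA3_VOCAB = 128_256
DIM = 4096


def embedding_params(vocab_size: int, dim: int) -> int:
    """Number of parameters in a vocab_size x dim embedding table."""
    return vocab_size * dim


llama2_embed = embedding_params(LLAMA2_VOCAB, DIM)
llama3_embed = embedding_params(LLAMA3_VOCAB, DIM)
extra_params = llama3_embed - llama2_embed

print(f"Vocabulary growth: {LLAMA3_VOCAB / LLAMA2_VOCAB:.2f}x")
print(f"Llama2 embedding table: {llama2_embed:,} parameters")
print(f"Llama3 embedding table: {llama3_embed:,} parameters")
print(f"Extra parameters: {extra_params:,}")
```

Under these assumptions the embedding table alone adds roughly 0.4B parameters, and an output projection of the same shape would add as many again.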

Pre-training was done on a high-quality dataset of 15 trillion tokens, collected entirely from publicly available sources.

Post-training applied a combination of Supervised Fine-Tuning (SFT), Rejection Sampling, Proximal Policy Optimisation (PPO) and Direct Preference Optimisation (DPO).

Artifacts released

Meta has released the models along with a GitHub repo.

We will learn more about Meta’s training and alignment process when the paper is out. Until then, the best way to learn about Llama3 is to understand the architecture by digging into the code itself.

Decoding Llama3 microblogs series - What to expect

Decoding Llama3 is a 7-part microblog series that aims to break down the Llama3 architecture released by Meta. We are going to dive deep into the internals of the architecture through the code. The details are split by concept into the microblogs below:

Part 1 - Intro to Llama3
Part 2 - Understanding the configuration
Part 3 - Normalisation
Part 4 - Rotary Positional Embeddings
Part 5 - Grouped Query Attention
Part 6 - Feed Forward Network
Part 7 - Transformer Block

Up next>> Part 2 - Understanding the configuration

The next microblog will cover configuration details of the Llama3 model - Decoding Llama3: Part 2 - Understanding the configuration.
