Building Reinforcement Learning (RL) for LLMs from scratch

From policy gradients to training reasoning agents, live: a hands-on workshop


📘 Overview

This workshop demystifies how modern reasoning models like DeepSeek-R1 and Qwen are trained.

Participants will:

  • Build a minimal RL training pipeline from scratch (~500 lines of code)
  • Understand the core GRPO algorithm through code (not math)
  • Train a small model live to solve multi-turn tasks

Key takeaway

RL for LLMs is simpler than it looks.
At its core, it’s just weighted log probabilities.

By building everything from primitives, attendees leave with both conceptual clarity and working code they can extend.
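
To make that takeaway concrete, here is a minimal PyTorch sketch of the "weighted log probabilities" idea (illustrative only; the tensors are random stand-ins, not the workshop's actual code):

```python
import torch

# Illustrative stand-ins: per-token log-probs of sampled completions,
# and one scalar reward per completion.
logprobs = torch.randn(4, 16, requires_grad=True)      # (batch, seq_len)
rewards = torch.tensor([[1.0], [0.0], [1.0], [-1.0]])  # (batch, 1)

# The whole trick: weight each token's log-probability by the reward
# its sequence earned, and minimize the negative mean.
loss = -(rewards * logprobs).mean()
loss.backward()  # gradients push up tokens from high-reward sequences
```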


Target audience

Developers who have:

  • Built LLM agents (prompting, tool calling, chains)
  • Never touched the training/RL side

No ML research background required.


✅ Prerequisites

  • Familiarity with Python and basic PyTorch (tensors, forward pass)
  • Have built at least one LLM agent or chain
  • Laptop with Google Colab access (GPU provided)

What you’ll learn

  • How modern LLMs like DeepSeek-R1 and Qwen are trained to reason
  • The core RL algorithm (GRPO) behind most post-training today
  • How to design environments with verifiable rewards (RLVR)
  • How to train a small model to solve multi-turn tasks, live

Workshop outline


Part 1: The landscape (30 minutes)

  • Why RL for LLMs now? (The DeepSeek-R1 moment)
  • RL as “efficient in-context learning baked into weights”
  • Offline vs online: DPO vs PPO/GRPO
  • What is RLVR and why verifiable rewards matter

No code. Diagrams, intuition, the ‘why’ before the ‘how’.


Part 2: The core algorithm (45 minutes)

From REINFORCE to GRPO in three steps:

  1. Reward × log probability = gradient signal
  2. Baselines to reduce variance (group-relative trick)
  3. Clipping for stability

Live coding:

  • get_logprobs()
  • compute_advantages()
  • policy_gradient_loss()

GRPO implemented in ~50 lines.
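
As a preview, here is one possible shape for those three functions (a sketch under simplifying assumptions: a Hugging Face-style causal LM whose forward pass returns logits, group-wise sampling, and token-level clipping; the version built live may differ):

```python
import torch

def get_logprobs(model, input_ids, response_mask):
    """Per-token log-probs of the sampled tokens.

    Assumes a Hugging Face-style causal LM whose forward pass
    returns an object with a .logits attribute."""
    logits = model(input_ids).logits[:, :-1, :]    # logits predicting token t+1
    targets = input_ids[:, 1:]
    logprobs = torch.log_softmax(logits, dim=-1)
    token_logprobs = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logprobs * response_mask[:, 1:]   # zero out prompt tokens

def compute_advantages(rewards, group_size):
    """Group-relative baseline: normalize each reward within its
    sampling group (the trick that replaces PPO's value network)."""
    grouped = rewards.view(-1, group_size)
    mean = grouped.mean(dim=1, keepdim=True)
    std = grouped.std(dim=1, keepdim=True) + 1e-8
    return ((grouped - mean) / std).view(-1)

def policy_gradient_loss(logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate objective, applied per token."""
    ratio = torch.exp(logprobs - old_logprobs)
    adv = advantages.unsqueeze(-1)                 # broadcast over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()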


Part 3: Building blocks (45 minutes)

Core abstractions

  • Agent
  • Environment
  • Trajectory

Topics:

  • Sampling with logprobs
  • Designing verifiable rewards
  • The interaction loop (collect_trajectory(); sketched below)
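
A minimal sketch of what collect_trajectory() might look like (the act(), reset(), and step() methods here are hypothetical stand-ins for the Agent and Environment abstractions built in the workshop):

```python
def collect_trajectory(agent, env, max_turns=8):
    """Run one agent-environment episode and record what the
    RL update needs: actions, their log-probs, and rewards."""
    observation = env.reset()
    trajectory = []
    for _ in range(max_turns):
        # Sample an action and keep its log-probability for training.
        action, logprob = agent.act(observation)
        observation, reward, done = env.step(action)
        trajectory.append((action, logprob, reward))
        if done:
            break
    return trajectory
```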

Hands-on environments

  • Math reasoning (GSM8K-style, regex-verifiable; reward sketch after this list)
  • Wordle RLVR demo (multi-turn, partial credit)
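
For the math environment, a verifiable reward can be as small as a regex check. A sketch, assuming the model is prompted to end its answer with "#### <number>" (GSM8K's answer convention):

```python
import re

def math_reward(completion: str, gold_answer: str) -> float:
    """Binary verifiable reward for GSM8K-style problems.

    Assumes the model ends its response with '#### <number>'."""
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", completion)
    if match is None:
        return 0.0                                   # no parseable answer
    predicted = match.group(1).replace(",", "")
    return 1.0 if predicted == gold_answer.strip() else 0.0
```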

Part 4: Putting it together (40 minutes)

  • Wire up agent + environment + GRPO
  • Live training demo: Qwen-0.5B learns the task
  • Interpreting loss curves & reward trends
  • Common failure modes
  • Where production RL systems go next

Attendees train in the provided Colab notebook.
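
To show how the pieces compose, here is one possible outer loop (illustrative; the function names reuse the sketches above, and the provided notebook is the reference implementation):

```python
import torch

GROUP_SIZE = 8   # completions sampled per prompt for the group baseline

def train_step(agent, env, optimizer, num_prompts=4):
    """One GRPO-style update, wiring together the sketches above."""
    # 1. Roll out: GROUP_SIZE trajectories per prompt.
    trajectories = [collect_trajectory(agent, env)
                    for _ in range(num_prompts * GROUP_SIZE)]
    rewards = torch.tensor([sum(r for _, _, r in t) for t in trajectories])

    # 2. Group-relative advantages (GRPO's baseline trick).
    advantages = compute_advantages(rewards, GROUP_SIZE)

    # 3. Clipped policy-gradient loss over each trajectory's log-probs.
    losses = []
    for traj, adv in zip(trajectories, advantages):
        lp = torch.stack([logprob for _, logprob, _ in traj])   # (turns,)
        losses.append(policy_gradient_loss(
            lp.unsqueeze(0), lp.detach().unsqueeze(0), adv.unsqueeze(0)))
    loss = torch.stack(losses).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()
```

Since old_logprobs here equals the freshly computed log-probs, this corresponds to a single on-policy update per batch of rollouts; the clipping only bites when reusing rollouts for multiple updates.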


Part 5: Q&A + what’s next (10 minutes)

  • Multi-agent RL
  • Long-horizon training
  • Process supervision
  • Resources to go deeper

Key takeaways

Participants will leave with:

  1. Clear mental model of RL for LLM training
  2. Working training code they can extend
  3. Ability to read real RL codebases (TRL, OpenRLHF, etc.)
  4. Confidence to build their own RL pipelines

Materials provided

  • GitHub repo with minimal RL framework (~500 lines)
  • Google Colab notebooks
  • Companion blog post
  • Pre-trained checkpoints
  • WandB logs

About the instructor

Siddharth Balyan works on RL environments, LLM agents, and training infrastructure. He has contributed multiple environments to Prime Intellect’s Environment Hub and worked on integrating RL frameworks with training systems. He was previously an MTS at Composio, where he built AI integration platforms for LLM agents, and co-authored a paper on fine-tuning LLMs for observability.

X/Twitter: @sidbing
Blog: https://sidb.in


🚀 Why this workshop? Why now?

DeepSeek-R1 proved that pure RL can produce reasoning capabilities rivaling closed models. GRPO is now the backbone of most LLM post-training — yet few developers have ever looked inside the training loop. This workshop bridges the gap: from agent builder → agent trainer.

How to attend this workshop

This workshop is open to The Fifth Elephant annual members.

The workshop is limited to 30 participants. Seats are allocated on a first-come-first-served basis. 🎟️

Contact information ☎️

For inquiries about the workshop, call +91-7676332020 or write to info@hasgeek.com.

Hosted by

The Fifth Elephant: jumpstart better data engineering and AI futures