This workshop demystifies how modern reasoning models like DeepSeek-R1 and Qwen are trained.
Participants will:
- Build a minimal RL training pipeline from scratch (~500 lines of code)
- Understand the core GRPO algorithm through code (not math)
- Train a small model live to solve multi-turn tasks
RL for LLMs is simpler than it looks.
At its core, it’s just weighted log probabilities.
By building everything from primitives, attendees leave with both conceptual clarity and working code they can extend.
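To make "weighted log probabilities" concrete, here is a minimal sketch of the idea (assuming nothing beyond PyTorch; this is illustrative, not the workshop's code): the log-probabilities of a sampled completion, weighted by a scalar reward.

```python
import torch
import torch.nn.functional as F

def reinforce_loss(logits, token_ids, reward):
    """Weighted log-probability loss: -reward * sum(log pi(token)).

    logits:    [seq_len, vocab_size] model outputs for the sampled completion
    token_ids: [seq_len] the tokens that were actually sampled
    reward:    scalar score for the whole completion
    """
    logprobs = F.log_softmax(logits, dim=-1)                         # [seq_len, vocab]
    chosen = logprobs.gather(1, token_ids.unsqueeze(1)).squeeze(1)   # [seq_len]
    return -(reward * chosen.sum())

# Toy usage: random "model output" for a 5-token completion over a 10-token vocab.
logits = torch.randn(5, 10, requires_grad=True)
tokens = torch.randint(0, 10, (5,))
loss = reinforce_loss(logits, tokens, reward=1.0)
loss.backward()  # with a positive reward, gradients push up the log-prob of the sampled tokens
```

The rest of the agenda (baselines, clipping, GRPO) refines this basic objective.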
This workshop is for developers who have:
- Built LLM agents (prompting, tool calling, chains)
- Never touched the training/RL side
No ML research background required. To participate, you'll need:
- Familiarity with Python and basic PyTorch (tensors, forward pass)
- Experience building at least one LLM agent or chain
- Laptop with Google Colab access (GPU provided)
You will learn:
- How modern LLMs like DeepSeek-R1 and Qwen are trained to reason
- The core RL algorithm (GRPO) behind most post-training today
- How to design environments with verifiable rewards (RLVR)
- How to train a small model to solve multi-turn tasks, live
- Why RL for LLMs now? (The DeepSeek-R1 moment)
- RL as “efficient in-context learning baked into weights”
- Offline vs online: DPO vs PPO/GRPO
- What is RLVR and why verifiable rewards matter
No code. Diagrams, intuition, the ‘why’ before the ‘how’.
From REINFORCE to GRPO in three steps:
- Reward × log probability = gradient signal
- Baselines to reduce variance (group-relative trick)
- Clipping for stability
You will implement three core functions:
- get_logprobs()
- compute_advantages()
- policy_gradient_loss()
GRPO implemented in ~50 lines.
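These functions are built live in the session; the sketch below is only one plausible shape for them (tensor layouts, the 1e-6 epsilon, and the clip_eps default are illustrative assumptions), showing the group-relative trick and the clipping step from the list above.

```python
import torch
import torch.nn.functional as F

def get_logprobs(logits, token_ids):
    """Per-token log-probs of the sampled tokens.
    logits: [batch, seq, vocab], token_ids: [batch, seq]."""
    logprobs = F.log_softmax(logits, dim=-1)
    return logprobs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)  # [batch, seq]

def compute_advantages(rewards):
    """Group-relative advantages: score each completion against its own group.
    rewards: [group] scalar rewards for completions sampled from the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def policy_gradient_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO-style clipped objective, applied per token and averaged.
    new/old_logprobs: [batch, seq], advantages: [batch]."""
    ratio = torch.exp(new_logprobs - old_logprobs)     # [batch, seq]
    adv = advantages.unsqueeze(-1)                     # [batch, 1], broadcast over tokens
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(ratio * adv, clipped).mean()
```

The key difference from vanilla REINFORCE is that each completion is scored against the mean reward of its own group, so no learned value network is needed.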
Core abstractions:
- Agent
- Environment
- Trajectory
Topics:
- Sampling with logprobs
- Designing verifiable rewards
- The interaction loop (collect_trajectory()), sketched below
- Math reasoning (GSM8K-style, regex-verifiable)
- Wordle RLVR demo (multi-turn, partial credit)
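As referenced above, here is an illustrative sketch of the environment side: a regex-verifiable reward for GSM8K-style math, a partial-credit reward for Wordle, and a generic collect_trajectory() loop. Only collect_trajectory() is named in the outline; the "Answer: <number>" format, the reset()/step() interface, and agent.act() are assumptions made for the example.

```python
import re

def math_reward(completion: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the final answer matches the gold answer.
    Assumes the model is prompted to end with 'Answer: <number>' (illustrative format)."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    return 1.0 if match and match.group(1) == gold_answer else 0.0

def wordle_reward(guess: str, secret: str) -> float:
    """Partial-credit reward: fraction of letters in the correct position."""
    hits = sum(g == s for g, s in zip(guess.lower(), secret.lower()))
    return hits / len(secret)

def collect_trajectory(agent, env, max_turns=8):
    """Roll out one multi-turn episode against an environment.
    Assumed interface (illustration only):
      env.reset() -> first observation (str)
      env.step(action) -> (next observation, reward, done)
      agent.act(observation) -> (action text, per-token logprobs)"""
    obs = env.reset()
    trajectory = []                      # list of (obs, action, logprobs, reward)
    for _ in range(max_turns):
        action, logprobs = agent.act(obs)
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, logprobs, reward))
        obs = next_obs
        if done:
            break
    return trajectory
```

In an RLVR setup, the reward functions are typically the only task-specific code; the interaction loop stays the same across environments.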
- Wire up agent + environment + GRPO
- Live training demo: Qwen-0.5B learns the task
- Interpreting loss curves & reward trends
- Common failure modes
- Where production RL systems go next
Attendees train in the provided Colab notebook.
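For orientation, here is a hedged sketch of how the pieces might be wired into a single update step. It reuses the illustrative helpers above and assumes a hypothetical agent object with act() and logprobs() methods; the notebook's actual structure may differ.

```python
import torch

def train_step(agent, env, optimizer, group_size=8):
    """One GRPO-style update, wiring together the sketches above (hypothetical agent API)."""
    # 1. Sample a group of trajectories for the same task.
    trajectories = [collect_trajectory(agent, env) for _ in range(group_size)]

    # 2. Total reward per trajectory, then group-relative advantages.
    rewards = torch.tensor([sum(step[-1] for step in traj) for traj in trajectories])
    advantages = compute_advantages(rewards)

    # 3. Clipped policy-gradient loss, then a single optimizer step.
    losses = []
    for traj, adv in zip(trajectories, advantages):
        old_lp = torch.cat([step[2] for step in traj]).detach()  # logprobs recorded at sampling time
        new_lp = agent.logprobs(traj)  # assumed method: re-score the sampled tokens under current weights
        losses.append(policy_gradient_loss(new_lp.unsqueeze(0), old_lp.unsqueeze(0), adv.unsqueeze(0)))
    loss = torch.stack(losses).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()  # numbers worth logging each step
```

Tracking those two returned numbers over training steps is what produces the loss curves and reward trends discussed in the session.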
- Multi-agent RL
- Long-horizon training
- Process supervision
- Resources to go deeper
Participants will leave with:
- Clear mental model of RL for LLM training
- Working training code they can extend
- Ability to read real RL codebases (TRL, OpenRLHF, etc.)
- Confidence to build their own RL pipelines
- GitHub repo with minimal RL framework (~500 lines)
- Google Colab notebooks
- Companion blog post
- Pre-trained checkpoints
- WandB logs
Siddharth Balyan works on RL environments, LLM agents, and training infrastructure. He has contributed multiple environments to Prime Intellect’s Environment Hub and worked on integrating RL frameworks with training systems. He was previously an MTS at Composio, where he built AI integration platforms for LLM agents, and co-authored a paper on fine-tuning LLMs for observability.
X/Twitter: @sidbing
Blog: https://sidb.in
DeepSeek-R1 proved that pure RL can produce reasoning capabilities rivaling closed models. GRPO is now the backbone of most LLM post-training — yet few developers have ever looked inside the training loop. This workshop bridges the gap: from agent builder → agent trainer.
This workshop is open to The Fifth Elephant annual members.
The workshop is limited to 30 participants. Seats will be available on a first-come, first-served basis. 🎟️
For inquiries about the workshop, contact +91-7676332020 or write to info@hasgeek.com