This workshop demystifies how modern reasoning models like DeepSeek-R1 and Qwen are trained.
Participants will:
- Build a minimal RL training pipeline from scratch (~500 lines of code)
- Understand the core GRPO algorithm through code (not math)
- Train a small model live to solve multi-turn tasks
RL for LLMs is simpler than it looks.
At its core, it’s just weighted log probabilities.
By building everything from primitives, attendees leave with both conceptual clarity and working code they can extend.
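To make "weighted log probabilities" concrete, here is a minimal sketch of the idea (assuming nothing beyond PyTorch; this is illustrative, not the workshop's code): the log-probabilities of a sampled completion, weighted by a scalar reward.

```python
import torch
import torch.nn.functional as F

def reinforce_loss(logits, token_ids, reward):
    """Weighted log-probability loss: -reward * sum(log pi(token)).

    logits:    [seq_len, vocab_size] model outputs for the sampled completion
    token_ids: [seq_len] the tokens that were actually sampled
    reward:    scalar score for the whole completion
    """
    logprobs = F.log_softmax(logits, dim=-1)                         # [seq_len, vocab]
    chosen = logprobs.gather(1, token_ids.unsqueeze(1)).squeeze(1)   # [seq_len]
    return -(reward * chosen.sum())

# Toy usage: random "model output" for a 5-token completion over a 10-token vocab.
logits = torch.randn(5, 10, requires_grad=True)
tokens = torch.randint(0, 10, (5,))
loss = reinforce_loss(logits, tokens, reward=1.0)
loss.backward()  # with a positive reward, gradients push up the log-prob of the sampled tokens
```

The rest of the agenda (baselines, clipping, GRPO) refines this basic objective.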
This workshop is for developers who have:
- Built LLM agents (prompting, tool calling, chains)
- Never touched the training/RL side
No ML research background required. To participate, you'll need:
- Familiarity with Python and basic PyTorch (tensors, forward pass)
- Experience building at least one LLM agent or chain
- Laptop with Google Colab access (GPU provided)
You will learn:
- How modern LLMs like DeepSeek-R1 and Qwen are trained to reason
- The core RL algorithm (GRPO) behind most post-training today
- How to design environments with verifiable rewards (RLVR)
- How to train a small model to solve multi-turn tasks, live
- Why RL for LLMs now? (The DeepSeek-R1 moment)
- RL as “efficient in-context learning baked into weights”
- Offline vs online: DPO vs PPO/GRPO
- What is RLVR and why verifiable rewards matter
No code. Diagrams, intuition, the ‘why’ before the ‘how’.
From REINFORCE to GRPO in three steps:
- Reward × log probability = gradient signal
- Baselines to reduce variance (group-relative trick)
- Clipping for stability
You will implement three core functions:
- get_logprobs()
- compute_advantages()
- policy_gradient_loss()
GRPO implemented in ~50 lines.
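These functions are built live in the session; the sketch below is only one plausible shape for them (tensor layouts, the 1e-6 epsilon, and the clip_eps default are illustrative assumptions), showing the group-relative trick and the clipping step from the list above.

```python
import torch
import torch.nn.functional as F

def get_logprobs(logits, token_ids):
    """Per-token log-probs of the sampled tokens.
    logits: [batch, seq, vocab], token_ids: [batch, seq]."""
    logprobs = F.log_softmax(logits, dim=-1)
    return logprobs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)  # [batch, seq]

def compute_advantages(rewards):
    """Group-relative advantages: score each completion against its own group.
    rewards: [group] scalar rewards for completions sampled from the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def policy_gradient_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """PPO-style clipped objective, applied per token and averaged.
    new/old_logprobs: [batch, seq], advantages: [batch]."""
    ratio = torch.exp(new_logprobs - old_logprobs)     # [batch, seq]
    adv = advantages.unsqueeze(-1)                     # [batch, 1], broadcast over tokens
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(ratio * adv, clipped).mean()
```

The key difference from vanilla REINFORCE is that each completion is scored against the mean reward of its own group, so no learned value network is needed.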
Core abstractions:
- Agent
- Environment
- Trajectory
Topics:
- Sampling with logprobs
- Designing verifiable rewards
- The interaction loop (collect_trajectory()), sketched below
- Math reasoning (GSM8K-style, regex-verifiable)
- Wordle RLVR demo (multi-turn, partial credit)
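As referenced above, here is an illustrative sketch of the environment side: a regex-verifiable reward for GSM8K-style math, a partial-credit reward for Wordle, and a generic collect_trajectory() loop. Only collect_trajectory() is named in the outline; the "Answer: <number>" format, the reset()/step() interface, and agent.act() are assumptions made for the example.

```python
import re

def math_reward(completion: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the final answer matches the gold answer.
    Assumes the model is prompted to end with 'Answer: <number>' (illustrative format)."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", completion)
    return 1.0 if match and match.group(1) == gold_answer else 0.0

def wordle_reward(guess: str, secret: str) -> float:
    """Partial-credit reward: fraction of letters in the correct position."""
    hits = sum(g == s for g, s in zip(guess.lower(), secret.lower()))
    return hits / len(secret)

def collect_trajectory(agent, env, max_turns=8):
    """Roll out one multi-turn episode against an environment.
    Assumed interface (illustration only):
      env.reset() -> first observation (str)
      env.step(action) -> (next observation, reward, done)
      agent.act(observation) -> (action text, per-token logprobs)"""
    obs = env.reset()
    trajectory = []                      # list of (obs, action, logprobs, reward)
    for _ in range(max_turns):
        action, logprobs = agent.act(obs)
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, logprobs, reward))
        obs = next_obs
        if done:
            break
    return trajectory
```

In an RLVR setup, the reward functions are typically the only task-specific code; the interaction loop stays the same across environments.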
- Wire up agent + environment + GRPO
- Live training demo: Qwen-0.5B learns the task
- Interpreting loss curves & reward trends
- Common failure modes
- Where production RL systems go next
Attendees train in the provided Colab notebook.
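For orientation, here is a hedged sketch of how the pieces might be wired into a single update step. It reuses the illustrative helpers above and assumes a hypothetical agent object with act() and logprobs() methods; the notebook's actual structure may differ.

```python
import torch

def train_step(agent, env, optimizer, group_size=8):
    """One GRPO-style update, wiring together the sketches above (hypothetical agent API)."""
    # 1. Sample a group of trajectories for the same task.
    trajectories = [collect_trajectory(agent, env) for _ in range(group_size)]

    # 2. Total reward per trajectory, then group-relative advantages.
    rewards = torch.tensor([sum(step[-1] for step in traj) for traj in trajectories])
    advantages = compute_advantages(rewards)

    # 3. Clipped policy-gradient loss, then a single optimizer step.
    losses = []
    for traj, adv in zip(trajectories, advantages):
        old_lp = torch.cat([step[2] for step in traj]).detach()  # logprobs recorded at sampling time
        new_lp = agent.logprobs(traj)  # assumed method: re-score the sampled tokens under current weights
        losses.append(policy_gradient_loss(new_lp.unsqueeze(0), old_lp.unsqueeze(0), adv.unsqueeze(0)))
    loss = torch.stack(losses).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()  # numbers worth logging each step
```

Tracking those two returned numbers over training steps is what produces the loss curves and reward trends discussed in the session.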
- Multi-agent RL
- Long-horizon training
- Process supervision
- Resources to go deeper
Participants will leave with:
- Clear mental model of RL for LLM training
- Working training code they can extend
- Ability to read real RL codebases (TRL, OpenRLHF, etc.)
- Confidence to build their own RL pipelines
- GitHub repo with minimal RL framework (~500 lines)
- Google Colab notebooks
- Companion blog post
- Pre-trained checkpoints
- WandB logs
Siddharth Balyan works on RL environments, LLM agents, and training infrastructure. He has contributed multiple environments to Prime Intellect’s Environment Hub and worked on integrating RL frameworks with training systems. He was previously an MTS at Composio, where he built AI integration platforms for LLM agents, and co-authored a paper on fine-tuning LLMs for observability.
X/Twitter: @sidbing
Blog: https://sidb.in
DeepSeek-R1 proved that pure RL can produce reasoning capabilities rivaling closed models. GRPO is now the backbone of most LLM post-training — yet few developers have ever looked inside the training loop. This workshop bridges the gap: from agent builder → agent trainer.
This workshop is open to The Fifth Elephant annual members.
The workshop is limited to 30 participants. Seats will be available on a first-come, first-served basis. 🎟️
For inquiries about the workshop, contact +91-7676332020 or write to info@hasgeek.com