FOSSMeet'26

Open Mind. Open Source.

Sridhar Pillai

@sri_1030

Distilling a 1.5B Code Reviewer That Rivals LLMs: An Iterative SFT+DPO Pipeline on OpenShift AI

Submitted Apr 7, 2026

Abstract

Large Language Models can review code, but deploying a 70B model behind every pull request is neither practical nor cost-effective. What if a 1.5B-parameter Small Language Model could deliver reviewer-quality comments on Go, Python, and Kubernetes diffs — running on a single GPU at inference time?

In this talk, we present a production-grade iterative distillation pipeline that trains a compact code review SLM to match — and in targeted domains, outperform — models 5–50x its size. Starting from Qwen2.5-Coder-1.5B-Instruct, we apply multi-stage knowledge distillation: a 7B teacher model generates structured review training data, Supervised Fine-Tuning (SFT) teaches the student to speak like a reviewer, and Direct Preference Optimization (DPO) teaches it what good reviews look like versus lazy “no issues found” responses. Crucially, each pipeline run retrains from the previous run’s checkpoint — the model gets smarter with every iteration.
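To make the SFT-then-DPO step concrete, here is a minimal pure-Python sketch of the DPO objective for a single preference pair. The log-probability values are invented for illustration; the real training uses per-token log-probs from the student and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Widens the (chosen - rejected) log-prob margin of the policy
    relative to the frozen reference model; beta scales the implicit
    KL pressure back toward the reference.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): small when the chosen review is preferred
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A detailed review (chosen) vs. a lazy "no issues found" (rejected):
loss_good = dpo_loss(-4.0, -9.0, -5.0, -5.0)  # policy prefers the detailed review
loss_bad = dpo_loss(-9.0, -4.0, -5.0, -5.0)   # policy prefers the lazy reply
```

The loss is lower when the policy already ranks the detailed review above the lazy one, which is exactly the gradient signal that steers the student away from empty responses.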

The entire workflow runs end-to-end on Red Hat OpenShift AI: Kubeflow Pipelines orchestrates a 7-step DAG, PyTorchJobs distribute QLoRA training across 20 GPUs on 5 nodes, KServe deploys the model with zero-downtime upgrades, MLflow tracks metrics across runs, and MinIO provides S3-compatible artifact storage. No notebook-driven one-offs — this is a repeatable, versioned, self-improving training loop.
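The 7-step DAG named in the outline (Resolve Version → Extract Gold → SFT → Deploy → Extract Preferences → DPO → Evaluate) can be sketched as a dependency graph; Kubeflow Pipelines derives the same ordering from task input/output wiring. The step names here are paraphrased from the outline, not the actual component IDs.

```python
from graphlib import TopologicalSorter

# The 7 pipeline steps and their upstream dependencies (a linear
# chain in this run, though the DAG form allows fan-out later).
DAG = {
    "resolve_version": set(),
    "extract_gold": {"resolve_version"},
    "sft_qlora": {"extract_gold"},
    "deploy_candidate": {"sft_qlora"},
    "extract_preferences": {"deploy_candidate"},
    "dpo": {"extract_preferences"},
    "evaluate": {"dpo"},
}

order = list(TopologicalSorter(DAG).static_order())
```

Because each run's `resolve_version` step picks up the previous run's checkpoint, the same DAG definition serves every iteration of the loop.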

We’ll walk through the architecture, show real before-and-after examples of the model catching goroutine leaks and password logging in Kubernetes operator code, share the hard-won lessons from scaling distributed training on ephemeral cloud GPUs, and demonstrate how DPO preference learning eliminates the “model collapse” failure mode that plagues naive fine-tuning.


Key Takeaways

  1. SLMs can be domain-specialized to rival LLMs — a 1.5B model fine-tuned on 8K+ curated code reviews produces structured, actionable feedback that generic 7B+ models miss.

  2. Iterative distillation is a force multiplier — each pipeline run retrains from the previous checkpoint (N-1 model), compounding improvements without human intervention.

  3. DPO fixes what SFT breaks — without preference optimization, fine-tuned models collapse to safe, empty responses. DPO teaches the model to prefer detailed analysis over “LGTM.”

  4. OpenShift AI provides a production MLOps backbone — Kubeflow Pipelines + PyTorchJob + KServe + MLflow is a complete, Kubernetes-native stack for training, deploying, and monitoring SLMs at scale.

  5. Multi-node distributed training is table stakes — we’ll show how to go from a single-GPU 2-hour training run to a 20-GPU 18-minute run using PyTorchJob with DDP, and the pitfalls (NCCL, /dev/shm, node scheduling) you’ll hit along the way.
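The numbers in takeaway 5 imply a roughly 6.7x speedup on 20 GPUs; a quick back-of-envelope check shows why parallel efficiency, not raw GPU count, is the honest metric. The formula is standard; only the timings come from the talk.

```python
def scaling_efficiency(t_single_min, t_multi_min, n_gpus):
    """Speedup and parallel efficiency for a distributed training run."""
    speedup = t_single_min / t_multi_min
    efficiency = speedup / n_gpus
    return speedup, efficiency

# Single-GPU 2 h run vs. the 20-GPU 18 min run quoted above:
speedup, eff = scaling_efficiency(120, 18, 20)
# speedup ≈ 6.7x, efficiency ≈ 33% — NCCL all-reduce traffic and
# per-step overheads eat the rest, which is why the pitfalls matter.
```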


Outline (30–40 min)

Time | Section | Content
5 min | The Problem | LLMs are too expensive for per-PR review. SLMs are too dumb out of the box. Can we close the gap?
5 min | Data Pipeline | Mining 200 real reviews from kubeflow/trainer, supplementing with 8K HuggingFace examples, teacher enrichment via Ollama, data quality traps (poison templates, lazy negatives).
8 min | The 7-Step Pipeline | Resolve Version → Extract Gold → SFT (QLoRA) → Deploy → Extract Preferences → DPO → Evaluate. Live demo of the Kubeflow DAG.
5 min | Iterative Training | N-1 model as base; how compounding SFT+DPO cycles improve scores across runs; MLflow metric comparisons.
5 min | Scaling to 20 GPUs | PyTorchJob multi-node setup, node selectors, GPU scheduling wars, NCCL debugging, 6x speedup results.
5 min | DPO & Model Collapse | Why the model learned to say “no issues found” for everything, how we diagnosed it (data poisoning), and how DPO preference pairs fixed it.
5 min | Live Demo | Submit a buggy Kubernetes operator diff → watch the SLM catch the goroutine leak; compare with the teacher model's output.
2 min | What’s Next | GRPO with reward functions, GitHub Action integration, expanding to Rust and Java.
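The DPO stage above consumes preference pairs: the same diff with a detailed review as the preferred completion and a lazy reply as the rejected one. A minimal sketch of one such record follows; the `prompt`/`chosen`/`rejected` field names are a common DPO dataset convention, and the talk's actual schema may differ.

```python
import json

def make_preference_pair(diff, detailed_review, lazy_review):
    """One DPO training record: the detailed review is 'chosen',
    the empty 'LGTM'-style reply is 'rejected'.
    Field names follow the common prompt/chosen/rejected convention."""
    return {
        "prompt": f"Review the following diff:\n{diff}",
        "chosen": detailed_review,
        "rejected": lazy_review,
    }

pair = make_preference_pair(
    "func watch() { go leakyLoop() }  // no context cancellation",
    "Potential goroutine leak: leakyLoop has no way to exit; "
    "pass a context.Context and select on ctx.Done().",
    "LGTM, no issues found.",
)
print(json.dumps(pair, indent=2))
```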

Speaker Bio

Sridhar Pillai — Software Engineer at Red Hat, working on AI/ML platform tooling for OpenShift AI. Building production SLM training pipelines and MLOps infrastructure on Kubernetes. Contributor to Kubeflow Training Operator ecosystem.


Technical Details

Component | Technology
Base Model | Qwen2.5-Coder-1.5B-Instruct
Teacher Model | qwen2.5-coder:7b-instruct (Ollama)
Training Method | QLoRA (4-bit) SFT + DPO
Distributed Training | PyTorchJob, DDP, 5 nodes × 4 T4 GPUs
Orchestration | Kubeflow Pipelines v2 (Argo Workflows)
Serving | KServe + vLLM runtime
Experiment Tracking | MLflow
Artifact Storage | MinIO (S3-compatible)
Platform | Red Hat OpenShift AI on AWS (g4dn.12xlarge)
Training Data | 8,296 curated code review examples (Go, Python, YAML)
Languages Reviewed | Go, Python, Kubernetes YAML
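A rough arithmetic check explains why QLoRA on T4s works here: at 4-bit quantization the frozen 1.5B base occupies well under 1 GiB of a T4's 16 GiB, leaving room for LoRA adapters, optimizer state, and activations. The estimate below covers weights only and ignores adapter and activation overhead.

```python
def quantized_weight_gib(n_params_billions, bits):
    """Approximate memory footprint of model weights at a given bit width."""
    return n_params_billions * 1e9 * bits / 8 / 1024**3

base_4bit = quantized_weight_gib(1.5, 4)    # ~0.7 GiB for the frozen base
base_fp16 = quantized_weight_gib(1.5, 16)   # ~2.8 GiB at fp16, for contrast
```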

Submission Tags

MLOps · Small Language Models · Knowledge Distillation · Code Review · Kubernetes · OpenShift AI · DPO · Distributed Training · Kubeflow


Hosted by

We are a Free and Open Source Software community at National Institute of Technology Calicut, Kerala