FOSSMeet'26

Open Mind. Open Source.

Sridhar Pillai

@sri_1030

Distilling a 1.5B Code Reviewer That Rivals LLMs: An Iterative SFT+DPO Pipeline on OpenShift AI

Submitted Apr 7, 2026

Abstract

Large Language Models can review code, but deploying a 70B model behind every pull request is neither practical nor cost-effective. What if a 1.5B-parameter Small Language Model could deliver reviewer-quality comments on Go, Python, and Kubernetes diffs — running on a single GPU at inference time?

In this talk, we present a production-grade iterative distillation pipeline that trains a compact code review SLM to match — and in targeted domains, outperform — models 5–50x its size. Starting from Qwen2.5-Coder-1.5B-Instruct, we apply multi-stage knowledge distillation: a 7B teacher model generates structured review training data, Supervised Fine-Tuning (SFT) teaches the student to speak like a reviewer, and Direct Preference Optimization (DPO) teaches it what good reviews look like versus lazy “no issues found” responses. Crucially, each pipeline run retrains from the previous run’s checkpoint — the model gets smarter with every iteration.
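To make the SFT-then-DPO step concrete, here is a minimal pure-Python sketch of the DPO objective for a single preference pair. The log-probability values are invented for illustration; the real training uses per-token log-probs from the student and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Widens the (chosen - rejected) log-prob margin of the policy
    relative to the frozen reference model; beta scales the implicit
    KL pressure back toward the reference.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): small when the chosen review is preferred
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A detailed review (chosen) vs. a lazy "no issues found" (rejected):
loss_good = dpo_loss(-4.0, -9.0, -5.0, -5.0)  # policy prefers the detailed review
loss_bad = dpo_loss(-9.0, -4.0, -5.0, -5.0)   # policy prefers the lazy reply
```

The loss is lower when the policy already ranks the detailed review above the lazy one, which is exactly the gradient signal that steers the student away from empty responses.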

The entire workflow runs end-to-end on Red Hat OpenShift AI: Kubeflow Pipelines orchestrates a 7-step DAG, PyTorchJobs distribute QLoRA training across 20 GPUs on 5 nodes, KServe deploys the model with zero-downtime upgrades, MLflow tracks metrics across runs, and MinIO provides S3-compatible artifact storage. No notebook-driven one-offs — this is a repeatable, versioned, self-improving training loop.
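The 7-step DAG named in the outline (Resolve Version → Extract Gold → SFT → Deploy → Extract Preferences → DPO → Evaluate) can be sketched as a dependency graph; Kubeflow Pipelines derives the same ordering from task input/output wiring. The step names here are paraphrased from the outline, not the actual component IDs.

```python
from graphlib import TopologicalSorter

# The 7 pipeline steps and their upstream dependencies (a linear
# chain in this run, though the DAG form allows fan-out later).
DAG = {
    "resolve_version": set(),
    "extract_gold": {"resolve_version"},
    "sft_qlora": {"extract_gold"},
    "deploy_candidate": {"sft_qlora"},
    "extract_preferences": {"deploy_candidate"},
    "dpo": {"extract_preferences"},
    "evaluate": {"dpo"},
}

order = list(TopologicalSorter(DAG).static_order())
```

Because each run's `resolve_version` step picks up the previous run's checkpoint, the same DAG definition serves every iteration of the loop.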

We’ll walk through the architecture, show real before-and-after examples of the model catching goroutine leaks and password logging in Kubernetes operator code, share the hard-won lessons from scaling distributed training on ephemeral cloud GPUs, and demonstrate how DPO preference learning eliminates the “model collapse” failure mode that plagues naive fine-tuning.


Key Takeaways

  1. SLMs can be domain-specialized to rival LLMs — a 1.5B model fine-tuned on 8K+ curated code reviews produces structured, actionable feedback that generic 7B+ models miss.

  2. Iterative distillation is a force multiplier — each pipeline run retrains from the previous checkpoint (N-1 model), compounding improvements without human intervention.

  3. DPO fixes what SFT breaks — without preference optimization, fine-tuned models collapse to safe, empty responses. DPO teaches the model to prefer detailed analysis over “LGTM.”

  4. OpenShift AI provides a production MLOps backbone — Kubeflow Pipelines + PyTorchJob + KServe + MLflow is a complete, Kubernetes-native stack for training, deploying, and monitoring SLMs at scale.

  5. Multi-node distributed training is table stakes — we’ll show how to go from a single-GPU 2-hour training run to a 20-GPU 18-minute run using PyTorchJob with DDP, and the pitfalls (NCCL, /dev/shm, node scheduling) you’ll hit along the way.
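The numbers in takeaway 5 imply a roughly 6.7x speedup on 20 GPUs; a quick back-of-envelope check shows why parallel efficiency, not raw GPU count, is the honest metric. The formula is standard; only the timings come from the talk.

```python
def scaling_efficiency(t_single_min, t_multi_min, n_gpus):
    """Speedup and parallel efficiency for a distributed training run."""
    speedup = t_single_min / t_multi_min
    efficiency = speedup / n_gpus
    return speedup, efficiency

# Single-GPU 2 h run vs. the 20-GPU 18 min run quoted above:
speedup, eff = scaling_efficiency(120, 18, 20)
# speedup ≈ 6.7x, efficiency ≈ 33% — NCCL all-reduce traffic and
# per-step overheads eat the rest, which is why the pitfalls matter.
```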


Outline (30–40 min)

Time | Section | Content
5 min | The Problem | LLMs are too expensive for per-PR review. SLMs are too dumb out of the box. Can we close the gap?
5 min | Data Pipeline | Mining 200 real reviews from kubeflow/trainer, supplementing with 8K HuggingFace examples, teacher enrichment via Ollama, data quality traps (poison templates, lazy negatives).
8 min | The 7-Step Pipeline | Resolve Version → Extract Gold → SFT (QLoRA) → Deploy → Extract Preferences → DPO → Evaluate. Live demo of the Kubeflow DAG.
5 min | Iterative Training | N-1 model as base; how compounding SFT+DPO cycles improve scores across runs; MLflow metric comparisons.
5 min | Scaling to 20 GPUs | PyTorchJob multi-node setup, node selectors, GPU scheduling wars, NCCL debugging, 6x speedup results.
5 min | DPO & Model Collapse | Why the model learned to say “no issues found” for everything, how we diagnosed it (data poisoning), and how DPO preference pairs fixed it.
5 min | Live Demo | Submit a buggy Kubernetes operator diff → watch the SLM catch the goroutine leak; compare with the teacher model's output.
2 min | What’s Next | GRPO with reward functions, GitHub Action integration, expanding to Rust and Java.
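The DPO stage above consumes preference pairs: the same diff with a detailed review as the preferred completion and a lazy reply as the rejected one. A minimal sketch of one such record follows; the `prompt`/`chosen`/`rejected` field names are a common DPO dataset convention, and the talk's actual schema may differ.

```python
import json

def make_preference_pair(diff, detailed_review, lazy_review):
    """One DPO training record: the detailed review is 'chosen',
    the empty 'LGTM'-style reply is 'rejected'.
    Field names follow the common prompt/chosen/rejected convention."""
    return {
        "prompt": f"Review the following diff:\n{diff}",
        "chosen": detailed_review,
        "rejected": lazy_review,
    }

pair = make_preference_pair(
    "func watch() { go leakyLoop() }  // no context cancellation",
    "Potential goroutine leak: leakyLoop has no way to exit; "
    "pass a context.Context and select on ctx.Done().",
    "LGTM, no issues found.",
)
print(json.dumps(pair, indent=2))
```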

Speaker Bio

Sridhar Pillai — Software Engineer at Red Hat, working on AI/ML platform tooling for OpenShift AI. Building production SLM training pipelines and MLOps infrastructure on Kubernetes. Contributor to Kubeflow Training Operator ecosystem.


Technical Details

Component | Technology
Base Model | Qwen2.5-Coder-1.5B-Instruct
Teacher Model | qwen2.5-coder:7b-instruct (Ollama)
Training Method | QLoRA (4-bit) SFT + DPO
Distributed Training | PyTorchJob, DDP, 5 nodes × 4 T4 GPUs
Orchestration | Kubeflow Pipelines v2 (Argo Workflows)
Serving | KServe + vLLM runtime
Experiment Tracking | MLflow
Artifact Storage | MinIO (S3-compatible)
Platform | Red Hat OpenShift AI on AWS (g4dn.12xlarge)
Training Data | 8,296 curated code review examples (Go, Python, YAML)
Languages Reviewed | Go, Python, Kubernetes YAML
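A rough arithmetic check explains why QLoRA on T4s works here: at 4-bit quantization the frozen 1.5B base occupies well under 1 GiB of a T4's 16 GiB, leaving room for LoRA adapters, optimizer state, and activations. The estimate below covers weights only and ignores adapter and activation overhead.

```python
def quantized_weight_gib(n_params_billions, bits):
    """Approximate memory footprint of model weights at a given bit width."""
    return n_params_billions * 1e9 * bits / 8 / 1024**3

base_4bit = quantized_weight_gib(1.5, 4)    # ~0.7 GiB for the frozen base
base_fp16 = quantized_weight_gib(1.5, 16)   # ~2.8 GiB at fp16, for contrast
```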

Submission Tags

MLOps · Small Language Models · Knowledge Distillation · Code Review · Kubernetes · OpenShift AI · DPO · Distributed Training · Kubeflow


Hosted by

We are a Free and Open Source Software community at National Institute of Technology Calicut, Kerala