Apr 18–19, 2026 (Sat–Sun), 09:00 AM – 06:00 PM IST
RamaChaitanya Kandula
Submitted Mar 18, 2026
Distributed AI training runs over RDMA and depends on each GPU using the right NIC. When the scheduler assigns a rank to a NIC that isn’t the best for that GPU (e.g. cross-NUMA or extra PCIe hops), you can lose 20–40% throughput or see unstable runs. Most cluster schedulers don’t understand PCIe/NUMA topology, so we need a small, reliable tool that discovers how NICs and GPUs are connected and outputs placement recommendations—which rank should use which NIC and, optionally, which CPU/NUMA node.
This session walks through building that tool in Rust. We’ll cover where topology comes from (nvidia-smi, sysfs, netlink), how to model NIC–GPU–NUMA affinity and score “best NIC per GPU,” and how to emit placement (e.g. env vars or JSON) for the launcher or scheduler. We’ll show why Rust is a good fit: safe parsing of vendor output, clear data structures for devices and affinity, no unsafe in the policy layer, and a small binary that fits into node agents or sidecars. The same approach applies to any environment where NIC topology and placement matter for AI or HPC.
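The affinity-scoring step described above can be sketched in a few lines of Rust. This is a minimal, hypothetical illustration, not the session's actual code: the struct fields, the scoring weights, and the `RANK{n}_NCCL_IB_HCA` output convention are all assumptions (in a real tool the GPU and NIC data would come from nvidia-smi and sysfs rather than being hard-coded).

```rust
// Hypothetical sketch: score NIC candidates for each GPU by NUMA affinity
// and PCIe distance, then emit one placement line per rank for a launcher.

#[derive(Debug, Clone)]
struct Gpu {
    index: usize,
    numa_node: i32,
}

#[derive(Debug, Clone)]
struct Nic {
    name: &'static str,
    numa_node: i32,
    pcie_hops: u32,
}

/// Lower score is better: a large penalty for crossing NUMA nodes,
/// then break ties by PCIe hop count. Weights are illustrative.
fn score(gpu: &Gpu, nic: &Nic) -> u32 {
    let numa_penalty = if gpu.numa_node == nic.numa_node { 0 } else { 100 };
    numa_penalty + nic.pcie_hops
}

/// Pick the lowest-scoring NIC for a GPU.
fn best_nic<'a>(gpu: &Gpu, nics: &'a [Nic]) -> Option<&'a Nic> {
    nics.iter().min_by_key(|nic| score(gpu, nic))
}

fn main() {
    // In the real tool this topology would be discovered, not hard-coded.
    let gpus = vec![
        Gpu { index: 0, numa_node: 0 },
        Gpu { index: 1, numa_node: 1 },
    ];
    let nics = vec![
        Nic { name: "mlx5_0", numa_node: 0, pcie_hops: 1 },
        Nic { name: "mlx5_1", numa_node: 1, pcie_hops: 1 },
    ];
    for gpu in &gpus {
        if let Some(nic) = best_nic(gpu, &nics) {
            // One env-var line per rank, consumable by a launcher wrapper.
            println!("RANK{}_NCCL_IB_HCA={}", gpu.index, nic.name);
        }
    }
}
```

The policy layer stays entirely in safe Rust: parsing vendor output and sysfs happens elsewhere, and this scoring code only sees plain structs, which is part of the argument the abstract makes for Rust.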
Key takeaways from the session include:
This talk will be relevant to developers and SREs working on AI/ML infrastructure, distributed training, or GPU clusters.
RamaChaitanya is a senior software engineer on the Nutanix Networking team, focused on network acceleration technologies, with experience in high-performance networking and GPU clusters. He is exploring Rust for systems and networking tooling and is interested in NIC topology and placement for AI workloads.