Weight Transfer for RL Post-Training in under 2 seconds

Summary

We recently achieved 1.3-second cross-machine parameter updates for Kimi-K2 (1T parameters), transferring weights from 256 training GPUs (BF16) to 128 inference GPUs (FP8).

In asynchronous reinforcement learning fine-tuning, training and inference run on separate GPUs. After each training step, the new weights must be pushed to the inference nodes. Many existing frameworks take several seconds, or even minutes, for trillion-parameter models.

By leveraging RDMA point-to-point communication, we make weight transfer blazing fast without changing the inference engine, and the code stays easy to write and maintain.

RDMA WRITE: one-sided transfers

Our solution is built on RDMA WRITE, a one-sided primitive where the source directly writes into the destination's GPU memory.

def rdma_write(src_ptr, dst_ptr, size, src_mr, dst_mr): ...

The destination side is not even notified of the transfer. This gives us low-latency, high-throughput, zero-copy transfers driven entirely by the training nodes, with no control logic on the inference nodes. (A minimal sketch of these one-sided semantics appears at the end of this section.)

High-level workflow

1. Metadata collection – The controller gathers parameter metadata from all training and inference GPUs.
2. Schedule computation – The controller computes a static weight transfer schedule, mapping which training GPU sends which parameter to which inference GPU, and in what order.
3. Schedule distribution – The controller sends the schedule to all training GPUs.
4. Execution – After each training step, the controller signals the training GPUs to start their transfers.

(A sketch of what such a static schedule could look like also appears at the end of this section.)

Weight transfer execution

With the high-level workflow defined, the key challenge is executing weight transfers efficiently at trillion-parameter scale. Here we describe the details of the execution path.

DeviceMesh and Mesh Groups

Parameters in training are distributed according to FSDP placements. Using full_tensor(), every GPU in a DeviceMesh can reconstruct the full parameter, so any of them can serve as the source for a weight transfer.

Multiple disjoint DeviceMeshes form a mesh group. Because DeviceMeshe...
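To make the rdma_write stub above concrete, here is a minimal, runnable sketch of the one-sided semantics. It uses plain host memory via ctypes as a stand-in for registered GPU memory regions; a real implementation would go through an RDMA verbs library, and all names below are illustrative, not the post's actual API.

import ctypes

class MemoryRegion:
    """Stand-in for a registered memory region (MR): a buffer plus the
    address a NIC would use to reach it. Illustrative only."""
    def __init__(self, size):
        self.buf = ctypes.create_string_buffer(size)
        self.addr = ctypes.addressof(self.buf)
        self.size = size

def rdma_write(src_mr, src_off, dst_mr, dst_off, size):
    """One-sided write: the source pushes bytes directly into the
    destination's memory. The destination runs no code and receives no
    notification; completion is observed only on the source side."""
    assert src_off + size <= src_mr.size
    assert dst_off + size <= dst_mr.size
    ctypes.memmove(dst_mr.addr + dst_off, src_mr.addr + src_off, size)

src = MemoryRegion(16)   # training side: staged new weights
dst = MemoryRegion(16)   # inference side: in reality, remote GPU memory
src.buf.value = b"new weights"
rdma_write(src, 0, dst, 0, len(b"new weights"))
assert dst.buf.raw.startswith(b"new weights")  # dst never ran any code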
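The static schedule from the workflow above can be pictured as a flat list of transfer operations that the controller computes once and replays after every training step. The data layout and the round-robin source assignment below are assumptions for illustration; the post does not spell out its actual scheduling policy.

from dataclasses import dataclass

@dataclass(frozen=True)
class TransferOp:
    param_name: str  # which parameter to send
    src_rank: int    # training GPU that holds or reconstructs it
    dst_rank: int    # inference GPU that receives it
    nbytes: int      # size after any dtype conversion (e.g. BF16 -> FP8)

def build_schedule(param_sizes, train_ranks, infer_ranks):
    """Assign every (parameter, destination) pair a source rank,
    spreading sources round-robin so no single NIC becomes a hotspot.
    Hypothetical policy, for illustration only."""
    schedule, i = [], 0
    for name, nbytes in sorted(param_sizes.items()):
        for dst in infer_ranks:
            src = train_ranks[i % len(train_ranks)]
            schedule.append(TransferOp(name, src, dst, nbytes))
            i += 1
    return schedule

# Computed once by the controller, then distributed to all training GPUs.
sched = build_schedule({"w1": 4096, "w2": 8192},
                       train_ranks=[0, 1, 2, 3], infer_ranks=[0, 1])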
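For the DeviceMesh discussion: under FSDP2-style sharding, parameters are DTensors, and DTensor.full_tensor() all-gathers the shards so that every rank in the mesh ends up holding the complete parameter. A hedged sketch of that reconstruction step, assuming recent PyTorch with the public torch.distributed.tensor module (process-group setup and model wrapping elided):

import torch
from torch.distributed.tensor import DTensor

def materialize_full_param(param):
    """Reconstruct the full parameter on this rank so it can serve as
    the source of an RDMA WRITE."""
    if isinstance(param.data, DTensor):
        # full_tensor() all-gathers shards across the parameter's
        # DeviceMesh; every participating rank gets the whole tensor.
        full = param.data.full_tensor()
    else:
        full = param.data  # already unsharded / replicated
    # The post converts BF16 training weights to FP8 for inference; the
    # exact cast/quantization path is not shown here.
    return full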
