GPU Direct RDMA
2025-01-08
GPU Direct RDMA is one of those names that sounds like fake technical lorem ipsum, but the idea behind it is simple and genuinely useful.
In a traditional cluster, every packet takes the scenic route: GPU → CPU memory → NIC → network → NIC → CPU memory → GPU. With GPU Direct RDMA, the NIC can talk to GPU memory directly, skipping a couple of layovers and cutting down latency.
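Concretely, the baseline path looks something like the sketch below. This is a minimal illustration, not anything from a real codebase: `send_via_bounce`, the already-connected socket, and the buffer size are all placeholders I'm inventing to show where the extra copy lives.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/socket.h>

// Baseline path (no GPU Direct RDMA): GPU memory -> host bounce buffer -> NIC.
// `sock` is assumed to be an already-connected socket; purely illustrative.
int send_via_bounce(int sock, const float *dev_buf, size_t n)
{
    size_t bytes = n * sizeof(float);
    float *host_buf = (float *)malloc(bytes);   // the extra copy GDR removes
    if (!host_buf)
        return -1;

    // hop 1: GPU memory -> CPU memory
    if (cudaMemcpy(host_buf, dev_buf, bytes, cudaMemcpyDeviceToHost) != cudaSuccess) {
        free(host_buf);
        return -1;
    }

    // hop 2: CPU memory -> NIC, through the kernel's network stack
    ssize_t sent = send(sock, host_buf, bytes, 0);

    free(host_buf);
    return (sent == (ssize_t)bytes) ? 0 : -1;
}

Two copies and a syscall, just to move bytes that already sit in memory the NIC could, in principle, reach directly.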
Why should anyone care?
For small messages, the overhead dominates. If the per-hop latency is \( \ell \) and the CPU bounce adds another \( 2\ell \) (one extra copy on each end), then the end-to-end latency goes from roughly
\[ T_{\text{baseline}} \approx 2\ell + 2\ell = 4\ell \]
to
\[ T_{\text{gdr}} \approx 2\ell, \]
which is the sort of napkin math that makes systems people suspiciously happy.
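Plug in a made-up number just to see the shape of it, say \( \ell \approx 1\,\mu\text{s} \) per hop (purely illustrative, not a measurement):
\[ T_{\text{baseline}} \approx 4 \times 1\,\mu\text{s} = 4\,\mu\text{s}, \qquad T_{\text{gdr}} \approx 2 \times 1\,\mu\text{s} = 2\,\mu\text{s}, \]
so roughly a 2x cut for messages small enough that wire time and bandwidth don't dominate.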
In real systems, it’s messier, but the idea is the same: fewer copies, fewer context switches, fewer chances to regret your life choices at 3 a.m. while profiling an all-reduce.
Tiny pseudo-example
Here’s a toy-ish snippet that pretends we have a buffer we want to expose for RDMA:
// extremely fake example – for vibes only
// (gdr_handle_t, gdr_pin_buffer, and nic_ctx are placeholders for whatever
//  your NIC's verbs-like registration API actually calls these things)
float *buf;
size_t n = 1 << 20;                               // 1M floats, ~4 MiB
if (cudaMalloc(&buf, n * sizeof(float)) != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed\n");
    return 1;
}
cudaMemset(buf, 0, n * sizeof(float));            // zero the buffer on the device

// register the GPU buffer with the NIC via some verbs-like API
gdr_handle_t handle = gdr_pin_buffer(nic_ctx, buf, n * sizeof(float));
if (!handle) {
    fprintf(stderr, "failed to register GPU buffer for RDMA\n");
    return 1;
}

// now the remote node can RDMA-read/write directly into `buf`
// without a CPU bounce buffer in the middle.
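For completeness, the remote side of that last comment might look something like this. Everything here is still invented (`gdr_wr_t`, `gdr_rdma_write`, `peer_gpu_addr`, and `peer_rkey` are stand-ins for whatever your actual verbs library provides), but the shape is the point: the write targets the peer's GPU address directly.

// still vibes-only: the remote node writes straight into the peer's GPU buffer.
// `peer_gpu_addr` and `peer_rkey` would arrive via some out-of-band exchange
// (e.g. over TCP) after the peer pins its buffer, as in the snippet above.
gdr_wr_t wr = {
    .local_buf   = my_send_buf,            // source buffer (could itself be GPU memory)
    .length      = n * sizeof(float),
    .remote_addr = peer_gpu_addr,          // where the peer's pinned `buf` lives
    .remote_key  = peer_rkey,              // key returned by the peer's registration
};
if (gdr_rdma_write(nic_ctx, &wr) != 0) {
    fprintf(stderr, "RDMA write to peer GPU buffer failed\n");
    return 1;
}
// the payload lands in the peer's GPU memory without touching either host's RAM;
// the CPUs only handle the setup/control path, not the data path.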