The guide does not say how efficient this run was in terms of GPU utilization (achieved TFLOPS / theoretical peak TFLOPS).
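That ratio is commonly reported as MFU (model FLOPs utilization). A back-of-envelope sketch, with purely illustrative numbers that are not taken from the guide (the peak figure is NVIDIA's published H100 SXM dense BF16 spec):

    # Hedged back-of-envelope MFU: achieved model FLOP/s over hardware peak.
    # The achieved number below is an assumption for illustration, not a measurement.
    achieved_tflops_per_gpu = 400   # assumed: model FLOPs per step / (step time * num GPUs) / 1e12
    peak_tflops_bf16 = 989          # H100 SXM dense BF16 peak, per NVIDIA's spec sheet
    mfu = achieved_tflops_per_gpu / peak_tflops_bf16
    print(f"MFU: {mfu:.1%}")        # -> 40.4%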


Hey, there are some details about this scattered throughout. The answer really depends on the technique. With DDP you can fairly easily match single-GPU throughput per GPU (we were getting ~80% GPU utilization across multiple nodes, IIRC), as long as all the workers are getting the same-sized data.
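To make the DDP case concrete, here is a minimal sketch (mine, not code from the guide) using PyTorch's DistributedDataParallel with a toy model; it assumes a torchrun launch so the LOCAL_RANK env var is set:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

    def main():
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Toy stand-in model and data, purely for illustration.
        model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank),
                    device_ids=[local_rank])
        dataset = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))

        # DistributedSampler hands every rank an equal-sized shard, which is
        # what keeps per-GPU throughput close to single-GPU throughput.
        loader = DataLoader(dataset, batch_size=64,
                            sampler=DistributedSampler(dataset))

        opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            opt.zero_grad()
            torch.nn.functional.mse_loss(model(x), y).backward()  # grads all-reduced here
            opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with something like `torchrun --nnodes=8 --nproc-per-node=8 train_ddp.py`, the gradient all-reduce is the only cross-GPU synchronization per step, which is why utilization stays high as long as the shards are balanced.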

Once you move to training really large models like Llama 405B with FSDP and use things like CPU offloading, throughput drops quite a bit because of all the data transfers between the CPU and GPU. If you have a large enough cluster and don't have to use CPU offloading, you can get higher throughput.
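For contrast, a hedged sketch of the FSDP + CPU-offload setup (PyTorch's torch.distributed.fsdp API; the toy model stands in for a Llama-405B-scale network, and this is my illustration rather than the guide's code):

    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload

    def main():
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Tiny placeholder; in practice this would be a multi-hundred-billion
        # parameter model that cannot fit in GPU memory unsharded.
        model = torch.nn.Sequential(
            *[torch.nn.Linear(4096, 4096) for _ in range(8)])

        # offload_params=True parks each rank's parameter (and gradient) shards
        # in host RAM, so every forward/backward pays for CPU<->GPU transfers,
        # which is the throughput hit described above.
        model = FSDP(model,
                     cpu_offload=CPUOffload(offload_params=True),
                     device_id=local_rank)

        opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
        x = torch.randn(8, 4096, device=f"cuda:{local_rank}")
        model(x).square().mean().backward()
        opt.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()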


You are talking about a specific setup:

> Here we are going to utilize an 8 node cluster (64 H100 GPUs)



