My 8x H100 AI training rig - 6 months of planning, finally done
AI / ML Server · 4 posts · 0 views
MLEngineer_K
1d ago#1
After 6 months of sourcing parts and 3 weeks of build time, my 8x NVIDIA H100 SXM5 cluster is finally online. Running on a Supermicro HGX H100 carrier board, dual EPYC 9654, 3TB DDR5 ECC RAM. Total cost landed around $140k. Happy to answer questions about the build process - it was not straightforward.
GPUHoarder
1d ago#2
This is absolutely insane. What are you training on it? Also curious about the power setup - 8x H100 SXM5 is like 5600W just for the GPUs.
MLEngineer_K
1d ago#3
Running fine-tuning jobs for a few LLMs. Power is handled by dual 10kW PDUs, 3-phase 208V circuit. The cooling alone took a dedicated 5-ton CRAC unit.
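For anyone sanity-checking those numbers, here's a rough back-of-envelope sketch. The TDP figures and the 1,500 W overhead allowance are assumptions pulled from public spec sheets, not measurements from this build:

```python
# Rough power and cooling budget for an 8x H100 SXM5 / dual EPYC 9654 box.
# TDPs are published spec-sheet values; overhead is an assumed allowance
# for RAM, NVSwitch, NICs, fans, and PSU losses.
H100_SXM5_TDP_W = 700
EPYC_9654_TDP_W = 360
OVERHEAD_W = 1500  # assumption, not measured

total_w = 8 * H100_SXM5_TDP_W + 2 * EPYC_9654_TDP_W + OVERHEAD_W
print(f"Estimated peak draw: {total_w} W")  # 5600 + 720 + 1500 = 7820 W

# 5-ton CRAC capacity: 1 ton = 12,000 BTU/hr, and 1 W = 3.412 BTU/hr
crac_w = 5 * 12_000 / 3.412
print(f"CRAC heat rejection: ~{crac_w / 1000:.1f} kW")  # ~17.6 kW
```

So the 5-ton unit has roughly 2x headroom over estimated peak draw, which matches the usual advice of not sizing cooling right at the load.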
ThermalPete
1d ago#4
I'd love to see a full writeup on the cooling solution if you have time. The SXM form factor in a homelab environment must have been a nightmare to cool.