My 8x H100 AI training rig - 6 months of planning, finally done

AI / ML Server · 4 posts · 0 views
MLEngineer_K (OP)
1d ago#1

After 6 months of sourcing parts and 3 weeks of build time, my 8x NVIDIA H100 SXM5 cluster is finally online. Running on a Supermicro HGX H100 carrier board, dual EPYC 9654, 3TB DDR5 ECC RAM. Total cost landed around $140k. Happy to answer questions about the build process - it was not straightforward.

GPUHoarder
1d ago#2

This is absolutely insane. What are you training on it? Also curious about the power setup - 8x H100 SXM5 is like 5600W just for the GPUs.

MLEngineer_K
1d ago#3

Running fine-tuning jobs for a few LLMs. Power is handled by dual 10kW PDUs on a 3-phase 208V circuit. The cooling alone took a dedicated 5-ton CRAC unit.
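For anyone sanity-checking the numbers, here's a quick back-of-envelope sketch. The 700 W per-GPU TDP matches the published H100 SXM5 spec; the ~1.5 kW non-GPU load (CPUs, RAM, fans, storage) is my own rough assumption, not the OP's figure:

```python
# Rough power/cooling budget for the rig described above.
GPU_TDP_W = 700        # H100 SXM5 rated TDP
NUM_GPUS = 8
OTHER_LOAD_W = 1500    # assumption: dual EPYC 9654 + 3TB RAM + fans/storage

gpu_load_w = GPU_TDP_W * NUM_GPUS          # 5600 W, the figure quoted earlier
total_load_w = gpu_load_w + OTHER_LOAD_W   # ~7100 W at full tilt

pdu_capacity_w = 2 * 10_000                # dual 10 kW PDUs
headroom_w = pdu_capacity_w - total_load_w

# 1 ton of refrigeration = 12,000 BTU/hr ~= 3517 W of heat removal,
# so a 5-ton CRAC comfortably covers the full heat load.
cooling_capacity_w = 5 * 3517

print(f"GPU load: {gpu_load_w} W")
print(f"Total load: {total_load_w} W (PDU headroom: {headroom_w} W)")
print(f"CRAC capacity: {cooling_capacity_w} W")
```

At full load the dual PDUs still have plenty of margin, and the CRAC has roughly 2.5x the capacity needed for the steady-state heat, which leaves room for the ambient load of the room itself.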

ThermalPete
1d ago#4

Would love to see a full writeup on the cooling solution if you have time. Cooling the SXM form factor in a homelab environment must have been a nightmare.
