Power stabilization for AI training datacenters
Microsoft, OpenAI, NVIDIA
A true cross-company collaboration
https://aka.ms/PowerStabilization

Problem statement
As nodes simultaneously transition between compute-intensive (high power) and communication-intensive (low power) phases:
- Power oscillations can cause generation equipment damage and flicker
- Can de-stabilize the interconnection
- Previously seen failures by the industry
- High frequencies: 3 to 30 Hz
- Low frequencies: 0.1 to 2 Hz

Time domain
- Max permitted rate of increase in power demand (MW/s)
- Allowed short-term deviation in power draw before ramp constraints are triggered
Frequency domain
- For each range

Exploring solutions across the stack

GPU power smoothing / Min Power Floor (MPF)
- GPU power shaping to meet datacenter requirements
- NVIDIA GB200 implementation
- Bounding box controller: minimum power floor (MPF), 20%-90% of TDP
- Ramp rate controller: ramp-up/down rates, hysteresis for ramp down
- In-band and out-of-band support
- Cumulative lifetime associated with the feature
- Accounting for the EDPp and MPF range, achievable swing of 20%
- Can work in synergy with other mitigation methods

Software Mitigation (Firefly)
- Power-hungry secondary workload: artificial workload, or a low-priority useful job
- Low context to avoid performance impact to the training job
- 5% impact achieved using MPS
- Ability to increase consumption up to 100% of the TDP
- Telemetry: GPU activity counters, fine-grained telemetry
- Start-up and back-off mechanism
- Trade-off between potential performance impact and power swing masking granularity

An energy-storage solution
1. That directly measures the load, has enough capacitance to support the workload,
2. Meets the sudden rise/drop needs in power, and switches modes between charging and discharging quickly.
Ene
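
The minimum power floor and ramp rate controller listed under GPU power smoothing can be illustrated with a small simulation. This is a minimal sketch only, not NVIDIA's GB200 implementation: the class name `PowerShaper`, its parameters, and the 1000 W TDP are hypothetical, and the hysteresis is modeled as a simple hold time before ramp-down begins, which is an assumption. Only the 20%-90% floor range comes from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class PowerShaper:
    """Toy model of a power floor plus ramp-rate limiter (all names hypothetical)."""
    tdp_w: float              # device TDP in watts (illustrative value, not from the slides)
    floor_frac: float         # minimum power floor as a fraction of TDP
    ramp_up_w_per_s: float    # max allowed increase in shaped power per second
    ramp_down_w_per_s: float  # max allowed decrease in shaped power per second
    hold_s: float             # assumed hysteresis: time demand must stay low before ramping down
    level_w: float = field(init=False)
    _low_for_s: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        # The slides give 20%-90% of TDP as the configurable floor range.
        if not 0.2 <= self.floor_frac <= 0.9:
            raise ValueError("floor_frac must be within 0.2 and 0.9")
        self.level_w = self.floor_frac * self.tdp_w

    def step(self, demand_w: float, dt_s: float) -> float:
        """Advance by dt_s seconds and return the shaped power level in watts."""
        floor_w = self.floor_frac * self.tdp_w
        # Bounding box: never below the floor, never above TDP.
        target_w = min(max(demand_w, floor_w), self.tdp_w)
        if target_w > self.level_w:
            self._low_for_s = 0.0
            self.level_w = min(target_w, self.level_w + self.ramp_up_w_per_s * dt_s)
        elif target_w < self.level_w:
            # Hold the current level until demand has stayed low for hold_s.
            self._low_for_s += dt_s
            if self._low_for_s >= self.hold_s:
                self.level_w = max(target_w, self.level_w - self.ramp_down_w_per_s * dt_s)
        else:
            self._low_for_s = 0.0
        return self.level_w


def simulate(shaper: PowerShaper, demand_w: list[float], dt_s: float) -> list[float]:
    """Run the shaper over a demand trace sampled every dt_s seconds."""
    return [shaper.step(d, dt_s) for d in demand_w]


if __name__ == "__main__":
    shaper = PowerShaper(tdp_w=1000.0, floor_frac=0.5,
                         ramp_up_w_per_s=100.0, ramp_down_w_per_s=50.0, hold_s=2.0)
    # Two seconds of compute-phase demand, then an abrupt drop to idle.
    print(simulate(shaper, [1000.0, 1000.0] + [0.0] * 6, dt_s=1.0))
```

In this model the shaped level rises no faster than the ramp-up rate, holds briefly after demand drops, then falls at the ramp-down rate until it reaches the floor. In a real system the gap between the shaped level and the actual workload demand would have to be covered by extra dissipation or throttling, which this sketch does not model.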