Liquid Cooling Solutions for Large-Scale AI Clusters
Supermicro
Daniel Kapesa, Product Manager, Supermicro
AI CLUSTERS

Outline
1. AI Cluster Workloads Challenges
2. Liquid Cooling Fundamentals
3. Facility-Level Heat Rejection
4. Actionable Strategies for Deployment
5. Call to Action

AI Cluster Workloads Challenges

xPUs Power Trend
- AI power demand requires new power delivery and cooling approaches
[Figure: AI vs Compute Power; label: Rubin NVL576]

Thermal Challenges in AI Clusters
- AI workloads generate unprecedented heat densities (multi-kilowatt GPUs per node).
- Managing heat efficiently is critical to maintaining performance and reliability.
- Traditional air cooling faces physical and efficiency limits at scale.
[Figure labels: Design, Power, Thermals]

Liquid Cooling Fundamentals
- Direct liquid cooling removes heat at the source (cold plates on CPUs/GPUs)
- Higher heat transfer efficiency than air cooling
- Key parameters: coolant temperature, flow rate, pressure, redundancy

Leveraging OCP Collaboration
- Additional cold plates: remove 90%+ of system heat
- Covers: DIMMs, VRMs, PCIe, PSUs

System Architecture Overview
- Modular building blocks for scalable liquid-cooled AI clusters
- Modular components: cold plates, coolant distribution units (CDUs), manifolds
- Scalability and serviceability considerations for hyperscale deployments
- Importance of balanced coolant flow and temperature control

Vertical CDMs
- Increased server density per rack
- Enhanced serviceability and maintenance
- Front I/O for cold aisle access
- Front NIC cabling; rear liquid cooling and power cables

Integration Challenges
- Mechanical and fluidic interface complexity in dense racks
- Leak prevention and maintenance accessibility
- Monitoring coolant quality, temperature, and flow in real time

Rack Level Leakage Mechanism
- Factory-tested hose kits with pre-installed seals
- Leak detection sensor