Thermal-Safe Operation for GPU under Power Constraints Using Reinforcement Learning

Yu-Han Chiu, Tsung-Kuang Liao, Jia-Han Li (National Taiwan University)
Jie-Hong Hou, Chien-Er Lai, Shih-Wen Chen, Chao-Ching Ho (National Taipei University of Technology)
Hung-Hsuan Lin (Delta Electronics, Inc.)

FUTURE TECHNOLOGIES SYMPOSIUM

Outline
1. Introduction
2. Methodology
3. Results and Discussion
4. Conclusion

Introduction

Background
With the rapid growth of machine learning and generative AI workloads in recent years, demand for GPU throughput from both individuals and enterprises has increased significantly.
Delivering greater computational power requires adding more processing units, which in turn raises the design power consumption of GPUs.

Motivation: Cooling Limit
Air cooling in a 1U (1.75-inch-tall) chassis can dissipate only about 250 W, and even a 2U chassis can dissipate only up to 500 W, which is already approaching the limit.
Increasing power further requires liquid cooling.
Because AI servers demand fast response, the thermal buffer time shrinks, the reaction window narrows, and the risk of overheating shutdowns increases.

Motivation: Cooling Failure Has Become One of the Major Risks
According to data center incident reports, cooling system failure is the second leading cause of unexpected downtime.
When airflow is obstructed or external cooling is interrupted, GPU temperatures can rise sharply within a short time.

Figure 1. Ratio of Shutdown Causes.
Source: D. Donnellan and A. Lawrence. Annual outage analysis 2024: The causes and impacts of IT and data center outages (executive summary). Technical Report 131, Uptime Institute Intelligence, New York, NY, Mar. 2024.

Motivation: Fan Failure Experiment
Throttling at 8