Power Virus and Perf evaluation strategy for AI accelerator systems

Jeremy (Jinghan) Yang, Hardware Systems Engineer, Meta
Samu Chakki, Hardware Systems Engineer, Meta
Richa Mishra, Hardware Systems Engineer, Meta

SERVER: AI HW SW CO-DESIGN/NIC/HPC

Agenda
Why do we need Power Virus and Application Power Characterization in AI accelerator systems
Strategy overview
Engineering workstreams
Next steps and call for industry collaboration

Context
With the fast growth of compute demand for AI accelerators and systems, we see a dramatic power increase from silicon and module to compute/network nodes, all the way to the rack and beyond.
Power Virus and Application workload Characterization deliver coverage to support:
Power supply/VR stability
Thermal characteristics
Cooling System Qualification
Reliability
These efforts will help to refine TDP spec points. CSPs can then plan power capacity more efficiently.
Source: Practices and insights into liquid cooling on Meta's AI training platforms. Authors: Cheng Chen, Yin Hang, Noman Mithani, Chris Malone, Yueming Li, Wenying Zhang, John Fernandes, Kalpak Dhake, Jaret Wyatt, Jarrod Clow, Darron Young

Look into AI accelerator Power Definition
[Figure: accelerator power versus time, marking Pmax/EDP, peak power, average power, and kernel duration. Annotations: 0.90x with model assumptions Temp: 85C, Compute, TT part; 0.90x with model assumptions Temp: 85C, Compute, CIP, TT part.]

Monitoring and Telemetry
Sideband report: accelerator, module, system platform, and rack level; power, thermal, current, and voltage sensors; errors.
Inband report:
IO bandwidth, throughput, latency
Processing core utilization, performance counters
Error statu