Data Center Evolution: Enhancing Trust in Artificial Intelligence (PDF, 27 pages)
Chris Verne
Director, AI & Infrastructure, Google

Data Center Evolution

AI Model Compute Scaling
Explosive growth in deployed ML capacity
Increasing demand for space and power
Performance Growth
Efficiency Growth
[Chart: y-axis 0 to 600; x-axis 2018 to 2028; series: Compute & Storage, AI/ML]

AI power demand requires new power delivery & cooling approaches

Google Contributions to OCP
Mt Diablo 0.5 spec published, enabling 1 MW rack using +/-400 Vdc
Project Deschutes CDU 0.75 spec submitted to OCP, on the portal soon

Google Contributions to OCP
Solving problems together

01 Security
Utilize standards, RTMs, and modularity to build secure systems

Composable Security Architecture

Integrated Root of Trust
Caliptra formed in 2022, led by Google, Microsoft, AMD & NVIDIA
Detects if SOC FW is compromised
Open source HW at CHIPS Alliance; Caliptra 2.0 released with PQC

Layered Open-source Cryptographic Key-management
Requires keys from multiple parties to unlock a storage device
Managed using TCG Opal protocol
0.85 spec available now
Implementation part of Caliptra 2.1
[Diagram labels: User 1, User 2, Admin]

Security Appraisal Framework and Enablement
Standardizes security audits of HW/FW components (e.g., xPUs, SSDs)

02 Resilience

Hidden Enemy: Silent Data Corruption
Bug-free workload produces incorrect results without any indication
2021: "Cores that don't count"
2023: "Training at unprecedented scale invariably surfaces new and interesting systems failure modes"

Uniting Against SDC
Standard test input & output formats, part history, metrics, test framework & flow
Paper on SDC in AI to be published via OCP this month
Use best practices from HPC & Cloud computing for AI
Defines open research questions for academic collaboration
Leveraging Learnings

03 Validation
1.0 specs published
Enabling TTM and higher quality GPUs
Reliable, Seamless GPU
Standard crash dumps and deb