OCP EMEA Summit
29 April | Dublin, Ireland
Guy Gozlan, proteanTecs

Enhancing RAS in High-Performance Computing with Real-Time Health Monitoring

Semiconductors and RAS
- Physics: smaller geometries, complex architectures
- Software stress: high-performance applications with increasing/changing workloads
- Hyper-competition: less margin in design, less time to test, shorter time to tape-out
- Cost: cannot keep up with demand, refresh delayed (4-6 years); HW needs to last longer, so there is more time for failure
- Operational: lower operational voltages, increased workload demands, unpredictable future workloads
- Scale: high volumes, all connected via system clusters
Resulting in:
- Functional failures
- Silent data corruption
- System-wide errors

Current Approaches
- Slow response
- Lacking location
- Complex and expensive integration
- BIST running only at startup

proteanTecs Multi-Pillar Technology
Native solution, ecosystem agnostic, smart integration
- IP & EDA: on-chip HW monitoring system, integration & implementation
- Software Applications: cloud & edge analytics for actionable insights
- In-Production Testing: On-Cloud (SW), On-Board (SW), On-Tester (SW)
- In-Field: On-Cloud (SW), On-Board (SW), In-Chip (FW)
- Real path monitoring
- High-speed clock sampling
- PPA adherent
- Full embedded HW system

Monitoring Margin to Timing Failure
High-coverage, continuous monitoring of the actual performance-limiting paths with on-chip Agents, at test and in mission mode.
Extremely high coverage of performance-limiting paths.
Sensitive to:
- Workload stress
- Latent defects
- Operating conditions
- DC IR drops & local Vdroops
- Hot spots
- Aging
Legend:
- Sufficient Timing Margin
- Low Timing Margin
- Critically low Timing Margin

Demonstration in 5nm Communication System
This slide will include a 90-second video of the RTHM running in a real customer system (with voiceover by the speaker to explain what we're seeing).
Performance Index
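
The legend on the timing-margin slide distinguishes three levels: sufficient, low and critically low timing margin. As a rough illustration of how a monitoring client might map a measured margin onto those levels, here is a minimal Python sketch. The threshold values, the picosecond unit, and the names `classify_margin` and `worst_level` are all hypothetical; the slides give no numeric limits and do not describe the proteanTecs API.

```python
from enum import Enum


class MarginLevel(Enum):
    """The three legend entries from the timing-margin slide."""
    SUFFICIENT = "Sufficient Timing Margin"
    LOW = "Low Timing Margin"
    CRITICAL = "Critically low Timing Margin"


# Hypothetical thresholds in picoseconds; not taken from the slides.
LOW_THRESHOLD_PS = 50.0
CRITICAL_THRESHOLD_PS = 20.0


def classify_margin(margin_ps, low_ps=LOW_THRESHOLD_PS,
                    critical_ps=CRITICAL_THRESHOLD_PS):
    """Map one measured margin-to-timing-failure value to a legend level."""
    if critical_ps > low_ps:
        raise ValueError("critical threshold must not exceed low threshold")
    if margin_ps < critical_ps:
        return MarginLevel.CRITICAL
    if margin_ps < low_ps:
        return MarginLevel.LOW
    return MarginLevel.SUFFICIENT


def worst_level(margins_ps):
    """Classify a set of monitored paths by their smallest margin."""
    if not margins_ps:
        raise ValueError("at least one margin reading is required")
    return classify_margin(min(margins_ps))


if __name__ == "__main__":
    readings = [80.0, 30.0, 60.0]  # made-up sample readings
    print(worst_level(readings).value)
```

Classifying by the smallest margin reflects the idea that the path closest to a timing failure limits the whole device; a real deployment would presumably tune thresholds per design and operating point.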