Advancing Hyperscale AI Fleet Quality Through Standardized Debug, Diagnostics and RAS

Rama Bhimanadhuni, Microsoft
Jicksen Joy, Google
Taniya Siddiqua, AMD

HARDWARE MANAGEMENT

Standardization for Faster Adoption and Fleet Quality

Supplier benefit:
- Engage seamlessly with multiple hyperscalers
- Reduce engineering overhead

Hyperscaler benefit:
- Onboard diverse supplier systems: GPUs (3P and 1P), CPUs (x86 and ARM)
- Accelerate innovation and fleet quality

AI Infra Fleet Quality Challenges

Pain Points                                              Strategic Focus Areas
Single node failure causes large blast radius            Reliability, Availability, Serviceability
Difficulty meeting 95% NIS (Nodes In Service) metrics    Diagnostics improvements
Insufficient crash and debug dumps for RCA               Fleet-scale debug features
Lack of standards leads to engineering toil and delays   OCP-based standardization

OCP Standardization for Diag, Debug and RAS

Hyperscale CPU Management
- OCP CLA Work Groups: Formed by AMD, Google and Microsoft; expanded with ARM, Intel and Meta.
- Standardization Initiative: First industry effort to standardize Diagnostics and Debug requirements.
- Contributions and Progress: Version 0.5 in May 2025; version 0.7 in Oct 2025.

Hyperscale GPU Management
- OCP CLA Work Group: AMD, Google, Meta, Microsoft, NVIDIA
- Standardization Initiative: First industry effort to standardize GPU RAS requirements.
- Contributions and Progress: Published version 1.0 in Oct 2024; version 1.7 release planned in Oct 2025.

Challenges with Current Industry Diagnostics Implementation
- High engineering overhead: integrating custom diagnostics to support diverse supplier components
- Complex integration: tools have hidden dependencies and assumptions about HW/FW/SW (e.g., kernel ve