PCI Express HW Fault Management (RAS) Solution Implementation Considerations in Meta's AI/ML Training Clusters

Hardware Management

Anil Agrawal, HW Systems Engineer, Meta
Gada Badeer, HW Systems Engineer, Meta

Agenda
AI/ML Cluster Overview
Fault domains
Error handling
Call to action

AI/ML: Artificial Intelligence/Machine Learning

AI/ML Cluster - Overview
AI/ML Cluster 30K ft view
Reference: https:/

Cluster - Platform View
OAM: OCP Accelerator Module
Compute Nodes

AI/ML Cluster - PCIe Hierarchy
Example: Just one slice of the entire PCIe device hierarchy

B:D.F root_port, slot#, device present, speed 8GT/s, width x16
B:D.F upstream_port, PCIe Switch
B:D.F downstream_port, slot#, device present, speed 16GT/s, width x16
B:D.F upstream_port, PCIe Switch
B:D.F downstream_port, slot#, device present, speed 16GT/s, width x16
B:D.F endpoint, OAM
B:D.F endpoint, PCIe Switch
B:D.F downstream_port, slot#, device present, speed 16GT/s, width x16
B:D.F upstream_port, PCIe Switch
B:D.F downstream_port, slot#, device present, speed 16GT/s, width x16
B:D.F endpoint, OAM
B:D.F downstream_port, slot#, device present, speed 8GT/s, width x4
B:D.F endpoint, SSD#, NVMe SSD Controller
B:D.F downstream_port, slot#, device present, speed 16GT/s, width x16
B:D.F endpoint, RNIC#, RDMA NIC
B:D.F downstream_port, slot#, device present, speed 16GT/s, width x16
B:D.F endpoint, RNIC#, RDMA NIC
B:D.F downstream_port, slot#, device present, speed 8GT/s, width x4
B:D.F endpoint, SSD#, NVMe SSD Controller
B:D.F downstream_port, slot#, device present, speed 8GT/s, width x4
B:D.F endpoint, SSD#, NVMe SSD Controller
B:D.F downstream_port, slot#, device present, speed 8GT/s, width x4
B:D.F endpoint, SSD#, NVMe SSD Controller
B:D.F endpoint, PCIe Switch

A Large PCIe Device Hierarchy Increased Platform Fai