Proactive Link Management in AI Networks - Lessons from Meta
Meta Platforms Inc
Bruno Novais, Production Engineer / Meta
Harshit Gulati (Presenter), Software Engineer / Meta
NETWORKING

Outline
Context
Motivation
Traditional Link Management
Improved Link Management
Call To Action

The Scale of the Challenge
5,000+ Optical Circuits: In a 4k GPU cluster at leaf-spine level
10,000+ Optical Transceivers: Required for connections
100,000+ Total Optics: In large-scale clusters

Impact of Link Failures
Retransmission Required: Increases latency
Performance Degradation: Large impact with spraying of traffic
Job Interruptions: Workloads must restart from checkpoints
Business Impact: Costly downtime

Design Challenges
Breakout Interfaces: Splitting high-speed ports increases failure points. A single 400G port becomes four 100G interfaces with more components.
Fabric Interface Technologies: Complex technology increases the need for better Signal Integrity and Monitoring.
Managed Network Interfaces: Each additional interface requires monitoring and management. The operational blast radius increases during repairs.

Sources of Link Failures
Optical Transceiver Issues: Manufacturing defects or degradation
Software Tuning: Misconfigured parameters or driver tuning
Fiber Contamination: Dust or debris causing signal attenuation
Physical Repairs: Increased complexity during maintenance
Firmware Bugs: Undetected software issues

Traditional Approach to Link Management
Provisioning: Inject traffic from the CPU to verify link stability and the absence of CRC errors
Live: React to link flaps or errors and drain the link
Repair: Repair the link in its current state

Detects bad links only after they have affected training jobs
Determining when to drain is a hard exercise
Marginal links
Example: Flaps once a day
Repeat Offenders
Ex