Building AI Architecture: Scalable Networking for Next-Generation AI Servers, Racks, and Clusters (PDF, 24 pages)
Architecting the AI Fabric
Jalpa Patel, Technical Program Manager, Meta

Agenda
AI clusters:
- Larger AI workloads
- Software requirements
- Hardware and Network requirements
- Data Center
- Challenges ahead of us

Running larger AI workloads (Llamas, scale)
- Software Infra
- Hardware and Network Infra
- DC Infra

Software
Running larger AI workloads (Llama, scale)
- Job Scheduling
- Checkpointing
- Fault Tolerance

Model Distribution on GPUs
[Diagram labels: Tensor Parallel, Pipeline Parallel, Data Parallel, Synchronization, GPU]

Mitigating the Impact of Network Latency
1. Technical content is desired
   - Find the model sharding combination that is least sensitive to network latency
   - Co-design model sharding with network latency and routing artifacts
2. Modeling, simulation, and validation
   - Topology-aware model parallelism assignment
   - Topology awareness in the job scheduler and in model parallelism assignment
3. New collective algorithms
   - Collective library changes, topology awareness

Mitigating the Impact of New Collective Algorithms
New collective algorithms cause:
- More congested, new collective patterns within the building
- A lot more data across buildings, so routing needs to be perfect
This means network routing efficiency needs to be higher than it is today.
Two directions of solutions:
- Packet spraying and reassembly
- Collective-software-based load balancing

Scale: Hardware and Network Infra
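The slides on mitigating network latency call for a model sharding combination that is least sensitive to latency and for topology-aware model parallelism assignment. One way to picture this is the sketch below, which is not taken from the deck. It assumes that tensor-parallel traffic is the most latency-sensitive and so should stay inside one host, and that ranks are numbered host by host. The function names (`build_groups`, `groups_along`, `hosts_spanned`) and the sizes used are hypothetical.

```python
def build_groups(world_size, tp, pp):
    """Map each global rank to (dp, pp, tp) coordinates, with TP innermost.

    Assumption for illustration: ranks are numbered host by host, so the
    innermost dimension lands on neighbouring GPUs.
    """
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    dp = world_size // (tp * pp)
    coords = {}
    for rank in range(world_size):
        tp_idx = rank % tp
        pp_idx = (rank // tp) % pp
        dp_idx = rank // (tp * pp)
        coords[rank] = (dp_idx, pp_idx, tp_idx)
    return dp, coords


def groups_along(coords, axis):
    """Collect ranks that differ only along `axis` (0=dp, 1=pp, 2=tp)."""
    groups = {}
    for rank, c in coords.items():
        key = c[:axis] + c[axis + 1:]
        groups.setdefault(key, []).append(rank)
    return [sorted(g) for g in groups.values()]


def hosts_spanned(group, gpus_per_host):
    """Number of distinct hosts a communication group touches."""
    return len({rank // gpus_per_host for rank in group})


if __name__ == "__main__":
    GPUS_PER_HOST = 8  # hypothetical server size
    dp, coords = build_groups(world_size=64, tp=8, pp=2)
    for name, axis in (("tensor", 2), ("pipeline", 1), ("data", 0)):
        spans = {hosts_spanned(g, GPUS_PER_HOST) for g in groups_along(coords, axis)}
        print(f"{name}-parallel groups span {sorted(spans)} host(s)")
```

With this ordering, every tensor-parallel group stays on one host, while pipeline and data-parallel groups cross hosts and therefore see the network fabric. A scheduler that knows the real topology could apply the same check at rack or building granularity.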
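The deck names packet spraying and reassembly as one direction for raising routing efficiency. The toy comparison below is a sketch under stated assumptions, not Meta's implementation. It contrasts per-flow hashing, as in ECMP, with per-packet spraying across equal-cost links. The flow tuples, packet counts, the CRC32 hash, and the round-robin spraying policy are all hypothetical choices made for illustration.

```python
import zlib
from collections import Counter


def ecmp_link(flow, n_links):
    """Per-flow hashing: every packet of a flow follows the same link."""
    key = "|".join(map(str, flow)).encode()
    return zlib.crc32(key) % n_links


def ecmp_load(flows, n_links):
    """Packets per link when each flow is pinned to one hashed link."""
    load = Counter({link: 0 for link in range(n_links)})
    for flow, n_packets in flows.items():
        load[ecmp_link(flow, n_links)] += n_packets
    return load


def spray_load(flows, n_links):
    """Packets per link when packets are sprayed round-robin over all links."""
    load = Counter({link: 0 for link in range(n_links)})
    seq = 0
    for n_packets in flows.values():
        for _ in range(n_packets):
            load[seq % n_links] += 1
            seq += 1
    return load


if __name__ == "__main__":
    N_LINKS = 8
    # Hypothetical workload: a few large flows, keyed by (src, dst, sport, dport).
    flows = {(f"gpu{i}", f"gpu{i + 8}", 4000 + i, 4791): 1000 for i in range(8)}
    ecmp = ecmp_load(flows, N_LINKS)
    spray = spray_load(flows, N_LINKS)
    print("ECMP  busiest link:", max(ecmp.values()), "packets")
    print("Spray busiest link:", max(spray.values()), "packets")
```

With few large flows, hashing can place several of them on the same link while other links sit idle. Spraying evens out the load, but packets of one flow then arrive over different paths and may be reordered, which is why the slide pairs spraying with reassembly at the receiver.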
Running larger AI workloads (Llama)
- Network
- Fleet Health
- HW Health

Avenues of Flexibility: Technology
DSF / NSF
- Forwarding Requirements
- DLB/ECMP Scalability
- Low Latency
- Less Cost
- Easier cabling fit
- Distance Limitations
- VoQ Scalability
- Load-Balance in HW
-