SRv6 for AI Backend Network
Changrong Wu, Microsoft
Abhishek Dosi, Microsoft

Artificial Intelligence in the Cloud
- AI workloads/applications bring a new traffic pattern to networking:
  - A small number of large flows
  - Periodic bursts of data sent synchronously
- Dedicated backend network for AI
[Figure: three hosts (CPU, NIC, SSD), each with four GPUs, connected through ToR and leaf switches]

Continental-scale GPU Cluster Emerging
- The bar for hyperscale datacenter networks is rising.
- Power supply, physical space, etc. limit the scale of a single DC site.
- The demand for GPU capacity from a single job is set to grow beyond the capacity of a single DC.
- The AI backend network needs to connect geo-distributed GPU clusters at scale.

Challenges
- The solution must be cost-effective and scalable: no proprietary technology.
- Traditional passive hash-based load-balancing mechanisms suffer from a low-entropy problem, since a small number of large flows gives the hash little to spread across paths: more active traffic engineering is needed.
- Failures are inevitable at this scale: fast failover is necessary.
- Multi-path transport is desired for efficient bandwidth utilization: this demands fine-grained path control.

SRv6 in AI Backend Network
- Provides fine-grained network control based on source routing.
- Enables path enumeration for traffic management.
- Integration with AI workload flow scheduling provides optimal network performance.
- Allows the source to quickly reroute upon path failures or congestion.
[Figure: Plane 1, Plane 2, Plane 3, Plane 4]

SRv6 with uSID
[Figure: two NICs connected through switches 01T0, 01T1, 02T1, 180T1, and 224T0. uSID values shown: 0xd001, 0xd002, 0xd100, 0xd0b3, 0xd1e0, and 0x00a0. The uSIDs in the destination address are labeled 1st hop, 2nd hop, 3rd hop, and 4th hop.]
Dst IPv6: fcbb:bbbb:d100:d001:d1e0:00a0:
Dst IPv6 after switching paths: fcbb:bbbb:d100:d002:d1e0:00a0:
- When congestion or failure is detected on 01T1, the NIC switches to a new path by changing the uSIDs encoded in the packet header (here the 2nd-hop uSID changes from d001 to d002).
- No SRH.
Segment Identifiers (SIDs) are configured