Accelerating AI with Next-Gen Networking: SONiC Innovations and Scalable Designs
Kamran Naqvi, Chief Network Architect - EMEA, Broadcom

ARTIFICIAL INTELLIGENCE (AI)

Agenda
- AI Networking Fabrics
- What makes Scale-out networking unique
- Clos vs Rail-optimized designs
- SONiC Enhancements for AI Networking
- Ethernet for Scale-up
- Call to Actions

AI Networking Fabrics
[Diagram labels: Enterprise, OOB Network, Frontend Fabric, Backend Fabric, Storage Fabric, Compute, OOB MGMT]

AI Scale-up and Scale-out Networking
[Diagram labels: Scale-up, Scale-out]

What Makes Scale-out Networking Unique
- High bandwidth
- Elephant flows
- Synchronized and bursty traffic
- RDMA dominant traffic
- Training jobs run for long periods of time (hours, days)
- Tail latency impacts job completion time significantly
- Synchronized transmission, immediate link saturation
- Job Completion Time (JCT) is derived from the last flow to complete

"Time Spent in Networking" is Impacted By

"Time Spent in Networking" is Improved By
- In case of link failure, recovery should happen in HW: Zero Impact Failover (ZIF)

"Time Spent in Networking" is Improved By
- Receiver-based credit control can pace senders accurately
- Credit control mechanism can exist on the switch or the endpoint

Broadcom's AI Networking Solutions
- Switch Scheduled
- Endpoint Scheduled
- Endpoint can be: Broadcom NIC, merchant silicon NIC, customer NIC, GPU native Ethernet interface

Ethernet Beats InfiniBand: 10+% Improvement in JCT
[Chart: Bus Bandwidth (Gbps) for InfiniBand and Ethernet at 16MB, 32MB, 64MB, 128MB, 256MB, 512MB and 1024MB; y-axis ticks from 90,00 to 130,00]

Ethernet Provides 30x Faster Failover than InfiniBand
[Chart: Recovery time (microseconds), Ethernet vs InfiniBand]
* Typical industry failure rate.
* Assuming 4K node cluster using 9.2K optic modules

Ethernet is the De-facto AI Network
- Hyperscalers: Ethernet AI fabric 60,000+ 30,000+ 30,000+ 100