Revisit RoCEv2 issues in large-scale deployment and the future that UEC promises
AMD and Edgecore
NETWORKING

PoWen Tsai, Director Technical Sales, Edgecore Networks
Azeem Suleman, Sr. Director Technical Product Management, AMD

Agenda
01 Problem Statement
02 Product
03 Solution
04 Performance
05 Q&A

AI Scale-out Networking Challenges
- Network Utilization: inefficient GPU-to-GPU communication
- Reliability: link, NIC and switch failures
- Scalability: PFC and Queue Pair stalls; elephant flows sharing links
- Operations: poor telemetry and lack of network state at the CCL
- TCO: requires deep-buffer switches; lack of multi-plane/rail networks

RoCEv2 Requires Improvements for Modern GenAI & HPC Deployments

PFC
- PFC requires at least BW × RTT + MTU of buffering for fully lossless transmission
- Blocked victim flows
- PFC storms

Congestion Control
- Different DCQCN implementations

Coexistence of Different Traffic Types
- The RoCEv2 core design does not natively support different transport protocols for different services.

Security
- Flexibility for end-to-end confidentiality and service protection
- Large session state (keys)

Link-Level Reliability or Network Reliability
- Delays become more significant as scale increases
- Requires error handling at the link layer

Edgecore AIS800 Tomahawk 5 AI Switch
- 51.2 Tbps at 1 W per 100 Gbps
- Best-in-class SerDes that enable LPO
- (OSFP, QSFP) (AFO, AFI) complete portfolio
- Adaptive Routing & Cognitive Routing for all traffic types
- Improved network utilization
- Lowest tail latency
- Programmable out-of-band telemetry (6 ARM cores) and programmable in-band telemetry
- Minimized packet drops and latency jitter

AMD Pensando Pollara 400 AI NIC
- Fully programmable
- Customizable transports
- Offload and acceleration
- PCIe Gen5, 400G
- Scale-out choice
- No fabric dependency

AMD Pensando Pollara 400 AI NIC
P4-based architecture-72