Optimizing AI Networks with Advanced Congestion Management
Mohammad Hanif, Ajay Chhatwal
OCP Special Focus: Artificial Intelligence (AI)

Agenda
- Why congestion management matters in AI networks?
- Types of AI networks: scale-out and scale-up
- Congestion management in scale-out networks
  - BTS notifications
  - PFC Aware ECN Marking
  - Packet Trimming
  - CSIG (Congestion Signaling)
- Congestion management in scale-up networks
  - Ethernet for scale-up networking
  - CBFC
- Call to action

Why congestion management matters in AI networks?
- High bandwidth and low latency are needed for optimal job completion times
- Tail latency impacts job completion time significantly
- Traffic is synchronized and bursty
- Elephant flows with low entropy

Scale-up and Scale-out AI Networks
[Figure: scale-up (in rack) versus scale-out (across racks through leaf and spine switches, and across data centers, Datacenter 1 and Datacenter 2)]

Congestion Management in Scale-out Networks
- Back to the Sender (BTS) notifications
- PFC Aware ECN Marking
- Packet Trimming
- CSIG (Congestion Signaling)

Fast CNP Generation in the ToR/Spine Switch
- Instead of ECN marking, upon detecting congestion the switch generates a CNP and sends it directly to the source
- Performance gain
  - Reduces the delay in the congestion control loop (no blocking by PFC)
  - Can send additional information about the congestion (for example, its location and severity) back to the source
- Note that CNP generation is beneficial even at the last hop (DTOR)
  - If the destination NIC sends PFC, it does not block CNP generation
  - The delay of CNP generation at the destination NIC can be high if it is done in firmware
[Figure: path SNIC, STOR, Spine, Spine, DTOR, DNIC, with CNPs sent back toward the source]

- Notification generated by the switch for a congested packet
- Sends the notification back to the sender (BTS) of the original congested packet
- Notifies Node ID with Queue ID and Queue Length where the congestion occur