PTP in AI Networks
Ahmad Byagowi, Turba.ai
Amit Oren, Broadcom
Bhaskar Chinni, Broadcom
OCP Special Focus: Artificial Intelligence (AI)

Agenda
Introduction: Need for time synchronization in AI networks
Phantom Jam, Phantom Traffic and Phantom Delay
How TCP determines channel capacity
Use cases & benefits of delay awareness
Test data
Conclusion

Phantom Jam
An emergent behavior of cascading controllers
Source: https:/

How TCP Determines Channel Capacity?
Additive Increase, Multiplicative Decrease
Source: https:/

Potential Solution
Open loop backed with time slices instead of independent controllers

Importance of network for AI workloads
[Diagram: XPUs with attached HBM stacks]
4 x HBM3E (9.6 Tbps): 38.4 Tbps
8 x HBM4 (12.8 Tbps): 102.4 Tbps
Besides improvements in network speeds, efficiency is also important
Efficiency means effective traffic scheduling
One-way latency (OWL) can be an effective tool for the traffic scheduler
OWL requires precision time in all the nodes
Precision time is a product of time synchronization

PTP for Network Efficiency (for OWL capability)
OWL from host A to host B is the time between A's NIC transmit timestamp and B's NIC receive timestamp for the same packet.
Unlike RTT/2, OWL captures asymmetry (different paths or queuing in each direction), which is common in Clos/leaf-spine fabrics.

Why OWL matters in AI workloads:
Collectives (e.g., ring/tree all-reduce) and MoE token routing are barrier-sensitive; tail OWL (p99/p99.9) often controls step time even when average latency is low.
Microbursts (incast to a single ToR egress) can create millisecond-class queueing spikes that dominate p99 OWL.

Production-grade measurement patterns (hardware-assisted):
Clock sync: Use PTP (IEEE 1588/802.1AS) with boundary/transparent clocks so both endpoints' NIC PHCs are al