CSIG: Congestion Signaling in the AI era
Abhiram Ravi (Google), Jai Kumar (Broadcom)
OCP SPECIAL FOCUS: ARTIFICIAL INTELLIGENCE (AI)

AI Workloads: Era of Extreme Network Demands

Continuing trends in the AI era:
- Horizontal scaling is inevitable
- Extreme reliability, performance and efficiency requirements for scale-up and scale-out networks serving AI workloads
- AI workloads are extremely bandwidth-hungry and tail latency-intolerant

New norms for network congestion in AI workloads:
- Massive, synchronized bursts that amplify as the network fabric scales
- Congestion events that manifest at sub-millisecond timescales on network switches
- Predictable and repeating patterns of short-lived congestion

Many control loops operate at different timescales to:
- Efficiently utilize available network capacity at fine-grained timescales
- Enable tight guarantees on tail latency and throughput for collectives

Congestion control, load balancing, multipathing, scheduling, traffic engineering, provisioning

Central observation: Accurate and fine-grained congestion signals needed for observability and control

High-resolution network signals are necessary

Accurately detecting congestion locally on a switch requires signal measurements at sub-millisecond timescales

[Figure: Port Tx utilization (1 second resolution) vs. Port Tx utilization (100 microsecond resolution)]

Real-world example from a GPU ToR at Google: Shifting from 1-second to 100-µsec telemetry exposes the fine-grained, repeating congestion patterns and idle gaps inherent to AI workloads

line-ra