InfiniteHBD: Scale-up Outside the Box
Bobby Lu - Lightelligence

Training Compute is Rapidly Scaling
Source: Jaime Sevilla and Edu Roldán (2024), Training Compute of Frontier AI Models Grows by 4-5x per Year. Published online at epoch.ai. Retrieved from: https://epoch.ai/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year online resource

Multi-dimensional parallelism
Low Communication: Data Parallelism (DP), Pipeline Parallelism (PP), Context Parallelism (CP), Sequence Parallelism (SP)
Intensive Communication: Tensor Parallelism (TP), Expert Parallelism (EP)
xPU to xPU Scale-Up Network
Low Latency: RTT 1TBps

How does the Datacenter Support LLM Training?
Switch-centric: Fat tree style of High Bandwidth Domain (HBD)
Using many high radix switches to provide high bandwidth, perfect uniform, non-blocking any to any communication
Additional computing unit to provide crucial redundancy and serviceability

Major Scale-Up Networks
UALink, SUE, NVLink
Source: semianalysis. Retrieved from: https:/ resource
Source: UALink Specification, UALink_200 Rev 1.0
Source: Scale Up Ethernet Framework Specification, Scale-Ethernet-RM102

Challenges for Switch-Centric Topology
Scalability requires high radix switches
Resource fragmentation
Switch-level fault explosion radius
Unusable Bandwidth degradation
Solution: Disaggregate the aggregator

Transceiver-centric HBD
Unify connectivity and switching by using OCS Transceiver (OCSTrx)

InfiniteHBD
OCS Transceiver
Reconfigurable K-Hop Ring
HBD-DCN Orchestration
C. Shou et al., SIGCOMM '25, September 8-11, 2025, Coimbra, Portugal, https://doi.org/10.1145/3718958.3750468

Module spec
QSFP-DD form factor
8ch TX + 8ch RX
Linear Drive Silicon Photonics Optics
Total BW up-to 800Gbps single directio