SONiC Scale Up Introduction
Eddie Ruan, Alibaba
Riff Jiang, Microsoft

AI Networks
- In-Rack: Scale-Up Network
- Across Racks: Scale-Out Network
- Across Data Centers

Scale-Up vs Scale-Out

| | Scale-Up Network | Scale-Out Network |
|---|---|---|
| Number of machines | Multiple interconnected nodes | Independent nodes with distributed resources |
| Examples | NVL72, PCIe | Ethernet based AI clusters |
| Communication Characteristics | Low latency; high bandwidth; memory load-store-atomics; smaller transfers | Higher latency; lower bandwidth; message transfer; large size transfers |
| Scalability | Limited by GPU and cluster design | Horizontal scaling |
| Job scheduling | Almost the same | Almost the same |
| Network Performance | 2us RTT | 20us RTT |
| Workload types | Tightly coupled tasks with high inter-node communication | Loosely coupled tasks (e.g., data parallelism) |
| Model size | Very large models that require significant memory | Models that can be split across nodes |
| Memory Architecture | Shared memory | Distributed memory; no shared global memory |
| Parallelism | Required for PP and TP | Best for DP |

Why do we need Scale up?
https://arxiv.org/pdf/2505.09343
Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures
(Figure panels: Scale Out, Scale Up)

SONiC Scale Up WG
https://lists.sonicfoundation.dev/g/SONiC-Scale-Up-WG
- Microsoft, Alibaba
- Invited Tencent and Bytedance to join Scale Up WG
- Weekly Meetings Every Tue 6-7pm PST
https://lists.sonicfoundation.dev/g/SONiC-Scale-Up-WG/wiki/39581

Alibaba's Thoughts
Application View / Chip View
- Large data packet size
- Expect to support 256 GPUs in the cluster with 512 GPUs as a stretch goal.
- Large Bandwidth
- Low end to end latency
- Match HBM access with network bandwidth and packet size
- Maintenance
- Rack Level
- Multi-tenant support
- Enhanced visibility via telemetry
- Air cooled Systems
- Liquid cooled systems

Microsoft's Thoughts
Application View
Network Protocol
Packet sizes 1K 8k
Depending on