Management and Control of UALink Scale-Up Pods (PDF, 13 pages)
Management and Control of UALink Scale-Up Pods
Justin King, AMD
Arun Satyanarayana, Google
Hardware Management

UALink Overview
- UALink connects Accelerators in a scale-up fabric
  - Load, store, and atomic operations
  - Low-latency, high bandwidth
- UAL200 leverages Ethernet for the physical layer
  - Re-use cables, retimers
- UALink defines Data Link, Transport and Protocol layers above the physical layer
[Figure: System Node 0 and System Node 1, each with Host, Acc, HBM, and DDR; other labels: Switch, UALink, CXL/PCIe/XGMI/CHI c2c/etc.]

UALink Pod
A UALink Pod consists of:
- System Nodes with a Host CPU and Accelerators
- Switch Platforms with UALink Physical Switches
UALink Pods are designed to scale up to 1024 accelerators
Wide variety of system designs is encouraged!
[Figure: UALink Pod with UALink Switch Platforms 1-3 (each with a UALink Switch) and UALink System Nodes 1-3 (each with four Acc, two CPU, and one NIC)]

UALink Virtual Pods
- A Virtual Pod (vPod) is the unit of isolation and workload scheduling
- At least one vPod is required for workload scheduling
- A Pod may be partitioned into multiple vPods
- The largest vPod is a full Pod
- vPods are typically created to support multiple tenants and/or different workloads
  - E.g., multiple models for inferencing
- vPods are isolated from one another via routing entries in each Physical Switch
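The vPod rules above can be illustrated with a small sketch. This is a toy model, not the UALink routing-table format: the `Pod` class, its method names, the use of integer accelerator IDs, and the assumption that vPods never share accelerators are all hypothetical choices made for illustration. Only the 1024-accelerator limit and the idea of isolating vPods through per-switch routing entries come from the slides.

```python
from dataclasses import dataclass, field

MAX_POD_ACCELERATORS = 1024  # scale-up limit stated in the slides


@dataclass
class Pod:
    """Hypothetical model: a Pod is a set of accelerator IDs plus its vPods."""
    accelerators: set[int]
    vpods: dict[str, set[int]] = field(default_factory=dict)

    def __post_init__(self):
        if len(self.accelerators) > MAX_POD_ACCELERATORS:
            raise ValueError("a Pod scales up to 1024 accelerators")

    def create_vpod(self, name, members):
        """Register a vPod; assumes (for this sketch) that vPods are disjoint."""
        members = set(members)
        if not members:
            raise ValueError("a vPod needs at least one accelerator")
        if not members <= self.accelerators:
            raise ValueError("vPod members must belong to the Pod")
        for other, existing in self.vpods.items():
            if members & existing:
                raise ValueError(f"{name} overlaps {other}")
        self.vpods[name] = members

    def routing_entries(self):
        """Allowed (source, destination) pairs a switch would be programmed with."""
        return {
            (src, dst)
            for members in self.vpods.values()
            for src in members
            for dst in members
            if src != dst
        }

    def can_reach(self, src, dst):
        """True only if both accelerators sit in the same vPod."""
        return (src, dst) in self.routing_entries()


# Example: a 12-accelerator Pod split into two vPods of different sizes.
pod = Pod(accelerators=set(range(12)))
pod.create_vpod("vpod1", range(0, 4))
pod.create_vpod("vpod2", range(4, 12))
print(pod.can_reach(0, 3))  # same vPod
print(pod.can_reach(3, 4))  # different vPods, so no routing entry exists
```

In this model, isolation falls out of the absence of a routing entry rather than an explicit deny rule, and a single vPod containing every accelerator reproduces the "largest vPod is a full Pod" case.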
[Figure: UALink Pod with UALink Switch Platforms and UALink System Nodes 1-3 (each with four Acc, two CPU, and one NIC), labeled Virtual Pod 1, Virtual Pod 2, and Virtual Pod 3]

Centralized Control and Management
Pod Controller sets up and manages a UALin