Jai Kumar (Broadcom)

Scale Up and Scale Out AI Fabrics
A Polymorphic Architecture for Converged Ethernet Fabric: a "System of Systems"

Introduction
- The size of deep neural networks has grown tremendously because of generative large language models (LLMs).
- Model distributed inference (MDI) is used, and it is challenging because of the autoregressive nature of generation.
- KV caching and Grouped Query Attention (GQA) are being distributed as well.
- All of this demands a scaled scale-up fabric for inference.
- This fabric can also be used for learning (with RMA semantics).
- This means that memory semantics need to work across a large scale-up domain and coexist with the network semantics of the scale-out domain.
- The question we try to answer: how do we create a polymorphic architecture that addresses the competing nature of scale-up and scale-out fabrics?

Scaling Scale Up Fabric
- A high-speed, low-latency fabric designed to interconnect GPUs/accelerators within a single server or rack-scale system.
- Memory semantics: GPU compute kernels directly perform data transfers to remote GPU HBM using load/store/atomic instructions, as if the memory were local to the GPU.
- Achieves the lowest-latency communication between GPUs (a 128 B write can land in another GPU's LLC/HBM in under 1 µs).
- Less die area and power (bridging from the NoC to the link is very thin, with a 1:1 mapping and less data movement).
- Usually composed of a high-radix switch within a rack-scale system.
- The memory wall is a real issue.
- Create a large scale-up domain to break the memory wall and provide additional compute.
- Load/store/atomics still need to operate across different physical memory domains.
- Latency is still important.
- Fabric bandwidth is even more important for exchanging intermediate activation vectors.
- Need to worry about congestion and multipathing.

[Figure: single-tier and two-tier scale-up topologies; labels: Tier 2 switch 1 ... Tier 2 switch N, load/store domain]

Converged Etherne