Eddie Wai
Enable GDA Support in a GPU-agnostic Manner for Open AI Systems

What is IBGDA and how does it help?
- IBGDA is an extension to the GPUDirect family
  - GPUDirect enables direct GPU memory placement from a peer device
  - GPUDirect Async enables a GPU to directly initiate the transfer
  - IBGDA allows the GPU not only to initiate the transfer but also to carry out the transfer work
- IBGDA improves message rates for small messages
  - As compared to the CPU proxy conduit used without IBGDA
- IBGDA optimizes the Prefill-Decode (PD) disaggregation phases used for inference
  - PD disaggregation splits the Prefill and the Decode work into separate GPU nodes
  - IBGDA helps to improve the communication efficiency between these separate nodes

Non-GDA Fast Path
1. GPU produces data in HBM
2. GPU writes a work request (WR) to the proxy buffer
3. CPU calls into the NIC userlib to post_send
4. NIC userlib translates the WR to a work queue entry (WQE) and rings the doorbell (DB)
5. NIC reads the WQE from the send queue (SQ)
6. NIC DMAs payload data from GPU memory
7. NIC sends data over the network
8. NIC updates the send completion queue (SCQ)
9. CPU polls the completion queue (CQ) for completion
10. CPU notifies GPU of completion

GDA Fast Path (no CPU intervention)
1. GPU produces data in HBM
2. NIC kernel directly writes a WQE to the SQ
3. NIC kernel rings DB
4. NIC reads the WQE from the SQ
5. NIC DMAs payload data from GPU memory
6. NIC sends data over the network
7. NIC updates the SCQ
8. NIC kernel polls CQ for completion

GDA Results
- All-to-all latency, GPUs x Nodes
- Ranks: 8, 16, 32
- RC latency increases proportionally as the number of ranks increases: 22 us, 28 us, 51 us
- GDA latency increases minimally up to the number of parallel executed SMs: 32 us, 33 us, 35 us

- Non-standard APIs
- Parallel Execution
  - threads, warps/waves, thread blocks
  - Concurrent Work building and Completion handling
  - Synchronization and Locking mechanics
- Memory coherency issues
- Atomic Ops support in some GPUs
- Doorbell Upd