Smashing Memory & I/O Limits: Architectures for the GenAI Era
Ramin Farjadrad, CEO, Eliyan Corporation
SERVER: OPEN CHIPLET ECONOMY

Presentation Outline
- Performance Walls
- Power Wall, Bandwidth Walls: Memory & I/O Walls
- SBD: A novel D2D technology for superior Bandwidth/Power
- Implementation, silicon results, comparison
- High-performance D2D interconnects: The solution to Performance Walls
- Advantages of 2.0D D2D: Long reach connectivity on large Std. substrates
- Multi-port Custom HBM (cHBM) to boost Memory & I/O bandwidths
- Performance limitations at system level, beyond the package
- Leveraging long reach 2.0D D2D to enable ultra high-BW AI or Switch ASICs
- GPU ASIC with 32TB/s memory bandwidth & 50T-100Tbps I/O bandwidth
- Switch ASIC with 400Tbps of aggregate bandwidth
- Leveraging SBD technology to double electrical port bandwidth: 224G → 448G
- Call to Action

Top Bottlenecks in AI Performance: Memory Wall & I/O Wall
- Disparity in bandwidth gains vs compute gains is the top bottleneck for performance

Move to Chiplet-based System to Maximize Bandwidth/Power
- Ultra high-performance AI systems need:
  - 10s of Tbps interconnect bandwidth between processor & memory devices
  - The power for these interconnects must be ultra low to be practical
- Chiplet interconnects on a package substrate offer the best bandwidth/power ratio: 1Tbps/W
- AI systems with superior BW/power need very large pkg substrate with as many chiplets as possible

A Memory BW Wall Example: Blackwell 200
- B200 Compute Performance = 10,000 TFLOPS for FP8
- B200 Max Memory Bandwidth = 8TB/s
- Up to 1250 FLOPs per every Byte from Memory
- Even at very high arithmetic intensity of 500*, B200 will be utilized at 50% (=50