Accelerating AI Data Processing At Scale
Gaurav Agarwal, Distinguished Engineer, Marvell
Nilesh Shah, VP Business Development, Zeropoint Technologies
Parvez Shaik, Sr. Director Product Mgmt. and Engg., Rambus
IT INFRASTRUCTURE

AI Scaled Dataset Challenges (Recap)
Many workloads are non-uniform and memory-bound.
Compute scale is bounded by memory:
Hitting the "memory wall": improvements in processing speed far exceed improvements in memory speed
Low bandwidth/capacity per compute core
Scaling up load on the compute processor hits peak memory performance
Bandwidth-constrained / Capacity-constrained
Memory requirements continue to grow:
AI model parameters are growing faster than single-xPU memory
Requires a large number of xPUs
Poor xPU compute utilization
Significantly diverse infrastructure demands of AI workloads:
Tightly coupled memory and compute
Main memory dominates data center cost
Flexible compute and memory are needed for efficient and scalable solutions.
Source/Ref: Dan R, Meta, at OCP 2023

Composable Heterogeneous Compute Components
[Figure: CPU, CXL Accelerators, Memory, Interconnect. Structera A 2504 block diagram: 16 Arm Neoverse V2 cores, 64 MB last-level cache, DDR5 controllers, inline compression, (de-)encryption, DMA, eHSM]

Structera A 2504 compute complex
16 server-class Arm Neoverse V2 cores at 3.2 GHz
12 GB/s memory bandwidth per core
Inline LZ4 compression
XTS 256-bit encryption/decryption
Armv9.0-A and Armv8.5-A A64 instruction sets
Single Instruction Multiple Data (SIMD) & FP
Scalable Vector Extension (SVE/SVE2)
RAS extensions
Cryptographic extension
64 KB L1(i) / 64 KB L1(d) / 2 MB L2 / 64 MB LLC

Scalable Composability w/ CXL Accelerators
[Figure: CPU with CXL accelerators; labels: 200 GiB/s Memory BW (x2), 64 GiB/s (x2)]

                            Host   CXL Accelerators   Overall System
Total Core Count              64                 32               96
Aggregate DRAM BW (GB/s)     400                400              800
Memory BW/Core (GiB/s)      6.25               12.5                -
Aggregate Memory …
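The per-core figures in the table come from dividing aggregate DRAM bandwidth by core count (400/64 = 6.25, 400/32 = 12.5). The following is a minimal Python sketch of that arithmetic. It takes the numbers exactly as printed on the slide, which mixes GB/s and GiB/s labels. It also assumes bandwidth is shared evenly across a pool's cores. The `Pool` class and `combined` helper are hypothetical names used only for illustration. The 32 accelerator cores and 400 of accelerator bandwidth are consistent with two 16-core Structera A 2504 devices at 200 GiB/s each. That pairing is an inference, however, not something stated on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    """Hypothetical model of a compute pool with its own attached DRAM."""
    name: str
    cores: int
    dram_bw: float  # aggregate DRAM bandwidth, in the slide's units

    @property
    def bw_per_core(self) -> float:
        # Assumption: the pool's bandwidth is split evenly across its cores.
        return self.dram_bw / self.cores


def combined(name: str, pools: list) -> Pool:
    # Core counts and aggregate DRAM bandwidth simply add across pools.
    return Pool(name,
                cores=sum(p.cores for p in pools),
                dram_bw=sum(p.dram_bw for p in pools))


# Figures as printed in the table above.
host = Pool("Host", cores=64, dram_bw=400.0)
accel = Pool("CXL Accelerators", cores=32, dram_bw=400.0)
system = combined("Overall System", [host, accel])

for p in (host, accel):
    print(f"{p.name}: {p.bw_per_core:.2f} per core")
print(f"{system.name}: {system.cores} cores, {system.dram_bw:.0f} aggregate")

# Memory-wall illustration: a fixed 400 of bandwidth shared by more cores.
for n in (16, 32, 64):
    print(n, Pool("scaled", n, 400.0).bw_per_core)
```

The last loop illustrates the "low bandwidth per compute core" point from the challenges slide. With memory bandwidth held fixed, each doubling of cores halves the per-core share. Like the slide, the sketch reports no per-core figure for the overall system, because dividing 800 by 96 would assume every core can draw evenly on all DRAM in the system.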