Scaling OCP NIC to 1.6T and beyond for AI
Damien Chong, Hardware Tech Lead, Meta
Hemal Shah, Distinguished Engineer and Architect, Broadcom
SERVER: AI HW SW CO-DESIGN / NIC / HPC

Preview
This presentation discusses:
- NIC in AI Backend Network
- NIC 1.6T+ Characteristics
- Path & challenges to OCP NIC 1.6T and beyond

AI Infrastructure Network Connectivity
Networks: Scale Out Network, Scale Up Network, Internal Connectivity
[Diagram: CPU, XPU, NIC, and NVMe SSD nodes]
- High-Bandwidth: 800G and above
- Large scale: 100K-1M XPUs
- Messaging Semantics
- Ultra Low Latency
- Supports Peer-2-Peer Data Transfer
- XPU-XPU Connectivity
- Memory Semantics

NICs in AI systems
[Diagram: CPUs with NICs; GPUs with NICs]
CPU Front-End Network
- Send & Receive Data
- Pre-process Data
- Schedule Jobs
GPU Scale-Out Network
- Parallel GPU-GPU Compute beyond single rack
- Model Training
* The GPU Scale-Up Network within the rack is typically not handled by a NIC and is not part of the discussion today.

Zoom into Front-End NIC for AI systems
[Diagram: CPUs with NICs]
CPU Front-End Network
- Send & Receive Data
- Pre-process Data
- Schedule Jobs
Medium traffic intensity is satisfied by a 400G/800G NIC, which is well supported by OCP NIC SFF/TSFF.
Next-gen: expand to 1.6T and/or Liquid Cooling

Zoom into Back-End NIC for AI systems
GPU Scale-Out Network
- AI racks are becoming increasingly dense because the scale-up in-rack network provides much higher bandwidth than the scale-out rack-to-rack network
- Dense AI racks also squeeze the area available for the Scale-out network solution
- Desire & demand for high GPU-to-GPU interconnectivity drive high Scale-out bandwidth
* The GPU Scale-Up Network within the rack is typically not handled by a NIC and is not part of the discussion today.
Bandwidth per Area efficiency is Important
[Diagram: GPUs with NICs]

PCIe Gen 6 and above host interface (x16 or x32 or x48 or x64 lanes co