Evaluation of CPO performance and pluggable optics health for reliable AI infrastructure
Authors: Siamak Amiralizadeh, Viral Lowalekar, Abhijit Chakravarty
Siamak Amiralizadeh, Optical Engineer, Meta
Viral Lowalekar, Optical Engineer, Meta
OCP SPECIAL FOCUS: PHOTONICS

Introduction
LLM model sizes continue to increase, driving more compute demand.
I/O bandwidth scaling continues to lag behind compute.
Connectivity in today's AI backend network:
Scale-up: Enabled by copper links that rely on high-speed SerDes capabilities to connect GPUs in a rack.
Scale-out: Through pluggable optics connecting multiple racks together.
Ref.: D. Alduino, “Optics in AI Clusters Meta Platforms Perspective,” presented at OCP Regional Summit, Lisbon, Portugal, Apr. 24-25, 2024

AI Network I/O Challenges
Larger models demand more compute and higher I/O BW.
Any wasted power for I/O means less power available for compute.
Scale-out: Networking is projected to consume a larger share of AI cluster power with current trends.
Scale-up: The fundamental BW x reach limitation of copper poses significant challenges in building larger scale-up domains.
The simultaneous growth in GPU node count and bandwidth pushes the boundaries of rack design and power.
As AI clusters grow larger, link reliability becomes increasingly important and has a significant impact on training efficiency.
(Figure labels: LRO, LPO, ?)
Ref.: S. Amiralizadeh and J. K. Doylend, “AI Networking Challenges a System Perspective,” JSTQE, 2025

Co-packaged Optics Technology
S. Fathololoumi et al., “1.6 Tbps Silicon Photonics Integrated Circuit and 800 Gbps Photonic Engine for Switch Co-Packaging Demonstration,” JLT, 2021
B. G. Lee et al., “Beyond CPO: A Motivation and Ap