Opportunities and Challenges for Optics in Booming AI/ML Systems (PDF, 20 pages)
Proprietary + Confidential

Optics in Booming AI/ML Systems: a TPU-centric View
Cedric F. Lam, on behalf of the Google Platforms Optics Team
IPEC Workshop of The Boost for AI: Next-Gen Optical Interconnects, OCP 2025 Global Summit
October 13, 12:00-3:00pm, SJCC, California

Celebrating 10 Years of TPU Evolution
- 2015: v1, 1x/chip inference. Internal inference accelerator.
- 2018: v2, 1x/chip, 1x/pod. Distributed shared memory.
- 2020: v3, 3x/chip, 12x/pod. Liquid cooled.
- 2022: v4, 6.6x/chip, 100x/pod. Optically reconfigurable.
- 2023: v5e, 4x/chip inference. Cost-efficiency for large-scale training and inference.
- 2023: v5p, 21x/chip, 750x/pod. Most flexible AI accelerator.
- 2024: Trillium, 100x v2 performance. Enabling the next frontier of AI models.
- 2025: Ironwood. TPU7x: 9,216 chips/pod; TPU7: 256 chips/pod. Cutting-edge chip, largest pod.

The Era of AI Infrastructure
The demand for ML compute is growing exponentially.

The Interconnect Bottleneck
Gap: 3 orders of magnitude (reference: AI and Memory Wall).
AI/ML clusters are large distributed shared-memory computing systems bottlenecked by interconnect bandwidth. High-bandwidth, low-latency, and lossless interconnects (scale-up and scale-out) are required for efficient memory sharing and high-performance AI/ML systems.

The Opportunity
Large clusters with millions of TPU/GPU accelerators are arriving in the industry.
Optical interconnects:
- Scale AI/ML systems beyond the limit of copper links.
- Enable topology innovation.
- Improve system reliability and flexibility.

Optical ICI in TPU Superpods

Optics in TPU Scaling

| Year | TPU | TPU chips per superpod | Topology | ICI bandwidth per TPU chip | ICI optical module | Optical lane rate | OCS |
|------|-----|------------------------|----------|----------------------------|--------------------|-------------------|-----|
| 2018 | v2 | 256 | 2D Torus | 800 GB/s | None | N.A. | None |
| 2020 | v3 | 1024 | 2D Torus | 800 GB/s | 400 Gbps AOC cable | 50G | None |
| 2022 | v4 | 4096 | 3D Torus | 600 GB/s | 400G OSFP | 50G | OCS |
| 2023 | v5p | 8960 | 3D Torus | 1200 GB/s | 800G OSFP | 100G | OCS |
| 2025 | v7 Ironwood | 9216 | 3D Torus | 1200 GB/s | 800G OSFP | 200G | OCS |

Bey