Reference Implementation of an SDN Controller for Open Optical-circuit-switched AI Clusters
Kazuya Anazawa, Researcher, NTT Network Innovation Labs., Japan
OPTICAL COMMUNICATION NETWORKS

Background: model size vs. switch capacity
Packet switch capacity: doubling every 2-3 years (source: Broadcom press releases)
Tomahawk 3: 12.8 Tbps (2017.12)
Tomahawk 4: 25.6 Tbps (2020.12)
Tomahawk 5: 51.2 Tbps (2023.3)
Tomahawk 6: 102.4 Tbps (2025-2026)
AI model size: 10-20 times per year! (source: CONTACT Software Blog, "Big bigger giant, the rise of giant AI models")

Introduction of OCSs to AI infrastructure
An OCS (Optical Circuit Switch) is key to realizing scalable and power-efficient clusters as well as CAPEX reduction.
TODAY: Bandwidth-constrained interconnect
Possible solution: Bandwidth-free interconnect by OCSs. Few transceivers required.
[Figure labels: Accelerator Baseboard; 800G TRX; Packet Switch; OCS (Bandwidth Free); Multi-vendor OCS Controller; Today's focus]

Issues on AI infrastructure
Efficient networking among GPUs in a multi-tenant environment is necessary for GPU providers.
No reference model for AI interconnect (both HW and SW).
TODAY: Fixed GPU allocation for boards
Possible solution: Elastic GPU allocation (e.g., using OCSes)
[Figure labels: Accelerator Baseboard; GPU; OCS, Switch; Electrical Packet Switch; High-bandwidth Domain; Lack of good reference model; No longer scales]

Accelerator Baseboard; GPU; OCS, Switch; Electrical Packet Switch. TODAY: Fixed GPU allocation for boards. Possible solution: Elastic GPU allocation (e.g., using OCSes) H