JM Hands, Matt Roman, Marc Austin
Design, Build and Test an OCP AI Cloud Network with Industry Leading Performance
NETWORKING
OCP SPECIAL FOCUS: ARTIFICIAL INTELLIGENCE (AI)

Panel Discussion
JM Hands, CEO, FarmGPU
Matt Roman, Sr Director, PLM, Celestica
Marc Austin, CEO, Hedgehog

Design
OCP Networking
Celestica DS5000 800GbE Switch
2U 64-port 800GbE Data Center Switch
AI/ML & Big Data Analytics
Hyperscale Data Centers & Cloud Computing
High-Performance Computing (HPC)
Network Backbone (800GbE Data Center Leaf/Spine)
OCP Networking Software

Build
17 Day Crash Course on AI Networking
Timeline (17 Days): Equipment Ordered, NCCL Test, Equipment On Site, Optics Issue, Lots of Collaboration, Go Live, Sold Out. Dates shown: May 23, Jul 16, Jul 17, Aug 1, Aug 15, Aug 20.

AI Network is a Lot More Than a Switch

| Component | Lesson Learned | Better Next Time |
| --- | --- | --- |
| Cabling | Easy to make mistakes, different types of MPO, dust, etc. | Use host and switch software to confirm cabling |
| Optics | Very little interoperability, need to validate EVERY optic with switch | Validate BOM to ensure compatible optics. Management software to provide detailed optics status. Software to identify anomalies. |
| BIOS | Disable IOMMU and PCIe ACS for max performance on NCCL | Management software to validate host BIOS settings |
| OS kernel | Blackwell NVIDIA driver workaround for Ubuntu 24.04/Kernel 6.8 | Management software to validate versions and check known issues |
| Drivers | Mellanox OFED drivers, RDMA setup, Blackwell support | Management software to automate configuration of host networking |
| Kernel module | nvidia-peermem, DOCA | (See above) |
| NIC | MST tools, disable autoneg, 400G force link, tur