Catalina - Meta's latest AI/ML System Overview
LJ Chen, Hardware Design Engineer, Meta

AI CLUSTERS

Outline
1. Overview
2. Catalina Compute Tray
3. Rack Management Controller (RMC)
4. Q&A
5. Call to Action

Overview

Challenges
At Meta, we're pushing the limits of AI innovation by building cutting-edge AI/ML clusters that need to be updated quickly. One of our biggest challenges is managing the complex cooling systems required for these powerful machines. We believe that working together and sharing knowledge is crucial to making progress in this field.

Catalina is a cutting-edge
AI/ML rack that uses NVIDIA's Grace CPU and Blackwell GPU to support large-scale cluster training and inference applications. Our design prioritizes rapid time-to-market, alignment with industry standards, and the use of OCP building blocks. We've also leveraged NVIDIA reference designs where possible to ensure seamless integration.

Catalina marks several significant milestones for Meta, including:
- First Large-Scale Liquid Cooled Deployment
- Air Assisted Liquid Cooling or AALC
- Rack Management Controller or RMC
- ORv3 HPR Rack Deployment
- The Canister, a mechanical chassis that is installed within the ORv3
HPR rack, has rails that allow us to mount 19" trays.

Catalina: Meta's Next-Gen AI/ML Rack

Catalina IT Rack Layout
ORv3 140 kW+ HPR Rack

A single IT rack consists of the following:
- Compute Trays, which include the CPU and GPUs.
- Rack Management Controller (RMC), which monitors for leaks and controls power to the other equipment in the rack.
- NVSwitches, which are used to form the NVLink fabric between the GPUs.
- Cable Cartridges in the rear of the rack, which provide the physical NVLink connections.
- Liquid cooling Manifold, which is used for the AALC liquid connections.
- Wedge400 switches, which are used to connect up the front end ne