© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.

AIM414
Performance engineering on Neuron: How to optimize your LLM with NKI

Scott Perry
Principal Solutions Architect, AI/ML Performance
Annapurna Labs, AWS

Sadaf Rasool
Solutions Architect, AI/ML Performance
Annapurna Labs, AWS

Innovating at the silicon level
AWS AI Chips: AWS Trainium, AWS Inferentia

AWS AI Chips for Generative AI
- AWS Inferentia: Deep learning models
- AWS Inferentia2: Medium to large-scale inference: LLMs, multi-modal models
- AWS Trainium: Medium to large-scale training and inference: LLMs, multi-modal models
- AWS Trainium2: Training and inference for Gen AI models
- AWS Trainium3: Next-gen agentic, reasoning, and video generation applications

NeuronCore Architecture
[Diagram labels: HBM, NeuronCore, GPSIMD Engine, Scalar Engine, Vector Engine, Tensor Engine, DMA Engines, PSUM, SBUF, Host (CPU) Memory]

Memory Hierarchy
- SRAM: Size: MBs; Bandwidth: 10TB/s
- Accelerator HBM: Size: 10s GBs; Bandwidth: TB/s
- Host Memory: Size: 10s GBs to TBs; Bandwidth: GB/s

How do we improve performance?
- Pipeline operations
- Minimize data movement
- Maximize data throughput
[Figure: timeline rows labeled "Collectives" and "compute/data ops", each drawn against time]
[Figure: plot of Performance vs. Arithmetic Intensity (ops/byte), with a "compute bound" region labeled]

Neuron Developer Stack
ML Developers, Data Scientists, Performance Engineers
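
The Performance vs. Arithmetic Intensity (ops/byte) plot referenced in the "How do we improve performance?" slide can be made concrete with a small calculation. The sketch below is not from the talk: it is plain Python, it does not use NKI or any Neuron API, and the peak compute and memory bandwidth figures are placeholder assumptions rather than Trainium or Inferentia specifications. It estimates the arithmetic intensity of a matrix multiply and checks whether it would fall in the compute bound region under a simple roofline-style model.

```python
"""Roofline-style estimate: is a matmul compute bound or memory bound?"""

# ASSUMPTION: illustrative placeholder numbers, not real accelerator specs.
PEAK_OPS_PER_S = 100e12            # hypothetical peak compute (ops/s)
MEM_BANDWIDTH_BYTES_PER_S = 1e12   # hypothetical memory bandwidth (bytes/s)


def matmul_arithmetic_intensity(m, k, n, bytes_per_element=2):
    """Ops per byte for C[m, n] = A[m, k] @ B[k, n].

    Assumes each matrix crosses the memory interface exactly once
    (read A, read B, write C) and 2-byte elements by default.
    """
    ops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return ops / bytes_moved


def attainable_ops_per_s(intensity,
                         peak=PEAK_OPS_PER_S,
                         bandwidth=MEM_BANDWIDTH_BYTES_PER_S):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak, intensity * bandwidth)


def is_compute_bound(intensity,
                     peak=PEAK_OPS_PER_S,
                     bandwidth=MEM_BANDWIDTH_BYTES_PER_S):
    """True when memory can feed data fast enough to reach peak compute."""
    return intensity * bandwidth >= peak


if __name__ == "__main__":
    shapes = {
        "square matmul (4096 x 4096 x 4096)": (4096, 4096, 4096),
        "matrix-vector (1 x 4096 x 4096)": (1, 4096, 4096),
    }
    for name, (m, k, n) in shapes.items():
        ai = matmul_arithmetic_intensity(m, k, n)
        regime = "compute bound" if is_compute_bound(ai) else "memory bound"
        print(f"{name}: {ai:.1f} ops/byte, {regime}, "
              f"attainable {attainable_ops_per_s(ai):.3e} ops/s")
```

With these placeholder numbers the crossover sits at 100 ops/byte. The large square matmul lands at roughly 1365 ops/byte and is compute bound, while the matrix-vector case is about 1 op/byte and is limited by memory bandwidth. This is one way to read the "minimize data movement" and "maximize data throughput" bullets: either raise ops per byte moved, or keep operands in the faster levels of the memory hierarchy.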