© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.

AIM 3338

Checkpointless & elastic training for AI models
Amazon SageMaker HyperPod

Anirudh Viswanathan (he/him)
Sr. Product Manager, Technical
Amazon Web Services

Arun Nagarajan (he/him)
Principal Software Engineer
Amazon Web Services

Antonio Ginart (he/him)
Principal Research Scientist
Salesforce

Agenda
Introduction to Amazon SageMaker HyperPod
Large-scale model training
Elastic training on HyperPod
Checkpointless training on HyperPod
Salesforce AI Research
Takeaways
Resources

Introduction to Amazon SageMaker HyperPod

Amazon SageMaker HyperPod
Scale and accelerate generative AI model development across thousands of AI accelerators

Improved efficiency
Tools for maximizing compute resource utilization, advanced observability, and seamless cluster customization

Reduced time-to-train
Resilience features and distributed training libraries help reduce time to train by up to 40%

Lower costs
Less time invested in hardware maintenance and more efficient cluster engagement reduce FM training TCO

HyperPod benefits

Scalable
Single-spine node topology
Pre-configured EFA for optimal inter-node communication speeds
Flexible paths to securing compute capacity
Rapid cluster scale-up without performance degradation

Resilient
Proactively screen health of inbound nodes
Continuous cluster hardware m