© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.

AIM368
Unifying AI/ML Operations with SageMaker HyperPod and Amazon EKS

Alex Iankoulski
Principal Specialist Solutions Architect, GenAI
AWS

Apoorva Kulkarni
Principal Specialist Solutions Architect, Containers
AWS

Introduction

Agenda
SageMaker HyperPod and Amazon EKS
Use Cases
Deployment
Compute, network, storage
Demo
Interactive discussion throughout

AI Compute Demand
Data source: Epoch.AI (2025)
1 FLOP = a single floating-point operation
1 teraFLOP = 1 trillion (10^12) FLOPs
1 petaFLOP = 1 quadrillion (10^15) FLOPs
1 exaFLOP = 1 quintillion (10^18) FLOPs, or 1,000 petaFLOPs
1 zettaFLOP = 1 sextillion (10^21) FLOPs, or 1,000,000 petaFLOPs
1 yottaFLOP = 1 septillion (10^24) FLOPs, or 1,000,000,000 petaFLOPs
Total petaFLOP (chart label)
100 billion petaFLOPs = 100 yottaFLOPs

AI Workloads
WORKLOAD | PRE-TRAINING | FINE-TUNING | INFERENCE
Outcome | New foundation models | FM adapted to proprietary data | End-user AI applications
GPU hours needed | Tens of millions | Thousands to millions | Thousands to millions of requests
Needs | Massive amounts of scalable, performant, and resilient compute | Turn-key solutions for launching tuning jobs and acquiring compute | Low latency, high-throughput, and dynamic model-serving

Hardware failures disrupt AI workloads and increase costs
GPU instances are susceptible to hardware failures (memory, thermal, bus, etc.)
Hours or days