Distributed inference on AWS: Deep dive into inference optimizations (PDF, 70 pages)
Distributed inference on AWS: Deep dive into inference optimizations
AIM353

Aman Shanbhag, GenAI Specialist Solutions Architect, WWSO AIML Frameworks, AWS
Keita Watanabe, GenAI Specialist Solutions Architect, WWSO AIML Frameworks, AWS

© 2025, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Challenges in LLM Inference
User: Is apple a fruit?
ChatBot: Yes it is.

Tokenization: Is | apple | a | Fruit?
Text generation: Yes | it | is | EOS
Detokenization: Yes it is
User: Is apple a fruit?
ChatBot: Yes it is
https:/

Challenges in LLM Inference
Developer:
What exactly is going on?
Which hardware to use?
How can I streamline?
How to scale big model inference?
https:/

Accelerated computing portfolio: NVIDIA GPUs and AWS ML accelerators
GPUs: B200, H200, H100, A100, L4, L40S, A10G, T4
GPU instances: G5, P4de, G6, P4d, P5, P5e, P6-B200, G6e, P5en, P6e-GB200
AWS ML chips: Trainium accelerator, Inferentia accelerator
AWS ML chip instances: Inf2, Trn1, Trn2, Trn3

LLM Inference Optimization Strategies
Categories: Model Architecture Optimization, KV Cache Management, Distributed Inference, System Optimization
Techniques: Quantization, Attention Mechanism, Operator Fusion, Scheduling & Batching, Data Parallel, Pipeline Parallel, Tensor Parallel, Context Parallel, Disaggregated Serving, Expert Parallel

Basics of Text Generation Inference
Text Generation Inference
Transformers Architecture
Prefill vs. Decode and KV Cache
Key Metrics
Memory/Compute Requirement
Roofline Model
Agend