当前位置:首页 > 报告详情

3323 - 使用 vLLM 和 Red Hat AI 优化大型语言模型以进行推理.pdf

上传人: 竿*** 编号:982933 2025-11-29 27页 1.97MB

1、Orlando,FLOctober 69IBM TechXchange 2025Session code 3323Carlos Condado,Sr.Product Marketing ManagerChristopher Nuland,Principal Technical Marketing ManagerRed HatOptimizing LLMs for Inference with vLLM and Red Hat AIWhat you will learn in this session0102030405Inference optimization principlesMaxim

2、ize performance and cut costsToken-based distributed inference predictable performanceTrack and meet inference SLOsRed Hat AI:open,enterprise AI platformIBM TechXchange|2025 IBM CorporationGenerative AI is transforming industries but inference-related processes increase complexities and costsIBM Tec

3、hXchange|2025 IBM Corporation3The AI promise vs.The operational realityIBM TechXchange|2025 IBM Corporation4The ripple effect across your teams and businessSlow innovationMissed revenue opportunitiesHigh costsManaging siloed solutionsLimited scalabilityDeployment frictionUnderperforming modelsUnreli

4、able experienceThe orchestrators and builders of AI apps and agents5Inference optimization principlesInference optimization principlesHigh-performant inference runtimeQuantized modelsFast and cost-effective inference6NeuronTPUGaudiInstinctGPULlamaQwenDeepSeekGemmaMistralMolmoPhiNemotronGraniteSpyrev

5、LLM is the inference runtime for the hybrid cloudEdgePrivate CloudPhysicalVirtual Public Cloud7OpenAI introduced gpt-ossOn Aug 5th,2025 vLLM had Day 0 support for gpt-oss,on NVIDIA&AMD GPUs8Meta introduced Llama 4On April 5th,2025 1.vLLM had Day 0 support for llama 4&2.Meta quantized the FP8 version

6、 using Red Hats open source LLM Compressor9vLLM is the inference runtime for the hybrid cloudNative Hugging Face integrationSimple APIs for online and offline inferenceOpenAI-compatible API protocolAdvanced algorithms for high QPS servingSingle server/GPU to distributed/multi GPUKV cache optimizatio

word格式文档无特别注明外均可编辑修改,预览文件经过压缩,下载原文更清晰!
三个皮匠报告文库所有资源均是客户上传分享,仅供网友学习交流,未经上传用户书面授权,请勿作商用。
根据《Optimizing LLMs for Inference with vLLM and Red Hat AI》的内容,以下是全文关键点的概括: 1. **LLM推理优化**:通过vLLM和Red Hat AI平台,优化大型语言模型(LLM)的推理性能,降低成本。 2. **性能提升**:vLLM支持OpenAI的gpt-oss和Meta的Llama 4,实现快速、高效的推理。 3. **模型压缩**:使用Red Hat的LLM Compressor,通过量化模型减少内存和计算需求,提高效率。 4. **分布式推理**:支持基于token的分布式推理,优化性能和成本。 5. **Red Hat AI平台**:提供开放、企业级的AI平台,支持混合云环境。 6. **社区贡献**:Red Hat与UC Berkeley等机构合作,推动vLLM和LLM Compressor的发展。 7. **模型优化**:提供多种优化后的模型,如Llama 3.1 8B、70B和70B-FP8,平衡性能和成本。 8. **AI代理构建**:Red Hat AI支持使用Llama Stack构建AI代理,并集成到OpenShift AI中。 9. **可扩展性**:通过Kubernetes和vLLM,实现AI工作负载的动态扩展和资源管理。
揭秘性能提升秘诀" 如何降低AI成本?" 构建高效AI应用的利器"
客服
商务合作
小程序
服务号
折叠