3323 - Optimizing Large Language Models for Inference with vLLM and Red Hat AI

Orlando, FL
October 6-9
IBM TechXchange 2025
Session code 3323

Optimizing LLMs for Inference with vLLM and Red Hat AI
Carlos Condado, Sr. Product Marketing Manager
Christopher Nuland, Principal Technical Marketing Manager
Red Hat

What you will learn in this session
01 Inference optimization principles
02 Maximize performance and cut costs
03 Token-based distributed inference: predictable performance
04 Track and meet inference SLOs
05 Red Hat AI: open, enterprise AI platform

IBM TechXchange | 2025 IBM Corporation

Generative AI is transforming industries, but inference-related processes increase complexity and costs

The AI promise vs. the operational reality

The ripple effect across your teams and business
Slow innovation
Missed revenue opportunities
High costs
Managing siloed solutions
Limited scalability
Deployment friction
Underperforming models
Unreliable experience
The orchestrators and builders of AI apps and agents

Inference optimization principles
High-performance inference runtime
Quantized models
Fast and cost-effective inference

vLLM is the inference runtime for the hybrid cloud
Accelerators: Neuron, TPU, Gaudi, Instinct, GPU, Spyre
Models: Llama, Qwen, DeepSeek, Gemma, Mistral, Molmo, Phi, Nemotron, Granite
Environments: Edge, Private Cloud, Physical, Virtual, Public Cloud

OpenAI introduced gpt-oss
On Aug 5th, 2025, vLLM had Day 0 support for gpt-oss on NVIDIA and AMD GPUs

Meta introduced Llama 4
On April 5th, 2025:
1. vLLM had Day 0 support for Llama 4
2. Meta quantized the FP8 version using Red Hat's open source LLM Compressor

vLLM is the inference runtime for the hybrid cloud
Native Hugging Face integration
Simple APIs for online and offline inference
OpenAI-compatible API protocol
Advanced algorithms for high QPS serving
Single server/GPU to distributed/multi GPU
KV cache optimizatio