Orlando, FL
October 6-9
IBM TechXchange 2025
Session 1408

IBM Storage Ceph in the World of AI/ML Workloads
Kyle Bader
Chief Architect, Data and AI, Ceph
IBM Storage

Agenda
Heterogeneous inference
Prefill/decode disaggregation
KV caching (enabler): optimize TCO

IBM TechXchange | 2025 IBM Corporation

What is inference KV caching?
"GPT, can you summarize this document?" (PDF)
IBM Granite 3 paper: 30k tokens
Attention is All You Need paper: 8k tokens
Prefill: build tensors representing prompt context
Decode: generate response to prompt

Prefill rate decay
[Figure: prefill rate (tokens/second) over time t, showing the prefill and decode phases, the tensor and its tokens, and time to first token]
Update weights across all layers with each new token, reducing prefill rate
O(n^2) attention complexity

LMCache Block
[Figure: a long context divided into cache block 1 and cache block 2]
LMCache default cache block size is 256 tokens

KV Cache Size Calculator
Selected Model: Qwen/Qwen3-32B
Hidden Size: 5120
Number of Attention Heads: 64
Number of Hidden Layers: 64
Number of Key-Value Heads: 8
Head Size: 80 (Hidden Size / Attention Heads)
Data Type Size: 2 bytes
Total Elements: 2 × 64 × 256 × 8 × 80 = 20971520
Total Bytes: 20971520 × 2 = 41943040 bytes
KV Cache Size: 41943040 / 1024^3 ≈ 0.0391 GB
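The calculator arithmetic above can be reproduced with a short Python sketch. It follows the slide's formula as written, including deriving head size as hidden size divided by attention heads. The function name `kv_block_bytes` and the 30k-token extension at the end are illustrative assumptions, not part of LMCache or the original deck.

```python
import math


def kv_block_bytes(num_layers, num_kv_heads, head_size,
                   block_tokens=256, dtype_bytes=2):
    """Bytes needed to store keys and values for one cache block.

    Hypothetical helper that mirrors the slide's calculator.
    """
    # The leading 2 accounts for the separate key and value tensors.
    elements = 2 * num_layers * block_tokens * num_kv_heads * head_size
    return elements * dtype_bytes


# Values taken from the slide for Qwen/Qwen3-32B.
hidden_size = 5120
attention_heads = 64
head_size = hidden_size // attention_heads  # 80, derived as on the slide

block_bytes = kv_block_bytes(num_layers=64, num_kv_heads=8,
                             head_size=head_size)
block_gb = block_bytes / 1024**3

# Assumed extension: how many 256-token blocks a 30k-token prompt spans.
prompt_tokens = 30_000
num_blocks = math.ceil(prompt_tokens / 256)
prompt_cache_bytes = num_blocks * block_bytes

print(f"one block: {block_bytes} bytes ({block_gb:.4f} GB)")
print(f"{prompt_tokens} tokens: {num_blocks} blocks, "
      f"{prompt_cache_bytes} bytes")
```

Under these inputs a single 256-token block comes to 40 MiB, and a prompt the size of the 30k-token Granite 3 paper would span 118 blocks.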
Space for time
[Figure: timeline t comparing computed prefill followed by decode with a cache load followed by decode; the difference is marked as speedup]
If cache blocks can be loaded from storage faster than they can be computed, we reduce time-to-first-token.

Parallel
[Figure: timeline t comparing computed prefill followed by decode with several cache loads running side by side; the difference is marked as speedup]
Computed prefill progresses sequentially.
We can load smaller cache blocks in parallel to further reduce the time-to-first-token.

Architecture
vLLM: requests cache blocks
Dynamo: manage KV cache blocks, cache logic
NIXL: high performance IO layer
Ceph RGW: cache block persistence
S3 via obj backe