Mooncake: Optimizing LLM Inference with a Disaggregated Architecture and Trading Storage for Computation

Ke Yang (杨珂), Approaching.AI Tech Expert | Mooncake Core Contributor

CONTENTS
Background: LLM Inference in the Long-context Era
Mooncake: A KVCache-centric Disaggregated Architecture
Mooncake LLM Ecosystem Collaboration

Current Paradigm: Data + Algorithm + Hardware = Intelligence
Algorithm: Transformer is all we need?
Data: Big Data is everywhere
Hardware: Huang's Law takes over
Intelligence: AI becomes everywhere too

The Old Scaling Law Is Slowing Down
But who uses it?
The old scaling law: larger model, more data, growing computing power.
Performance gains from adding more parameters are increasingly limited.
It is becoming difficult to gather enough high-quality data to feed ultra-large models.

Everyone Is Talking about the Scaling Law, but the Real Question Is: What to Scale?
https:/

More Data + Larger Model + Longer Context = Higher Intelligence
Long input (Kimi): In March 2024, Kimi became one of the leading large model services thanks to its strong long-context (long-input) processing capability.
Long output (DeepSeek R1): In January 2025, DeepSeek R1 quickly rose to become one of the most renowned large model services for its strong reasoning (long-output) capability.
Chain-of-Thought
AI applications are evolving from simple chat to complex agent-based systems: from single-turn, short inputs/outputs to multi-turn, complex execution topologies with long inputs/outputs.

More Data + Larger Model + Longer Context = Heavier Workload
Higher inference cost
Longer response time
Lack of computing and memory resources

One of the key bottlenecks in the long-context era: inference costs are skyrocketing
Amazon reports that over 90% of costs come from inference rather than trainin