Scaling Large Language Model Serving Infrastructure at Meta
A comprehensive recipe to turn LLMs into LLM serving infrastructure
Ye (Charlotte) Qi
AI Inference, Meta

The AI Gold Rush
[Figure labels: COMPUTE, CONTEXT WINDOW, # OF PARAMETERS]

Inference Scaling and Compound Systems Are Coming
credit: https:/

Background about Myself
been running model services for 6.5 years
Ads model serving
LLaMa serving
Machine translation research before Meta

We Support Product Backend for Meta AI
500M monthly active users
Behind the making of LLaMa

Question
“Should I run my own LLM services?”

Let's Build This Step By Step
Summarize Charlotte's posts and ask follow-ups
Challenge 1: Fitting
Challenge 2
Challenge 3
Challenge 4

STEP 1: Find a good runtime
Isn't that just grabbing eval code?
Imagine every output token generation triggers one model.forward!
working on it!
prefill, decode

The Most Basic Features to Search For (Available in All Popular Frameworks)
Continuous Batching
KV Cache

How does KV cache work?
Imagine this sentence being generated by an LLM.
KV tensors for yellow parts are cached in GPU memory at 320 KiB/tok (LLaMa3-70B), 128 KiB/tok (LLaMa3-8B) under bf16.

Continuous Batching
[Diagram: two request timelines labeled "Not this" and "Use this"; each request runs one Prefill step followed by dec (decode) steps until eos]
TGI, TensorRT-LLM

STEP 2: Understand hardware resources
[Diagram labels: Back End NIC, Front End NIC, NVLink, PCIe, CPU, GPU]
TCP 1x, RDMA 4-14x

Let's Only Worry About Model Loading
40/80GB: NVIDIA A100
80/96GB: NVIDIA H100
192GB: AMD MI300X

STEP 3: Start fitting some models with 80GB H100 x8

STEP 3: Single-Card Inference
LLaMa3-8B
bf16: 16GB of weights on one 80GB card

STEP 3: Distributed Inference: Tensor Parallelism
Partitioning Weights
LLaMa3-70B
bf16: 140GB of weights on 80GB x 2

STEP 3: Distributed Inference: Pipeline Parallelism
Partitioning Weights More
LLaMa3-405B
bf16: 810GB of weights on 80GB x 16
bf16: 810GB of weights on 192GB x 8
Or Find GPUs With Bigger