Scaling Large Language Model Serving Infrastructure at Meta.pdf

PDF, 65 pages, 5.37MB

Scaling Large Language Model Serving Infrastructure at Meta
A comprehensive recipe to turn LLMs into LLM serving infrastructure
Ye (Charlotte) Qi, AI Inference, Meta

The AI Gold Rush
(Figure labels: compute, context window, # of parameters)

Inference Scaling and Compound Systems Are Coming
credit: https:/

Background about Myself
... been running model services for 6.5 years
Ads model serving
LLaMa serving
Machine translation research before Meta

We Support Product Backend for Meta AI
500M monthly active users
Behind the making of LLaMa

Question
"Should I run my own LLM services?"

Let's Build This Step By Step
Summarize Charlotte's posts and ask follow-ups
Challenge 1: Fitting
Challenge 2
Challenge 3
Challenge 4

STEP 1: Find a good runtime
Isn't that just grabbing eval code?
Imagine every output token generation triggers one model.forward!
"working on it!"
(Figure labels: prefill, decode)

The Most Basic Features to Search For (Available in All Popular Frameworks)
Continuous Batching
KV Cache

How does KV cache work?
Imagine this sentence being generated by an LLM. KV tensors for yellow parts are cached in GPU memory at 320KiB/tok (LLaMa3-70B), 128KiB/tok (LLaMa3-8B) under bf16.

(Figure: per-request timelines of one prefill step followed by decode steps until eos, shown in two batching layouts labeled "Not this" and "Use this")
Frameworks named on the slide: TGI, TensorRT-LLM

STEP 2: Understand hardware resources
Network: TCP 1x, RDMA 4-14x
(Figure labels: Front End NIC, Back End NIC, CPU, GPU, PCIe, NVLink)

Let's Only Worry About Model Loading
NVIDIA A100: 40/80GB
NVIDIA H100: 80/96GB
AMD MI300X: 192GB

STEP 3: Start fitting some models with 80GB H100 x8

STEP 3: Single-Card Inference
LLaMa3-8B, bf16: 16GB of weights on one 80GB GPU

STEP 3: Distributed Inference: Tensor Parallelism
Partitioning Weights
LLaMa3-70B, bf16: 140GB of weights on 80GB x 2

STEP 3: Distributed Inference: Pipeline Parallelism
Partitioning Weights More
LLaMa3-405B, bf16: 810GB of weights on 80GB x 16, or on 192GB x 8
Or Find GPUs With Bigger
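The GPU counts quoted in Step 3 can be reproduced with simple arithmetic. The sketch below is illustrative and not taken from the slides: it assumes 2 bytes per parameter under bf16, counts weights only (no KV cache or activation memory), and rounds the GPU count up to a power of two, which is an assumption chosen here because it happens to match the slide's 80GB x 16 and 192GB x 8 figures. The function names are hypothetical.

```python
import math

BYTES_PER_PARAM_BF16 = 2  # bf16 stores each parameter in 2 bytes


def weight_gb(params_billions: float) -> float:
    """Weight footprint in GB (10^9 bytes) under bf16."""
    return params_billions * BYTES_PER_PARAM_BF16


def min_gpus(weights_gb: float, gpu_mem_gb: float) -> int:
    """Smallest GPU count whose combined memory holds the weights alone."""
    return math.ceil(weights_gb / gpu_mem_gb)


def next_power_of_two(n: int) -> int:
    """Round n up to a power of two (assumed convention, not from the slides)."""
    return 1 << (n - 1).bit_length()


def gpus_for(params_billions: float, gpu_mem_gb: float) -> int:
    """Weights-only GPU count, rounded up to a power of two."""
    return next_power_of_two(min_gpus(weight_gb(params_billions), gpu_mem_gb))


if __name__ == "__main__":
    for name, params in [("LLaMa3-8B", 8), ("LLaMa3-70B", 70), ("LLaMa3-405B", 405)]:
        print(f"{name}: {weight_gb(params):.0f}GB bf16, "
              f"{gpus_for(params, 80)} x 80GB, {gpus_for(params, 192)} x 192GB")
```

A weights-only estimate is a lower bound: a real deployment also needs room for the KV cache and activations, so the usable headroom per GPU is smaller than these numbers suggest.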

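The per-token KV cache figures quoted under Step 1 (320KiB/tok for LLaMa3-70B, 128KiB/tok for LLaMa3-8B under bf16) follow from the attention shape of each model. The sketch below is a minimal calculation, assuming the layer counts, KV head counts, and head dimension from the publicly released Llama 3 model configurations; those values, and the names `AttnConfig` and `kv_bytes_per_token`, are not part of the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnConfig:
    num_layers: int    # transformer blocks
    num_kv_heads: int  # key/value heads (grouped-query attention)
    head_dim: int      # dimension of each head


# Assumed values from the published Llama 3 configs, not from the slides.
LLAMA3_8B = AttnConfig(num_layers=32, num_kv_heads=8, head_dim=128)
LLAMA3_70B = AttnConfig(num_layers=80, num_kv_heads=8, head_dim=128)


def kv_bytes_per_token(cfg: AttnConfig, bytes_per_elem: int = 2) -> int:
    """Bytes cached per token: one K and one V tensor in every layer."""
    return 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * bytes_per_elem


def kv_cache_gib(cfg: AttnConfig, num_tokens: int) -> float:
    """Total KV cache in GiB for num_tokens cached tokens (bf16)."""
    return kv_bytes_per_token(cfg) * num_tokens / 2**30


if __name__ == "__main__":
    print(kv_bytes_per_token(LLAMA3_8B) // 1024, "KiB/tok")   # 128 KiB/tok
    print(kv_bytes_per_token(LLAMA3_70B) // 1024, "KiB/tok")  # 320 KiB/tok
    print(kv_cache_gib(LLAMA3_70B, 8192), "GiB for 8192 tokens")  # 2.5 GiB
```

At 320KiB per token, a single 8192-token sequence on LLaMa3-70B already occupies 2.5GiB, which is why KV cache capacity competes with model weights for GPU memory.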