Training Efficiency Optimization for 100-Billion-Parameter LLMs
Speaker: 张力寰, 零一万物 (01.AI) / AI Infra, Distributed Training Optimization Architect

CONTENTS
01 Factors affecting hardware utilization in model training
02 Improving distributed training efficiency
03 FP8 training: lessons learned
04 MoE training: lessons learned
05 Improving Goodput
06 Summary and outlook

Factors Affecting Hardware Utilization in Model Training

Llama 3.1
- A 92-page technical report
- Largest model: 405B parameters
- Training cluster of 16K H100 GPUs
- 54 days of pre-training, with 466 job interruptions
- MFU: around 40%
- Goodput: around 90% of wall-clock time was effective training time
- https://arxiv.org/pdf/2407.21783

MFU (Model FLOPS Utilization)
- FLOPS: Floating Point Operations Per Second
- MFU = model FLOPS actually achieved / theoretical peak FLOPS of the hardware

Goodput
- A metric to measure AI system efficiency (Google)
- Three components: Scheduling Goodput, Runtime Goodput, Program Goodput

Improving Distributed Training Efficiency

Data Parallelism
- Distributed Data Parallel (DDP)
- Fits the case where the model is relatively small and the data volume is large

Tensor Parallelism
- Megatron-LM-1: https://arxiv.org/pdf/1909.08053
- Splits the work evenly, but communication volume is high

Pipeline Parallelism
- GPipe (Google): https://arxiv.org/pdf/1811.06965
- PipeDream (Microsoft): https://arxiv.org/pdf/1806.03377
- Megatron-LM-2: https://arxiv.org/pdf/2104.04473
- Low communication volume, but introduces pipeline bubbles

Expert Parallelism
- Switch Transformers (Google): https://arxiv.org/pdf/2101.03961
- Megatron Expert Parallelism

Context Parallelism
- Ring Attention with Blockwise Transformers for Near-Infinite Context (UC Berkeley): https://arxiv.org/pdf/2310.01889
- [Figure] (a) Outer loop: computing blockwise attention among devices. (b) Inner loop: every device computes blockwise attention and feedforward operations.

Ring Attention
- Basic principle: online softmax
- Performance problem: load imbalance
- Original version vs. load-balanced version

SWA + CP (Sliding Window Attention + Context Parallelism)
- Handling sequences of varying lengths
- How to reuse high-performance attention kernels (e.g., FlashAttention)
- Co-design between the model and infrastructure teams
- How to mix with Full Attention + CP

Other Optimizations
- Overlapping communication with computation
- Distributed optimizer
- GPU memory optimization
- Decoupling TP from its MPI dependency

FP8 Training: Lessons Learned

FP8 Basics
- E4M3: 1 sign bit, 4 exponent bits, and 3 bits of mantissa; range ±448, supports NaN
- E5M2: 1 sign bit, 5 exponent bits, and 2 bits of mantissa; range ±57344, supports ±inf

FP8 Mixed-Precision Training
- Only part of the computation uses FP8
- E4M3 for the forward pass, and for the backward pass
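The MFU figure quoted for Llama 3.1 (~40%) can be approximated from first principles. A minimal sketch assuming the common "6 FLOPs per parameter per token" rule of thumb for dense-transformer training; the function names and the sample numbers below are illustrative, not from the talk:

```python
# Back-of-the-envelope MFU calculation (illustrative sketch, not the
# accounting used in the Llama 3.1 report).

def model_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs for a dense transformer:
    ~6 FLOPs per parameter per token (forward + backward)."""
    return 6.0 * n_params * n_tokens

def mfu(n_params: float, n_tokens: float,
        n_gpus: int, peak_flops_per_gpu: float, seconds: float) -> float:
    """Model FLOPS Utilization: model FLOPS actually achieved divided by
    the cluster's theoretical peak over the same wall-clock time."""
    achieved = model_flops(n_params, n_tokens) / seconds
    peak = n_gpus * peak_flops_per_gpu
    return achieved / peak

# Hypothetical run: 1B params, 1B tokens, 8 GPUs at 1 PFLOPS each, 1000 s.
print(mfu(1e9, 1e9, 8, 1e15, 1000))  # -> 0.75
```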
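Goodput, by contrast, measures the fraction of wall-clock time spent on useful training. A toy model of how the ~90% figure could arise from the reported 466 interruptions, assuming each interruption costs some fixed recovery time (detection, restart, and work lost since the last checkpoint); the per-interruption cost here is hypothetical, not from the report:

```python
# Toy goodput model: effective training time / total wall-clock time.
# The 17-minutes-per-interruption figure is an assumption chosen for
# illustration, not a number from the Llama 3.1 report.

def goodput(total_hours: float, n_interruptions: int,
            hours_lost_per_interruption: float) -> float:
    lost = n_interruptions * hours_lost_per_interruption
    return (total_hours - lost) / total_hours

# 54 days of training, 466 interruptions, ~17 minutes lost each.
print(round(goodput(54 * 24, 466, 17 / 60), 2))  # -> 0.9
```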
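The pipeline-bubble cost mentioned under Pipeline Parallelism has a standard closed form for a synchronous GPipe-style schedule (as analyzed in the Megatron-LM papers): with p pipeline stages and m micro-batches, the idle fraction is (p-1)/(m+p-1). A small sketch:

```python
# Bubble (idle-time) fraction of a synchronous GPipe-style pipeline
# schedule: standard formula, shown here for illustration.

def bubble_fraction(p: int, m: int) -> float:
    """p pipeline stages, m micro-batches: idle fraction = (p-1)/(m+p-1)."""
    return (p - 1) / (m + p - 1)

print(bubble_fraction(4, 1))   # -> 0.75; few micro-batches: mostly idle
print(bubble_fraction(4, 61))  # -> 0.046875; many micro-batches amortize it
```

This is why increasing the number of micro-batches (or using interleaved schedules) is the usual remedy for pipeline bubbles.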
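The "online softmax" that Ring Attention builds on can be shown in a few lines: each block of scores updates a running maximum and a rescaled normalizer, so the softmax can be computed one block at a time, exactly as blocks of keys/values arrive around the ring. A minimal sketch over plain Python lists, not a production kernel:

```python
import math

def online_softmax(blocks):
    """Compute softmax over the concatenation of `blocks`, seeing only
    one block at a time (the trick behind Ring Attention/FlashAttention)."""
    m = float("-inf")   # running maximum
    s = 0.0             # running normalizer: sum(exp(x - m)) so far
    seen = []
    for block in blocks:
        new_m = max(m, max(block))
        # Rescale the old normalizer to the new max, then add this block.
        s = s * math.exp(m - new_m) + sum(math.exp(x - new_m) for x in block)
        m = new_m
        seen.extend(block)
    return [math.exp(x - m) / s for x in seen]

# Matches an ordinary two-pass softmax over the full sequence.
print(online_softmax([[1.0, 2.0], [3.0, 0.5]]))
```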
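The FP8 range figures above follow directly from the bit layouts. E4M3 (the "fn" variant used for training) reserves only the all-ones exponent-and-mantissa pattern for NaN, so its largest finite value is 1.75 × 2^8 = 448; E5M2 keeps IEEE conventions (a whole exponent code for inf/NaN) and tops out at 1.75 × 2^15 = 57344. A small derivation sketch:

```python
# Derive the maximum finite value of a floating-point format from its
# exponent/mantissa widths (illustrative; covers E4M3fn and E5M2).

def max_finite(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent code is reserved for inf/NaN (as in E5M2).
        max_exp = (2 ** exp_bits - 2) - bias
        max_mantissa = 2 - 2 ** -man_bits           # all mantissa bits set
    else:
        # E4M3fn: only exponent=all-ones AND mantissa=all-ones is NaN,
        # so the all-ones mantissa is unavailable at the top exponent.
        max_exp = (2 ** exp_bits - 1) - bias
        max_mantissa = 2 - 2 ** -(man_bits - 1)
    return max_mantissa * 2.0 ** max_exp

print(max_finite(4, 3, ieee_like=False))  # -> 448.0   (E4M3)
print(max_finite(5, 2, ieee_like=True))   # -> 57344.0 (E5M2)
```

The wider range is why E5M2 is typically preferred where large dynamic range matters more than precision, while E4M3's extra mantissa bit favors precision.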