SCALING LARGE LANGUAGE MODEL TRAINING USING HYBRID GPU-BASED COMPRESSION IN MVAPICH

Aamir Shafi, Research Scientist
Lang Xu, Ph.D. Student
Network Based Computing Laboratory
The Ohio State University
http://nowlab.cse.ohio-state.edu/
2024 OFA Virtual Workshop

Presentation Outline
    Introduction & Background
    Motivation & Challenges
    Hybrid Compression Design
    Performance Evaluation
    Conclusion

Training Large Language Model
    Large Language Models (LLaMA2, GPT4, Claude3) are powerful in various areas (dialogue systems, knowledge bases, ...)
    Model capability scales with the number of parameters (from the 100-million-parameter BERT to the 500-billion-parameter Megatron-Turing NLG)
    Training billion-parameter models requires:
        Parallelism strategies (scaling up to thousands of GPUs)
        Memory optimization (fitting models within GPUs)
        Efficient communication (reducing interconnect bandwidth pressure)

Parallelism Strategies
    Data Parallelism (DP):
        Maintains a full model replica on each DP rank and takes a mini-batch as input
        Data-intensive gradient synchronization using Allreduce
    Pipeline Parallelism (PP):
        Shards model layers across devices and executes them in pipeline order
        Point-to-point communication passes activations and gradients
    Tensor Parallelism (TP):
        Distributes matrix multiplication over different devices
        Frequent Allreduce and Allgather communication ensures correctness
    3D Parallelism combines DP + PP + TP (Megatron-LM)

Memory Optimization
    DeepSpeed ZeRO Optimizer:
        A novel memory optimization technology for large-scale distributed deep learning
        Enables training models with billions of parameters across GPUs
        Each GPU only upda