Ma Jieyue (马介悦): DLRover Stability Practices for Large-Model Training at the 10,000-Accelerator (万卡) Scale. PDF slide deck, 34 pages.
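Speaker: Ma Jieyue (马介悦)

Agenda
  01 Introduction
  02 DLRover
  03 Flash Checkpoint
  04 XPUTimer
  05 Open Source
  06 Q&A

01 Introduction
  Challenges for AI Infra from End to End
  Pipeline stages: Training Data, Modeling, Pretrain, Finetuning, AI Apps
  Model size growing by scaling law
  Other labels on the slide: NAS, OSS, Online learning

02 DLRover core capabilities
  Elastic training
  Network setup & Precheck
  Training fault tolerance
  Resource management

03 Flash Checkpoint
  Background
  Core features
  Asynchronous persistence
  Resumable saving after interruption

04 XPUTimer
  Hard problems in large-scale training
  XPUTimer core capabilities

  Problem categories: Error, Slowdown
  Problem sources: Algorithm bugs, Infra bugs
  Causes listed on the slide: OS error, GPU error, network error, new algorithm, unnecessary synchronization, unoptimized kernel, memory management, GPU downgrade, network jitter
  Symptoms listed on the slide:
    Start crash/hang, can't finish one step
    Runtime crash/hang
    Slowdown compared to historical or prior training jobs
    Slowdown compared to historical or prior training steps

  [Figure: XPUTimer architecture. The training process (Megatron, FSDP, etc.) is observed by XPUTimer, which consists of a Tracing Daemon and a Diagnostic Engine. Event & stack data feed hang/error diagnosis; timing data feed slowdown diagnosis. Outputs: Error, Hang, Slowdown; macro metrics and micro metrics.]

  [Figure: tracing internals. A training thread and a tracing thread. "Intercept API" hooks sit on the Python runtime and "Intercept kernel" hooks sit on the CUDA runtime. Components: Timing manager, Recorded Event Queue, Event Pool, Diagnostic engine.]

  What is intercepted:
    Python runtime (API intercept & timing): GC, synchronization, dataloader
    Kernel (kernel intercept & event injection): cuBLAS, FlashAttention, NCCL, custom ops, etc.
  Outputs: metrics (Prometheus), stacks, timeline

  [Figure: timeline for Rank 0 and Rank 1, each with a CPU thread, a GPU compute stream, and a GPU communication stream. Segments: GC, dataloader, sync, comp, comm, empty, inter-step. Annotated metrics: step, FLOPS, bandwidth, latency.]

  Properties:
    Transparent to users
    Low overhead, high precision
    Lightweight and friendly
    High data efficiency

https:/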
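The "API intercept & timing" path named on the XPUTimer slides can be illustrated with a minimal Python sketch. It is not XPUTimer's implementation: the names `intercept`, `event_queue`, `tracing_loop`, `load_batch`, and `run_demo` are hypothetical, and it covers only CPU-side wall-clock timing of Python calls. The kernel-side path (intercepting CUDA launches and injecting events) is not shown. The idea sketched here is that the training thread only enqueues a small timing record, while a separate tracing thread drains the queue and aggregates the records.

```python
import functools
import queue
import threading
import time

# Hypothetical stand-ins, not XPUTimer's real API.
event_queue = queue.Queue()  # plays the role of a recorded-event queue
metrics = {}                 # name -> list of durations in seconds


def intercept(name):
    """Wrap a callable so every call is timed and recorded as an event."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # The calling (training) thread only enqueues the record.
                event_queue.put((name, time.perf_counter() - start))
        return wrapper
    return decorator


def tracing_loop():
    """Tracing thread: drain events and aggregate them off the hot path."""
    while True:
        item = event_queue.get()
        if item is None:  # sentinel: stop tracing
            break
        name, duration = item
        metrics.setdefault(name, []).append(duration)


@intercept("dataloader")
def load_batch(n):
    """Toy stand-in for a dataloader call."""
    return list(range(n))


def run_demo(steps=3):
    """Run a few intercepted calls with a tracing thread attached."""
    tracer = threading.Thread(target=tracing_loop)
    tracer.start()
    batches = [load_batch(4) for _ in range(steps)]
    event_queue.put(None)  # enqueued after all events, so none are lost
    tracer.join()
    return batches
```

Calling `run_demo(3)` leaves three duration samples under `metrics["dataloader"]`. Keeping aggregation on a separate thread is one plausible way to keep overhead on the training thread low, which is the property the slides emphasize.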