NVIDIA: 2026 Nemotron 3 Super Technical Report (English original + translation) (51 pages).pdf

2026-4-3

Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

NVIDIA

Abstract. We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, and 3) include MTP layers for inference acceleration through native speculative decoding. We pre-trained Nemotron 3 Super on 25 trillion tokens followed by post-training using supervised fine tuning (SFT) and reinforcement learning (RL). The final model supports up to 1M context length and achieves comparable accuracy on common benchmarks, while also achieving up to 2.2× and 7.5× higher inference throughput compared to GPT-OSS-120B and Qwen3.5-122B, respectively. Nemotron 3 Super datasets, along with the base, post-trained, and quantized checkpoints, are open-sourced on HuggingFace.

1. Introduction

The last few years have seen a rise in the popularity of Mixture-of-Experts (MoE) based Large Language Models (LLMs) (DeepSeek-AI, 2025c; Yang et al., 2025; GLM-4.5-Team, 2025). MoEs help LLMs achieve higher accuracy at a lower active parameter count than regular dense models (Dai et al., 2024; Lepikhin et al., 2020). Orthogonal to MoEs, Hybrid Mamba-Attention models have shown promise in significantly improving inference throughput (NVIDIA, 2025c). We combine these two directions of improvement in Nemotron 3 (NVIDIA, 2025c). As part of our Nemotron 3 series of models, we present Nemotron 3 Super, a 12 billion active, 120 billion total parameter MoE hybrid Mamba-Attention model. Nemotron 3 Super achieves better or on-par benchmark