
NVIDIA: Cosmos 3: Omnimodal World Models for Physical AI, technical report (English edition) (139 pages).pdf

Uploaded by: 小*** | ID: 1271260 | 2026-06-25 | 139 pages | 27.97 MB


2026-6-22

Cosmos 3: Omnimodal World Models for Physical AI

NVIDIA

Abstract

We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI, effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License at [URL missing from the extract] and huggingface.co/collections/nvidia/cosmos3. The project website is available at [URL missing from the extract].

Code: [URL missing from the extract]

Model Checkpoint
Cosmos3-Super: huggingface.co/nvidia/Cosmos3-Super
Cosmos3-Nano: huggingface.co/nvidia/Cosmos3-Nano
Cosmos3-Super-Text2Image: huggingface.co/nvidia/Cosmos3-Super-Text2Image
Cosmos3-Super-Image2Video: huggingface.co/nvidia/Cosmos3-Super-Image2Video
Cosmos3-Nano-Policy-DROID: huggingface.co/nvidia/Cosmos3-Nano-Policy-DROID

Open Synthetic Dataset
SDG-PhyxSim: huggingface.co/datasets/nvidia/PhysicalAI-WorldModel-Synthetic-Physical-Interaction-Scenes
SDG-RobotSim: huggingface.co/datasets/nvidia/PhysicalAI-WorldModel-Synthetic-Embodied-Robot-Scenes
SD…
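The checkpoint and dataset names listed in the preview can be gathered into a small lookup table. The following is a minimal sketch, not an official loader: the repository IDs are copied from the preview, while `hub_url` and `download_checkpoint` are hypothetical helper names introduced here. The download step assumes the third-party `huggingface_hub` package is installed and that the repositories are publicly reachable, neither of which the preview confirms.

```python
# Repository IDs as they appear in the preview text (not verified here).
CHECKPOINTS = {
    "Cosmos3-Super": "nvidia/Cosmos3-Super",
    "Cosmos3-Nano": "nvidia/Cosmos3-Nano",
    "Cosmos3-Super-Text2Image": "nvidia/Cosmos3-Super-Text2Image",
    "Cosmos3-Super-Image2Video": "nvidia/Cosmos3-Super-Image2Video",
    "Cosmos3-Nano-Policy-DROID": "nvidia/Cosmos3-Nano-Policy-DROID",
}

DATASETS = {
    "SDG-PhyxSim": "nvidia/PhysicalAI-WorldModel-Synthetic-Physical-Interaction-Scenes",
    "SDG-RobotSim": "nvidia/PhysicalAI-WorldModel-Synthetic-Embodied-Robot-Scenes",
}


def hub_url(repo_id, is_dataset=False):
    """Build the browser URL for a model or dataset repository (hypothetical helper)."""
    prefix = "datasets/" if is_dataset else ""
    return f"https://huggingface.co/{prefix}{repo_id}"


def download_checkpoint(name, local_dir=None):
    """Fetch all files of one checkpoint (hypothetical helper).

    Assumes `huggingface_hub` is installed; it is imported lazily so the
    lookup tables above stay usable without it.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=CHECKPOINTS[name], local_dir=local_dir)
```

Calling `hub_url(CHECKPOINTS["Cosmos3-Nano"])` yields the page URL without touching the network; `download_checkpoint("Cosmos3-Nano")` would attempt the actual download.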

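1. **Model overview**: NVIDIA introduces Cosmos 3, an omnimodal world model that processes language, image, video, audio, and action sequences in a single mixture-of-transformers (MoT) architecture. It supports flexible input-output configurations and integrates vision-language models, video generators, world simulators, and action models.
2. **Performance**: Cosmos 3 reaches state-of-the-art results on multiple tasks. Its Text-to-Image and Image-to-Video generation models ranked first among open-source models on Artificial Analysis, and its robot policy model leads on RoboArena.
3. **Open-source resources**: The code, models (such as Cosmos3-Super and Cosmos3-Nano), synthetic datasets (the SDG series), and evaluation benchmark (Cosmos-HUE) are open-sourced at github.com/nvidia/cosmos and huggingface.co/collections/nvidia/cosmos3.
4. **Architecture**: The design comprises multimodal encoders, a dual-tower layer structure (reasoner and generator), and 3D multimodal positional embeddings, and supports multiple generation modes (such as text-to-video generation and action prediction).
5. **Training data**: The reasoner is trained on 24.2M samples (22.0M for pre-training plus 2.2M for fine-tuning). The generator relies on large-scale multimodal data covering Physical AI tasks (robotics, autonomous driving, and others).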
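The dual-tower layer in item 4 can be pictured with a toy layer in the spirit of the general mixture-of-transformers idea: each token is projected with the weights of its own tower, while attention runs over the whole sequence. This is only an illustrative sketch under stated assumptions. The preview does not describe the real layer, so the `Tower` class, the `dual_tower_layer` function, the bidirectional attention, the dimensions, and the random weights are all hypothetical.

```python
import math
import random


def matvec(weight, x):
    """Multiply a nested-list matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class Tower:
    """Hypothetical parameter set for one tower: its own Q/K/V/output projections."""

    def __init__(self, dim, rng):
        def rand():
            return [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]

        self.wq, self.wk, self.wv, self.wo = rand(), rand(), rand(), rand()


def dual_tower_layer(tokens, tower_ids, towers):
    """One attention layer with tower-specific weights and shared attention."""
    dim = len(tokens[0])
    # Step 1: project every token with the weights of the tower it belongs to.
    q = [matvec(towers[t].wq, x) for x, t in zip(tokens, tower_ids)]
    k = [matvec(towers[t].wk, x) for x, t in zip(tokens, tower_ids)]
    v = [matvec(towers[t].wv, x) for x, t in zip(tokens, tower_ids)]
    outputs = []
    for i, q_i in enumerate(q):
        # Step 2: attend over ALL tokens, regardless of tower (assumed bidirectional).
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(dim) for k_j in k]
        weights = softmax(scores)
        mixed = [sum(w * v_j[d] for w, v_j in zip(weights, v)) for d in range(dim)]
        # Step 3: tower-specific output projection plus a residual connection.
        projected = matvec(towers[tower_ids[i]].wo, mixed)
        outputs.append([x + p for x, p in zip(tokens[i], projected)])
    return outputs


rng = random.Random(0)
DIM = 4
towers = {"reasoner": Tower(DIM, rng), "generator": Tower(DIM, rng)}
tokens = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(5)]
tower_ids = ["reasoner", "reasoner", "generator", "generator", "generator"]
outputs = dual_tower_layer(tokens, tower_ids, towers)
```

Because attention is shared, reassigning a token to the other tower changes the outputs of every token in the sequence, not only its own. That coupling is what lets two separately parameterized towers still exchange information.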