
DeepSeek VL Technical Report (English Edition) (33 pages).pdf

Uploaded by: 淘*** | ID: 650872 | 2025-04-07 | PDF, 33 pages, 5.80 MB, 19 figures and tables


DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu*1, Wen Liu*1, Bo Zhang*1, Bingxuan Wang1, Kai Dong1, Bo Liu1, Jingxiang Sun1, Tongzheng Ren1, Zhuoshu Li1, Hao Yang1, Yaofeng Sun1, Chengqi Deng1, Hanwei Xu1, Zhenda Xie1, Chong Ruan1

1 DeepSeek-AI

neal, liuwen, https:/

We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions:

Data Construction: We strive to ensure our data is diverse, scalable and extensively covers real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based content (expert knowledge, textbooks), aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction-tuning dataset accordingly. The fine-tuning with this dataset substantially improves the model's user experience in practical applications.

Model Architecture: Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024) within a fixed token budget, while maintaining a relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks.

Training Strategy: We posit that a proficient Vision-Language Model should, foremost, possess strong language abilities. To ensure the preservation of LLM capabilities during pretraining, we investigate an effective VL pretraining strategy by integrating LLM training from the beginning and carefully managing the competitive dynamics observed between vision and language modalities. Starting with a focus on text, we gradually adjust the ratio to f
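The text-first schedule described under Training Strategy can be illustrated with a small sketch. The excerpt is cut off before it gives any details, so everything here is an assumption made for illustration: the linear shape of the ramp, the step counts, the function names `text_fraction` and `sample_modalities`, and the end value of 0.7, which is borrowed from the page summary's figure of at least 70% language data. It is not the report's actual training code.

```python
import random


def text_fraction(step, total_steps, start=1.0, end=0.7):
    """Share of text-only samples at a given step (hypothetical linear ramp).

    Starts at `start` (text only) and moves linearly to `end`; steps outside
    the range [0, total_steps] are clamped to the endpoints.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def sample_modalities(step, total_steps, batch_size, rng):
    """Draw a modality label for each sample in a batch at the given step."""
    p_text = text_fraction(step, total_steps)
    return [
        "text" if rng.random() < p_text else "vision-language"
        for _ in range(batch_size)
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    for step in (0, 500, 1000):
        batch = sample_modalities(step, 1000, 8, rng)
        print(step, round(text_fraction(step, 1000), 2), batch.count("text"))
```

Sampling a label per example keeps the sketch simple; a real pipeline would more likely mix pre-built data streams at the chosen ratio.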


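This report introduces DeepSeek-VL, an open-source vision-language (VL) model intended to support real-world vision and language understanding applications. The model is built around three key dimensions: data construction, model architecture, and training strategy. For data construction, DeepSeek-VL collects data from diverse sources, including web screenshots, PDFs, OCR, charts, and knowledge-based text (such as expert knowledge and textbooks), to cover practical scenarios comprehensively. For model architecture, DeepSeek-VL uses a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024) while keeping computational overhead relatively low. For training strategy, the model keeps at least 70% language data during pretraining to preserve its language capabilities. DeepSeek-VL performs strongly on multiple vision-language benchmarks, and on some tasks it even surpasses larger commercial models.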
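To see why processing 1024 x 1024 images under a bounded token count is non-trivial, consider the arithmetic for a plain patch-based encoder, which emits one token per patch. The sketch below is a generic illustration, not a description of DeepSeek-VL's hybrid encoder: the patch size of 16, the budget values, and the helper names `patch_tokens` and `pooling_stride_for_budget` are all assumptions.

```python
import math


def patch_tokens(height, width, patch=16):
    """Number of tokens a patch-based encoder emits for one image."""
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch) * (width // patch)


def pooling_stride_for_budget(height, width, budget, patch=16):
    """Smallest integer stride s such that merging s x s neighbouring
    tokens brings the token count within `budget` (hypothetical helper)."""
    if budget < 1:
        raise ValueError("budget must be at least 1")
    patch_tokens(height, width, patch)  # reuse the divisibility check
    rows, cols = height // patch, width // patch
    stride = 1
    while math.ceil(rows / stride) * math.ceil(cols / stride) > budget:
        stride += 1
    return stride


if __name__ == "__main__":
    for side in (384, 1024):
        print(side, patch_tokens(side, side))
    print(pooling_stride_for_budget(1024, 1024, budget=1024))
```

Token count grows with the square of the image side, so moving from a low-resolution input to 1024 x 1024 multiplies the sequence length several times over unless the encoder compresses its output.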

