DeepSeek VL Technical Report (English version) (33 pages).pdf

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Haoyu Lu*¹, Wen Liu*¹, Bo Zhang*¹, Bingxuan Wang¹, Kai Dong¹, Bo Liu¹, Jingxiang Sun¹, Tongzheng Ren¹, Zhuoshu Li¹, Hao Yang¹, Yaofeng Sun¹, Chengqi Deng¹, Hanwei Xu¹, Zhenda Xie¹, Chong Ruan¹
¹DeepSeek-AI
neal, liuwen, https:/

We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions:

Data Construction: We strive to ensure our data is diverse, scalable and extensively covers real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based content (expert knowledge, textbooks), aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction-tuning dataset accordingly. The fine-tuning with this dataset substantially improves the model's user experience in practical applications.

Model Architecture: Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024) within a fixed token budget, while maintaining a relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks.

Training Strategy: We posit that a proficient Vision-Language Model should, foremost, possess strong language abilities. To ensure the preservation of LLM capabilities during pretraining, we investigate an effective VL pretraining strategy by integrating LLM training from the beginning and carefully managing the competitive dynamics observed between vision and language modalities. Starting with a focus on text, we gradually adjust the ratio to f

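This report introduces DeepSeek-VL, an open-source vision-language (VL) model intended to support real-world vision and language understanding applications. The model is built around three key dimensions: data construction, model architecture, and training strategy. For data construction, DeepSeek-VL collects data from diverse sources, including web screenshots, PDFs, OCR, charts, and knowledge-based text (such as expert knowledge and textbooks), to cover practical scenarios comprehensively. For model architecture, DeepSeek-VL uses a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024) while keeping computational overhead low. For training strategy, at least 70% of the pretraining data is language data, which preserves the model's language ability. DeepSeek-VL performs strongly on multiple vision-language benchmarks and on some tasks even surpasses larger commercial models.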
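The training strategy described above (start with a focus on text, shift the data ratio gradually, and keep at least 70% language data) can be illustrated with a small sampling schedule. This is a minimal sketch, not the authors' implementation: the linear shape of the schedule, the step counts, and the names `text_fraction` and `sample_modality` are all assumptions made for illustration. Only the 70% floor comes from the summary.

```python
import random


def text_fraction(step: int, total_steps: int, floor: float = 0.7) -> float:
    """Share of text-only samples at a given step.

    Assumption: a linear decay from 1.0 (text only) down to `floor`.
    The report summary only states the floor, not the schedule shape.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps past the end stay at the floor.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - (1.0 - floor) * progress


def sample_modality(step: int, total_steps: int, rng: random.Random,
                    floor: float = 0.7) -> str:
    """Pick which kind of sample to draw next (hypothetical helper)."""
    if rng.random() < text_fraction(step, total_steps, floor):
        return "text"
    return "vision-language"


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 250, 500, 1000):
        share = text_fraction(step, 1000)
        print(f"step {step}: text share {share:.3f}, "
              f"next sample: {sample_modality(step, 1000, rng)}")
```

Under this assumed schedule the text share never drops below the floor, so the language-data constraint holds at every step regardless of how long the transition takes.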