W4_Sulin.pptx

Uploaded by: 拾起 | ID: 1235565 | 2026-05-04 | 24 pages | 405.39 KB

RAG Recipes
Retrieval techniques on a travel corpus
Patrick Sulin, Data Summit Boston
Colab Jupyter Notebook - https:/

On the menu
Plus brief detours on ColBERT and hybrid retrieval: same fusion idea, different inputs.
Corpus: 11,319 chunks; 305 WikiVoyage destinations; all-MiniLM-L6-v2 (384-d)
Follow along: huggingface.co/datasets/patjs/rag-workshop

What's actually in the corpus
BALI EAT
Restaurants catering to tourists do nearly always provide some vegetarian options, and in places like Seminyak and Ubud there are even dedicated vegetarian restaurants.
BALI DO
Warm waters keep Bali near the top of world surfing destinations. Expert surfers head for the big breaks off the Bukit Peninsula; beginners stick to the sandy areas between Kuta and Legian to learn.
CORPUS SHAPE
305 destinations; 11,319 chunks; median 26 chunks per destination (max 156); top sections: Get around, Understand, Get in, Do, Eat, See
CHUNKING STRATEGY
Section-aligned: each WikiVoyage section starts as one chunk
Long sections split by paragraph: target 300 words, max 600
Paragraphs intact: no sliding window, no overlap
One destination, many chunks. Each chunk carries one aspect by construction.

Two queries
Query 1: What's there to do in Iceland?
One named destination, one constraint: basic RAG should nail it.
Query 2: Tropical destinations with great snorkeling and vegetarian food
Three constraints in one sentence: geography, activity, diet.

Basic RAG
Query 1 (Iceland): top-1 0.671; all five top results: Iceland chunks
Query 2 (tropical): top-1 0.578; R20 = 3/21 (14%)
Bad retrieval, bad answer.

Reranking
query -> bi-encoder (fast) -> top-N candidates -> cross-encoder (slow, accurate) -> top-k reordered
The cross-encoder can spend serious capacity because we only ask it about the small handful of candidates the bi-encoder surfaced.
Two-stage retrieval is a commo
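The reranking flow described on the Reranking slide (a fast first pass for the top-N, then a slower and more accurate pass for the top-k) can be sketched in a few lines. This is a minimal illustration, not the workshop's notebook code: `two_stage_retrieve`, `overlap_score`, and `jaccard_score` are hypothetical names, and both toy scorers are word-overlap stand-ins for a real bi-encoder and cross-encoder so that the example runs without downloading a model.

```python
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str, str], float]


def _tokens(text: str) -> set:
    """Lowercased whitespace tokens; a deliberately crude tokenizer."""
    return set(text.lower().split())


def overlap_score(query: str, doc: str) -> float:
    """Stand-in for the fast stage (assumption): count of shared tokens."""
    return float(len(_tokens(query) & _tokens(doc)))


def jaccard_score(query: str, doc: str) -> float:
    """Stand-in for the slow stage (assumption): Jaccard similarity."""
    q, d = _tokens(query), _tokens(doc)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def two_stage_retrieve(
    query: str,
    docs: Sequence[str],
    fast_score: Scorer,
    slow_score: Scorer,
    top_n: int = 20,
    top_k: int = 5,
) -> List[Tuple[float, str]]:
    """Return the top_k (score, doc) pairs after reranking top_n candidates."""
    # Stage 1: the cheap scorer sees every document and keeps top_n.
    candidates = sorted(docs, key=lambda d: fast_score(query, d), reverse=True)[:top_n]
    # Stage 2: the expensive scorer only sees the surviving candidates.
    rescored = [(slow_score(query, d), d) for d in candidates]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return rescored[:top_k]


# Toy data, invented for illustration only.
DOCS = [
    "Bali snorkeling reefs and vegetarian restaurants in Ubud",
    "Iceland glaciers and hot springs",
    "Tropical snorkeling trips",
    "Vegetarian food guide for Reykjavik",
]
QUERY = "tropical snorkeling vegetarian"

if __name__ == "__main__":
    for score, doc in two_stage_retrieve(QUERY, DOCS, overlap_score, jaccard_score, top_n=3, top_k=2):
        print(f"{score:.3f}  {doc}")
```

With a real model pair, the fast scorer would typically be cosine similarity over precomputed embeddings and the slow scorer one cross-encoder forward pass per (query, chunk) pair; the control flow stays the same.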
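The chunking strategy from the corpus slide (section-aligned chunks, long sections split by paragraph, target 300 words, max 600, no overlap) might look roughly like the sketch below. The slide does not say what happens at the edges, so two points are assumptions: a section counts as "long" once it exceeds the 600-word cap, and a single paragraph over the cap is kept whole rather than cut. `chunk_section` is a hypothetical name, and paragraphs are assumed to be separated by blank lines.

```python
from typing import List


def chunk_section(text: str, target_words: int = 300, max_words: int = 600) -> List[str]:
    """Split one section into chunks along paragraph boundaries, no overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return []

    # Assumption: a section at or under the cap stays a single chunk.
    if sum(len(p.split()) for p in paragraphs) <= max_words:
        return ["\n\n".join(paragraphs)]

    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())
        # Close the running chunk if it already reached the target,
        # or if adding this paragraph would push it past the cap.
        if current and (count >= target_words or count + n > max_words):
            chunks.append("\n\n".join(current))
            current, count = [], 0
        # Paragraphs are never cut, so an oversized one becomes its own chunk.
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because every paragraph lands in exactly one chunk, joining the chunks with blank lines reproduces the cleaned section text, which is a quick way to check the "no sliding window, no overlap" property.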
