RAG Recipes
Retrieval techniques on a travel corpus
Patrick Sulin, Data Summit Boston
Colab Jupyter Notebook: https:/

On the menu
Plus brief detours on ColBERT and hybrid retrieval: same fusion idea, different inputs.
Corpus: 11,319 chunks; 305 WikiVoyage destinations; all-MiniLM-L6-v2 (384-d)
Follow along: huggingface.co/datasets/patjs/rag-workshop

What's actually in the corpus
BALI, EAT
Restaurants catering to tourists do nearly always provide some vegetarian options, and in places like Seminyak and Ubud there are even dedicated vegetarian restaurants.
BALI, DO
Warm waters keep Bali near the top of world surfing destinations. Expert surfers head for the big breaks off the Bukit Peninsula; beginners stick to the sandy areas between Kuta and Legian to learn.
CORPUS SHAPE
305 destinations, 11,319 chunks
Median 26 chunks per destination (max 156)
Top sections: Get around, Understand, Get in, Do, Eat, See
CHUNKING STRATEGY
Section-aligned: each WikiVoyage section starts as one chunk
Long sections split by paragraph: target 300 words, max 600
Paragraphs intact: no sliding window, no overlap
One destination, many chunks. Each chunk carries one aspect by construction.

Two queries
Query 1: What's there to do in Iceland?
One named destination, one constraint: basic RAG should nail it.
Query 2: Tropical destinations with great snorkeling and vegetarian food
Three constraints in one sentence: geography, activity, diet.

Basic RAG
Query 1 (Iceland): top-1 0.671; all five top results are Iceland chunks
Query 2 (tropical): top-1 0.578; R@20 = 3/21 (14%)

Bad retrieval, bad answer.

Reranking
query -> bi-encoder (fast) -> top-N candidates -> cross-encoder (slow, accurate) -> top-k reordered
The cross-encoder can spend serious capacity because we only ask it about the small handful of candidates the bi-encoder surfaced.
Two-stage retrieval is a commo