1、SPANN:Highly-efficient Billion-scale Approximate Nearest Neighbor SearchQiQi ChenChen1,BingZhao1,2,HaidongWang1,MingqinLi1,ChuanjieLiu1,3,ZengzhongLi1,MaoYang1 and JingdongWang1,41Microsoft,2PekingUniversity,3Tencent,4BaiduVector is the future of data representationThe success of deep learning turns
2、 everything into vectorsNatural language(text),Computer vision(image,video),Speech,etc.Vector Search DefinitionGiven a set of data vectors ,and a query q Vector search is ubiquitousSimilar Image SearchSimilar Item Search in RecommendationSimilar Answer Search in QADistance(e.g.Euclidean,Cosine)qxMet
3、hodologies5Reduce the distance computation numberReduce the distance computation costReduce memory costNearest Neighbor SearchNNGHNSWKD-treeSemantic word 1Semantic word 2Semantic word 3TP-treeHash to the bucket and then search among points in the bucketSPTAG:Space Partition Tree and Neighborhood Gra
4、ph https:/ stars)New features:fresh update and GPU index build(10X faster)BKTLess building cost,work well on low dimensional dataBetter accuracy in high dimensional dataEmpirical comparison of SPTAG and HNSW SPTAGHNSWNSGFAISSRelative neighborhood graphQuantizationAccuracy and efficiencyHighHighHighL
5、owMemory efficiencyLowLowLowHighStabilityGoodPoorPoorGoodDL Era Challenge:Super Large Scale Vector SearchCapacity:Vectors cannot be all fit into memoryCache has low efficiency:50%memory only caches 80%accessScalability:Increasing machines increases latency and resource cost query vectorSend to each
6、server TopKTopKTopKTopKresultsMerge resultsSynchronization Barrier Current State-of-the-art Large Scale ANNS SolutionsInverted-index based methodsDivide the data vector space into K clusters(posting lists)Store PQ compressed vectors and posting lists all in memoryOnly do search in a few posting list