
Identifying Official Firm Websites: A Comparison of Machine Learning-Based URL Retrieval Methods and AI-Powered Search Engines.pdf

Uploaded by: Fl****zo  ID: 718626  2025-06-22  24 pages  965.76 KB

Identifying Official Firm Websites: A Comparison of Machine Learning-Based URL Retrieval Methods and AI-Powered Search Engines
Donato Summa
Web Intelligence Network Conference “From Web to Data”, Gdańsk, 4-5/02/2025

URL retrieval
All NSIs maintain extensive administrative information on a long list of national enterprises. Unfortunately, the corresponding list of official website addresses is largely incomplete (at least in Italy).

We need the official addresses (URLs) of enterprise websites to extract information from their content, but manually retrieving official enterprise URLs is a very time-consuming operation, so the idea is to retrieve them automatically!

In the previous ESSnet Big Data 1 and Big Data 2 projects, among other things, we developed and improved URL retrieval systems at the national level.

Istat URL retrieval pipeline

OBEC annotation exercise
Goal: create an annotated dataset of enterprise-URL pairs.
Annotation is used to assess the quality of data processing and retrieval pipelines related to, among other things, enterprise URLs (Does the enterprise have one or more websites, and what are they?)
For each country, a sample of 500 legal units was drawn from the 2024 ICT sampling population, stratified by:
NACE section (first-level NACE code)
enterprise size (10-49, 50-249, 250+ employees)
Additional rules: put NACE sections with less than 5% of the sampling population into 1 category m...
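The sampling design described above (500 legal units per country, stratified by NACE section and size class, with small NACE sections pooled) can be illustrated with a minimal Python sketch. Several details are assumptions rather than facts from the slides: the record fields `nace` and `employees`, the pooled label `"OTHER"`, and the use of proportional allocation with largest-remainder rounding. The slides do not say how the 500 units were allocated across strata, and the preview cuts off before the additional rules are complete.

```python
import random
from collections import Counter, defaultdict


def size_class(employees):
    """Map an employee count to the three size classes named in the slides."""
    if employees >= 250:
        return "250+"
    if employees >= 50:
        return "50-249"
    return "10-49"


def merge_small_sections(units, threshold=0.05, merged_label="OTHER"):
    """Pool NACE sections holding less than `threshold` of the population.

    `merged_label` is a hypothetical name for the pooled category.
    """
    counts = Counter(u["nace"] for u in units)
    total = len(units)
    small = {sec for sec, c in counts.items() if c / total < threshold}
    return [dict(u, nace=merged_label if u["nace"] in small else u["nace"])
            for u in units]


def stratified_sample(units, n=500, seed=0):
    """Draw n units, allocated proportionally over (NACE section, size class).

    Proportional allocation is an assumption; the source does not state it.
    """
    if n > len(units):
        raise ValueError("sample size exceeds population size")
    rng = random.Random(seed)
    strata = defaultdict(list)
    for u in merge_small_sections(units):
        strata[(u["nace"], size_class(u["employees"]))].append(u)

    total = len(units)
    quotas = {k: n * len(v) / total for k, v in strata.items()}
    alloc = {k: int(q) for k, q in quotas.items()}
    # Largest-remainder rounding so the stratum sizes sum exactly to n.
    leftover = n - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1

    sample = []
    for k, members in strata.items():
        sample.extend(rng.sample(members, alloc[k]))
    return sample
```

Calling `stratified_sample(population, n=500)` on a list of dictionaries with `nace` and `employees` keys returns 500 records whose `nace` value is either an original section or the pooled label.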

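This document compares machine learning-based URL retrieval methods with AI-powered search engines for identifying official enterprise websites. Key points:
1. The list of official enterprise website addresses is incomplete and manual retrieval is time-consuming, so automatic retrieval is proposed.
2. In the ESSnet Big Data projects, the author's team developed URL retrieval systems at the national level.
3. An annotation exercise is used to assess the quality of data processing; the sample consists of 500 legal units, stratified by industry and enterprise size.
4. Retrieval accuracy: comparing manual retrieval with the Istat pipeline (Llama 3.1 8B), overall accuracy is 0.896; the consensus accuracy of the AI search engines is 43%.
5. The Istat pipeline gives the best results but is time-consuming and requires maintenance.
6. In current practice, about 43% of records can be handled fully automatically; the rest rely on the existing system.
7. The author notes that the performance gap of AI search engines may narrow in the future, and that URL retrieval may eventually be done with them alone.
Key figures: overall accuracy 0.896; AI search engine consensus accuracy 43%.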
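One possible reading of points 4 to 6 is a triage rule: accept a URL automatically when the AI search engines agree on it, and send every other record to the existing pipeline. The sketch below illustrates that reading only. The definition of "consensus" (every engine returns the same host after normalization), the normalization rules, and the function and engine names are all assumptions, not details from the presentation.

```python
from urllib.parse import urlsplit


def normalize_host(url):
    """Reduce a URL to a lowercase host without a leading 'www.'."""
    if "://" not in url:
        url = "//" + url  # lets urlsplit treat a bare domain as a network location
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def triage(candidates):
    """Decide how to handle one enterprise record.

    candidates: dict mapping a (hypothetical) engine name to the URL it
    returned, or None if it returned nothing.
    Returns ("auto", host) on full agreement, else ("fallback", None).
    """
    if not candidates or not all(candidates.values()):
        return ("fallback", None)  # at least one engine gave no answer
    hosts = {normalize_host(u) for u in candidates.values()}
    if len(hosts) == 1 and "" not in hosts:
        return ("auto", hosts.pop())
    return ("fallback", None)  # engines disagree: use the existing pipeline
```

Comparing hosts rather than full URLs is a deliberate simplification: two engines that return different pages of the same site still count as agreeing on the official website.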