
Quality Guidelines for Acquiring and Using Web Scraped Data.pdf

Uploaded by: Fl****zo  ID: 718605  2025-06-22  15 pages  344.65KB

Web Intelligence Network Conference: From Web to Data
4-5 February 2025, GDANSK - POLAND

Quality Guidelines for acquiring and using web scraped data
ESSnet WIN, WP4
Magdalena Six, Alexander Kowarik, Manveer Mangat, Johannes Gussenbauer (AUT)

Outline
o Organisational background
o Statistical production process incl. web-data
o Theoretical Framework for Landscaping
o Examples of quality guidelines in the throughput phase
o Guidelines for a centralized webscraping platform

Organisational background
Subgroups of WP4 of ESSnet WIN
Methodology: Deliverable 4.6: WP4 Methodology report on using webscraped data
Architecture: Deliverable D4.7: BREAL - Big Data REference Architecture and Layers for web scraped data
Quality: Deliverable 4.5: Quality Guidelines for acquiring and using web scraped data
Quality Assessment: Deliverable 4.8: Quality Assessment for the Statistical Use of Web Scraped Data
All deliverables of WP4 at https:/

processes along the production process
Quality-relevant processes along the production process

Spotlight: Landscaping
Definition: Landscaping refers to the cataloguing and measurement of all web-based data sources relevant for the topic of interest.
The effort of landscaping varies depending on the topic of interest:
- All needed data might be available on one website. Example: satellite data
- The great extent of existing websites and the impossibility to scrape and combine them all makes it necessary to select websites. Examples: online job advertisements, real estate prices or price statistics
- All websites w.r.t. topic of interest should be scraped, combination of ingested information is possible. Example: enterprise characteristics

Landscaping: Selection of websites
Which websites to scrape?
- Most important ones? Highest quality?
- Score is needed
Three groups of information to take into account: Information from the we
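To make the "score is needed" step concrete, here is a minimal sketch of a weighted-sum ranking, one simple form of multi-criteria scoring. It is an illustration only, not the method from the deliverable: the three criterion keys mirror the three groups of information named in the page summary (website information, meta-information, experience), while the weights, site names, and score values are made-up assumptions.

```python
from dataclasses import dataclass

# Hypothetical weights per criterion group. The slides name three groups of
# information but give neither weights nor a specific scoring model.
WEIGHTS = {"website_info": 0.5, "meta_info": 0.3, "experience": 0.2}


@dataclass
class Site:
    name: str
    scores: dict  # criterion name -> score already normalised to [0, 1]


def weighted_score(site, weights=WEIGHTS):
    """Weighted sum of criterion scores, divided by the total weight."""
    total_weight = sum(weights.values())
    raw = sum(w * site.scores.get(c, 0.0) for c, w in weights.items())
    return raw / total_weight


def rank_sites(sites, weights=WEIGHTS):
    """Return the sites ordered from highest to lowest score."""
    return sorted(sites, key=lambda s: weighted_score(s, weights), reverse=True)


# Made-up candidate websites with made-up scores.
sites = [
    Site("portal-a.example", {"website_info": 0.9, "meta_info": 0.4, "experience": 0.5}),
    Site("portal-b.example", {"website_info": 0.6, "meta_info": 0.9, "experience": 0.8}),
    Site("portal-c.example", {"website_info": 0.3, "meta_info": 0.5, "experience": 0.2}),
]

ranking = [s.name for s in rank_sites(sites)]
print(ranking)
```

A missing criterion counts as 0 in this sketch, which penalises sites with incomplete information; whether that is appropriate depends on the use case.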

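This document presents the quality guidelines for web scraped data given at the Web Intelligence Network Conference. Key points:
1. Organisational background: introduces WP4 of ESSnet WIN, which covers methodology, architecture, quality and quality assessment.
2. Core processes: highlights the quality-relevant steps of the scraping process, such as landscaping (selecting relevant websites), detecting concept drift, and deduplication.
3. Website selection: proposes scoring and ranking websites with a multi-criteria decision model based on information from the website, meta-information about the website, and experience.
4. Guideline examples: measuring changes in source popularity, deduplication strategies (such as using unique identifiers and address features), and quality assessment of annotation exercises.
5. Centralized scraping platform: sets out technical requirements such as smooth workflows, portability, open source first, modularity, user-friendly access, transparent metadata, scheduling and resource management.
Core data references:
- Website scoring criteria: information, meta-information, experience.
- Scoring model: multi-criteria decision model.
- Deduplication strategy: use unique identifiers and address features.
- Annotation exercise: sample design, defining the time frame, establishing annotation guidelines.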
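The deduplication strategy mentioned above (unique identifiers plus address features) could look roughly like the sketch below. It assumes scraped records are dictionaries; the field names `listing_id`, `address` and `postcode`, the normalisation rules, and the sample records are all hypothetical and not taken from the guidelines.

```python
import re
import unicodedata


def normalise_address(address):
    """Lowercase, drop combining accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", address)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def dedup_key(record):
    """Prefer the unique identifier; otherwise fall back to address features."""
    if record.get("listing_id"):
        return ("id", record["listing_id"])
    return ("addr", normalise_address(record["address"]), record.get("postcode", ""))


def deduplicate(records):
    """Keep the first record seen for each key, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = dedup_key(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Made-up records: two share an identifier, two share an address only.
records = [
    {"listing_id": "A1", "address": "Ul. Mariacka 5, Gdańsk", "postcode": "80-833"},
    {"listing_id": "A1", "address": "ul. Mariacka 5 Gdansk", "postcode": "80-833"},
    {"listing_id": None, "address": "Ul. Mariacka 7, Gdańsk", "postcode": "80-833"},
    {"listing_id": None, "address": "ul mariacka 7  gdansk", "postcode": "80-833"},
]

unique_records = deduplicate(records)
print(len(unique_records))
```

This sketch never matches a record that has an identifier against one that only has an address, and NFKD normalisation does not fold every letter (for example `ł` is left unchanged), so a production pipeline would likely need fuzzier address matching.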