
Selective scraping, sampling and other methods to minimize known causes of biases of web data.pdf

Uploaded by: Fl****zo ID: 718610 2025-06-22 13 pages 622.71 KB

Selective scraping, sampling and other methods to minimize known causes of biases of web data
Web Intelligence Network Conference
Alexander Kowarik, Piet Daas
05 February 2025
Trusted Smart Statistics Web Intelligence Network

Overview
- Sampling in the Context of Webscraped Statistics
- Methods specific to webscraped data and causes of bias

Co-financed by Web Intelligence Network: 101035829 2020-PL-SmartStat
Contributions to deliverables by several colleagues: Olav ten Bosch, Jacek Maslankowski, Magdalena Six, Johannes Gussenbauer, Sonia Quaresma and more
All deliverables of WP4 at https:/

In Memoriam: Prof. dr. Piet Daas
- Methodology lead
- Main author of "Deliverable 4.6: WP4 Methodology report on using webscraped data", on which this presentation is based.

Sampling: what for?
- Sampling for Quality Assessment
- Estimation: Probability and Non-Probability Sampling. The methodology for estimation and error estimation is very well developed, and we do know sampling methodology.
- Selective Scraping: Optimized Scraping Strategy

Sampling for Quality Assessment
Why Sampling Matters in Quality Assessment:
- Labor-intensive nature of manual annotation.
- Need for high-quality, representative annotated datasets.
Optimization Strategies:
- Reducing annotation volume with strategic sampling.
- Ensuring representative marginal distributions.
More on this in the deliverable.

Probability Sampling
- Probability sampling is used if the process of deriving a target variable is not easily scalable, e.g. when a statistical classification needs costly manual intervention.
- The situation is thus similar to a survey where each interview has a high cost and cannot easily be extended to the full population.
- There is a rich body of methodology developed for inference from random samples, from which a method for the sampling design and the applied estimation can be selected.

Non-Probability

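This document discusses how selective scraping, sampling, and related methods can reduce known biases in web data and so improve its statistical quality. Key points:

1. **Purpose of sampling**: Sampling serves quality assessment and estimation. It matters because manual annotation is labor-intensive and because high-quality, representative annotated datasets are needed.
2. **Probability vs. non-probability sampling**: Probability sampling suits cases where the process of deriving the target variable does not scale easily. Non-probability sampling applies where the probability of each unit being included in the dataset is unknown but can be explained.
3. **Selective scraping**: Selective scraping deliberately scrapes a subset of the target population, using prior knowledge to collect specific information in a more controlled way, with the aim of obtaining a representative dataset.
4. **Process and example**: The selective scraping process comprises source identification, source selection, and data extraction and enhancement. An example shows how to identify job vacancies within one business sector.
5. **Concept drift**: Concept drift is particularly pronounced in web data, so models must be checked regularly to confirm that they still measure the intended concept.
6. **Correcting model-induced bias**: Misclassification bias and proportion bias should be corrected, for example by using a test set to correctly determine the proportion of positives in the target population, so that estimates are not biased.

The document stresses the challenges of statistical inference from web data, calls for appropriate design, validation, and estimation methods, and argues against "quick and dirty" experimental approaches. The related deliverables are available at [https://github.com/WebIntelligenceNetwork/Deliverables](https://github.com/WebIntelligenceNetwork/Deliverables).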
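
The correction of misclassification bias mentioned in the summary can be illustrated with a small sketch. It is not taken from the presentation or the deliverable: it uses one standard correction (often called the Rogan-Gladen estimator) as a stand-in, and the function names `rates_from_test_set` and `corrected_prevalence` are hypothetical. It assumes a binary classifier whose sensitivity and specificity, measured on a manually annotated test set, also hold for the scraped population.

```python
from typing import Sequence, Tuple


def rates_from_test_set(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) of a binary classifier on an annotated test set.

    Hypothetical helper for illustration; labels are 1 (positive) or 0 (negative).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("test set needs at least one positive and one negative unit")
    return tp / (tp + fn), tn / (tn + fp)


def corrected_prevalence(observed_rate: float, sensitivity: float, specificity: float) -> float:
    """Correct the share of units classified positive for misclassification.

    Assumption: sensitivity and specificity from the test set carry over to the
    population. Solves observed = p * sens + (1 - p) * (1 - spec) for p.
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("classifier is no better than chance; correction is undefined")
    estimate = (observed_rate + specificity - 1.0) / denom
    # Sampling noise can push the estimate slightly outside [0, 1]; clip it.
    return min(1.0, max(0.0, estimate))


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    sens, spec = 0.9, 0.8
    observed = 0.34  # share of scraped units the model labels positive
    print(corrected_prevalence(observed, sens, spec))
```

The naive estimate here would be the observed share of 0.34; the corrected value is lower because a classifier with 80% specificity produces many false positives when true positives are rare. The correction is only as good as the test set, which is one reason representative sampling for annotation matters.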