Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies (cs.IR) --- user 7199428 -- 瞎采新闻

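Organizations disclose their privacy practices by posting privacy policies on their websites. Although users often care about their digital privacy, they tend not to read these policies, because doing so takes considerable time and effort. Natural language processing can help make privacy policies easier to understand, but until now there has been no large-scale privacy policy corpus for analyzing, understanding, and simplifying them. We therefore create PrivaSeer, a corpus of over one million English-language website privacy policies, significantly larger than any previously available corpus. We design a corpus creation pipeline that crawls the web and then filters documents using language detection, document classification, duplicate and near-duplicate removal, and content extraction. We examine the composition of the corpus, report results from readability tests, document similarity, and keyphrase extraction, and further explore the corpus through topic modeling.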

Original title: Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies

Original abstract: Organisations disclose their privacy practices by posting privacy policies on their website. Even though users often care about their digital privacy, they often don't read privacy policies since they require a significant investment in time and effort. Although natural language processing can help in privacy policy understanding, there has been a lack of large scale privacy policy corpora that could be used to analyse, understand, and simplify privacy policies. Thus, we create PrivaSeer, a corpus of over one million English language website privacy policies, which is significantly larger than any previously available corpus. We design a corpus creation pipeline which consists of crawling the web followed by filtering documents using language detection, document classification, duplicate and near-duplication removal, and content extraction. We investigate the composition of the corpus and show results from readability tests, document similarity, keyphrase extraction, and explored the corpus through topic modeling.

Authors: Mukund Srinath, Shomir Wilson, C. Lee Giles

Original URL: https://arxiv.org/abs/2004.11131
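The abstract names duplicate and near-duplicate removal as one of the filtering steps but does not say how it is done. The following is only a minimal sketch of one common approach, assuming exact duplicates are caught by hashing normalized text and near-duplicates by Jaccard similarity over word shingles. The function names, the shingle size `k=3`, and the `0.8` threshold are illustrative assumptions, not details from the paper.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        # Short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold: float = 0.8, k: int = 3):
    """Keep the first occurrence of each document; drop exact and near duplicates.

    Hypothetical helper: the threshold and shingle size are assumed values.
    """
    seen_hashes = set()
    kept, kept_shingles = [], []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(doc, k)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near duplicate of a document already kept
        seen_hashes.add(digest)
        kept.append(doc)
        kept_shingles.append(sh)
    return kept
```

This version compares each document against every document already kept, so its cost grows quadratically with corpus size. At the scale of a million policies, an approximate scheme such as MinHash with locality-sensitive hashing would be a more practical choice.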

大规模隐私：介绍网络隐私政策的PrivaSeer语料库（cs.IR）.pdf --- from the Tencent Cloud community --- user 7199428
