Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies (cs.IR), posted by user 7199428

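Organisations disclose their privacy practices by posting privacy policies on their websites. Although users often care about their digital privacy, they are reluctant to spend the considerable time and effort needed to read these policies. Natural language processing can help with understanding privacy policies, but until now there has been no large-scale privacy policy corpus that could be used to analyse, understand, and simplify them. We therefore created PrivaSeer, a corpus of over one million English-language website privacy policies, which is significantly larger than any previously available corpus. We designed a corpus creation pipeline that crawls the web and then filters the documents through language detection, document classification, duplicate and near-duplicate removal, and content extraction. We examine the composition of the corpus, report results from readability tests, document similarity, and keyphrase extraction, and explore the corpus further through topic modeling.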
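The abstract lists duplicate and near-duplicate removal as one filtering step but does not say how it is done. The following is a minimal sketch of one common approach, assuming word shingles compared by Jaccard similarity. That choice, the function names, and the 0.8 threshold are all illustrative assumptions, not the authors' stated method.

```python
from __future__ import annotations

import re


def shingles(text: str, k: int = 3) -> frozenset:
    """Return the set of k-word shingles of a text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Very short texts become a single shingle (or nothing at all).
        return frozenset([tuple(words)]) if words else frozenset()
    return frozenset(tuple(words[i:i + k]) for i in range(len(words) - k + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if it is below the threshold against every kept one.

    This pairwise loop is quadratic in the number of documents, so it only
    suits a small demo; the threshold value is a hypothetical default.
    """
    kept: list[str] = []
    kept_shingles: list[frozenset] = []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


if __name__ == "__main__":
    policies = [
        "We collect your email address to send you updates.",
        "We collect your email address to send you updates!",
        "Cookies are used to remember your preferences.",
    ]
    for policy in drop_near_duplicates(policies):
        print(policy)
```

A corpus of over a million documents would need something cheaper than all-pairs comparison, such as hashing-based candidate selection, but the similarity test itself would look much the same.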

Original title: Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies

Original abstract: Organisations disclose their privacy practices by posting privacy policies on their website. Even though users often care about their digital privacy, they often don't read privacy policies since they require a significant investment in time and effort. Although natural language processing can help in privacy policy understanding, there has been a lack of large scale privacy policy corpora that could be used to analyse, understand, and simplify privacy policies. Thus, we create PrivaSeer, a corpus of over one million English language website privacy policies, which is significantly larger than any previously available corpus. We design a corpus creation pipeline which consists of crawling the web followed by filtering documents using language detection, document classification, duplicate and near-duplication removal, and content extraction. We investigate the composition of the corpus and show results from readability tests, document similarity, keyphrase extraction, and explored the corpus through topic modeling.

Original authors: Mukund Srinath, Shomir Wilson, C. Lee Giles

Original URL: https://arxiv.org/abs/2004.11131
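The abstract mentions readability tests without naming a specific one. As a rough illustration, the sketch below computes the Flesch Reading Ease score, a widely used readability formula. Picking this formula is an assumption, and the syllable counter is a crude vowel-group heuristic, so the scores are only approximate.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count runs of vowels, with at least one syllable per word."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores mean easier text.

    score = 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


if __name__ == "__main__":
    plain = "We keep your data safe. We do not sell it."
    dense = (
        "Notwithstanding the aforementioned provisions, the organisation may "
        "disclose personally identifiable information to affiliated third parties."
    )
    print(round(flesch_reading_ease(plain), 1))
    print(round(flesch_reading_ease(dense), 1))
```

Short sentences built from short words score higher than long, legalistic sentences, which is the kind of contrast a readability test over privacy policies is meant to surface.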

Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies (cs.IR).pdf, from the Tencent Cloud Community, user 7199428

About the author: 瞎采新闻
