Original title: Unsupervised Opinion Summarization with Noising and Denoising

Original abstract: The supervised training of high-capacity models on large datasets containing hundreds of thousands of document-summary pairs is critical to the recent success of deep learning techniques for abstractive summarization. Unfortunately, in most domains (other than news) such training data is not available and cannot be easily sourced. In this paper we enable the use of supervised learning for the setting where there are only documents available (e.g., product or business reviews) without ground truth summaries. We create a synthetic dataset from a corpus of user reviews by sampling a review, pretending it is a summary, and generating noisy versions thereof which we treat as pseudo-review input. We introduce several linguistically motivated noise generation functions and a summarization model which learns to denoise the input and generate the original review. At test time, the model accepts genuine reviews and generates a summary containing salient opinions, treating those that do not reach consensus as noise. Extensive automatic and human evaluation shows that our model brings substantial improvements over both abstractive and extractive baselines.
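The synthetic-dataset construction the abstract describes can be sketched in a few lines. Note this is an illustrative sketch only: the paper introduces several linguistically motivated noise functions, whereas the stand-in below uses a simple random word drop; the function names (`word_noise`, `make_synthetic_pairs`) and parameters (`drop_prob`, `n_noisy`) are hypothetical, not from the paper.

```python
import random

def word_noise(tokens, drop_prob=0.1, rng=None):
    """Simple token-level noising: randomly drop words.
    An illustrative stand-in for the paper's linguistically
    motivated noise generation functions."""
    rng = rng or random.Random(0)
    kept = [t for t in tokens if rng.random() > drop_prob]
    return kept if kept else tokens  # never emit an empty pseudo-review

def make_synthetic_pairs(reviews, n_noisy=8, rng=None):
    """Sample each review, pretend it is a summary, and pair it with
    noisy copies that serve as pseudo-review inputs for training a
    denoising summarization model."""
    rng = rng or random.Random(0)
    pairs = []
    for review in reviews:
        tokens = review.split()
        pseudo_inputs = [" ".join(word_noise(tokens, rng=rng))
                         for _ in range(n_noisy)]
        pairs.append({"summary": review, "inputs": pseudo_inputs})
    return pairs
```

A summarizer trained on such pairs learns to map several noisy pseudo-reviews back to the clean original; at test time, genuine reviews take the place of the pseudo-inputs, and opinions without consensus behave like noise to be removed.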
Original authors: Reinald Kim Amplayo, Mirella Lapata
Original link: https://arxiv.org/abs/2004.10150
Unsupervised Opinion Summarization with Noising and Denoising (cs.CL).pdf, from the Tencent Cloud community, posted by 刘持诚