Original title: Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation

Original abstract: Open-domain dialog system evaluation is one of the most important challenges in dialog research. Existing automatic evaluation metrics, such as BLEU, are mostly reference-based: they compute the difference between the generated response and a limited number of available references. Likert-scale self-reported user ratings are widely adopted by social conversational systems, such as the Amazon Alexa Prize chatbots. However, self-reported user ratings suffer from bias and variance across different users. To alleviate this problem, we formulate dialog evaluation as a comparison task. We also propose an automatic evaluation model, CMADE (Comparison Model for Automatic Dialog Evaluation), that automatically cleans self-reported user ratings as it trains on them. Specifically, we first use a self-supervised method to learn a better dialog feature representation, and then use KNN and Shapley values to remove confusing samples. Our experiments show that CMADE achieves 89.2% accuracy on the dialog comparison task.
Original authors: Weixin Liang, James Zou, Zhou Yu
Original link: https://arxiv.org/abs/2005.10716
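To make the comparison formulation concrete, the sketch below shows what a pairwise dialog comparison model can look like. It is a minimal illustration, not the paper's actual architecture: it assumes dialog embeddings have already been produced by the self-supervised encoder from the first stage, and the class name `DialogComparisonModel`, layer sizes, and scoring head are all hypothetical.

```python
import torch
import torch.nn as nn

class DialogComparisonModel(nn.Module):
    """Hypothetical sketch of a pairwise comparison head.

    Scores each dialog embedding independently and predicts which of
    the two dialogs is better; not the actual CMADE architecture.
    """

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, dialog_a: torch.Tensor, dialog_b: torch.Tensor) -> torch.Tensor:
        # A positive logit means dialog A is predicted to be better than B.
        return self.scorer(dialog_a) - self.scorer(dialog_b)

# Pairs are labeled from (cleaned) self-reported ratings:
# label = 1.0 if dialog A was rated higher than dialog B, else 0.0.
model = DialogComparisonModel()
criterion = nn.BCEWithLogitsLoss()
emb_a, emb_b = torch.randn(4, 768), torch.randn(4, 768)  # dummy embeddings
labels = torch.tensor([1.0, 0.0, 1.0, 1.0])
loss = criterion(model(emb_a, emb_b).squeeze(-1), labels)
loss.backward()
```

Training on rating-derived pairs rather than raw Likert scores is what sidesteps per-user bias: only the relative order of two dialogs matters, not the absolute scale each user happens to use.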
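The abstract only names "KNN and Shapley" for the cleaning step. One standard instantiation is the exact, closed-form Shapley value of a K-nearest-neighbor classifier (Jia et al., 2019), which can be computed with a single sorted pass per test point. Below is a minimal NumPy sketch under that assumption; the function name and arguments are illustrative.

```python
import numpy as np

def knn_shapley(train_emb, train_labels, test_emb, test_labels, k=5):
    """Exact Shapley values for a K-NN classifier (Jia et al., 2019).

    Returns one value per training point, averaged over the test set.
    Low or negative values flag training dialogs whose rating-derived
    labels conflict with their neighbors -- candidates for removal.
    """
    n = len(train_labels)
    values = np.zeros(n)
    for x, y in zip(test_emb, test_labels):
        # Sort training points by distance to this test point (nearest first).
        order = np.argsort(np.linalg.norm(train_emb - x, axis=1))
        s = np.zeros(n)
        # Base case: the farthest training point.
        s[order[-1]] = float(train_labels[order[-1]] == y) / n
        # Recursion from the second-farthest point down to the nearest.
        for i in range(n - 2, -1, -1):
            cur, nxt = order[i], order[i + 1]
            s[cur] = s[nxt] + (
                float(train_labels[cur] == y) - float(train_labels[nxt] == y)
            ) / k * min(k, i + 1) / (i + 1)
        values += s
    return values / len(test_labels)
```

Training dialogs with the lowest (especially negative) values are the ones whose labels most disagree with their nearest neighbors in the learned feature space; dropping them before retraining is the "remove confusing samples" step the abstract describes.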