When training multilingual machine translation (MT) models, we often face imbalanced training sets: some languages have far more training data than others. The standard remedy is to up-sample the less resourced languages to increase their representation, and the degree of up-sampling has a large effect on overall performance. This paper proposes a method that instead automatically learns how to weight the training data via a data scorer, which is optimized to maximize performance on all test languages. Experiments on two sets of languages, under both one-to-many and many-to-one MT settings, show that the method not only consistently outperforms heuristic baselines in average performance, but also offers flexible control over which languages' performance is prioritized.
Original title: Balancing Training for Multilingual Neural Machine Translation
Original abstract: When training multilingual machine translation (MT) models that can translate to/from multiple languages, we are faced with imbalanced training sets: some languages have much more training data than others. Standard practice is to up-sample less resourced languages to increase representation, and the degree of up-sampling has a large effect on the overall performance. In this paper, we propose a method that instead automatically learns how to weight training data through a data scorer that is optimized to maximize performance on all test languages. Experiments on two sets of languages under both one-to-many and many-to-one MT settings show our method not only consistently outperforms heuristic baselines in terms of average performance, but also offers flexible control over the performance of which languages are optimized.
Authors: Xinyi Wang, Yulia Tsvetkov, Graham Neubig
Link: https://arxiv.org/abs/2004.06748
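The up-sampling heuristic that the paper uses as a baseline is commonly implemented as temperature-based sampling, where each language's sampling probability is proportional to its data fraction raised to the power 1/T. A minimal sketch of that heuristic (not the paper's learned data scorer; the function name and temperature values here are illustrative):

```python
# Sketch of temperature-based up-sampling, the common heuristic baseline
# for balancing multilingual MT training data. Illustrative names only.

def sampling_probs(sizes, temperature=5.0):
    """Return per-language sampling probabilities.

    sizes: training-set size for each language.
    temperature=1.0 reproduces proportional sampling; larger values
    flatten the distribution toward uniform, up-sampling
    low-resource languages.
    """
    total = sum(sizes)
    weights = [(n / total) ** (1.0 / temperature) for n in sizes]
    z = sum(weights)
    return [w / z for w in weights]

# Example: one high-resource (1M sentences) and one low-resource (10k) language.
probs_prop = sampling_probs([1_000_000, 10_000], temperature=1.0)
probs_flat = sampling_probs([1_000_000, 10_000], temperature=5.0)
```

With temperature 1.0 the low-resource language is sampled about 1% of the time; raising the temperature to 5.0 increases its share substantially, which illustrates why the choice of up-sampling degree matters so much, and why the paper replaces this fixed knob with a learned data scorer.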
Balancing Training for Multilingual Neural Machine Translation (cs.CL).pdf (shared on the Tencent Cloud community by Elva)