Original title: When does data augmentation help generalization in NLP?

Abstract: Neural models often exploit superficial ("weak") features to achieve good performance, rather than deriving the more general ("strong") features that we'd prefer a model to use. Overcoming this tendency is a central challenge in areas such as representation learning and ML fairness. Recent work has proposed using data augmentation--that is, generating training examples on which these weak features fail--as a means of encouraging models to prefer the stronger features. We design a series of toy learning problems to investigate the conditions under which such data augmentation is helpful. We show that augmenting with training examples on which the weak feature fails ("counterexamples") does succeed in preventing the model from relying on the weak feature, but often does not succeed in encouraging the model to use the stronger feature in general. We also find in many cases that the number of counterexamples needed to reach a given error rate is independent of the amount of training data, and that this type of data augmentation becomes less effective as the target strong feature becomes harder to learn.
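The augmentation strategy the abstract describes can be illustrated with a minimal toy sketch. This is not the paper's actual setup: the task, the `make_data` and `train_logreg` helpers, and the choice of a NumPy-only logistic regression are all assumptions made here for illustration. The label is fully determined by a "strong" feature, while a "weak" feature agrees with the label except on injected counterexamples; comparing the learned weight on the weak feature with and without counterexamples shows how augmentation discourages reliance on it.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, p_counterexample=0.0):
    # Toy task (hypothetical, for illustration): the label y is defined by
    # the "strong" feature; the "weak" feature agrees with y except on a
    # p_counterexample fraction of examples, where it is flipped.
    y = rng.integers(0, 2, n)
    strong = 2.0 * y - 1.0                         # strong feature: +/-1, determines y
    flip = rng.random(n) < p_counterexample        # counterexamples: weak feature fails
    weak = np.where(flip, -strong, strong)
    noise = rng.normal(size=(n, 3))                # distractor dimensions
    X = np.column_stack([strong, weak, noise])
    return X, y

def train_logreg(X, y, lr=0.5, steps=2000):
    # Plain gradient descent on the logistic loss (no bias term needed,
    # since the informative features are +/-1 coded).
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# No counterexamples: the weak feature is just as predictive as the strong one.
X0, y0 = make_data(2000, p_counterexample=0.0)
w0 = train_logreg(X0, y0)

# With 10% counterexamples, the weak feature becomes unreliable, so the
# model shifts weight toward the strong feature.
X1, y1 = make_data(2000, p_counterexample=0.1)
w1 = train_logreg(X1, y1)

print("weight on weak feature, no augmentation:   ", w0[1])
print("weight on weak feature, with counterexamples:", w1[1])
print("weight on strong feature, with counterexamples:", w1[0])
```

In this sketch the weak-feature weight drops once counterexamples are added, mirroring the paper's first finding; the abstract's subtler point, that the model still may not adopt the strong feature when that feature is hard to learn, does not arise here because the strong feature is trivially linear.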
Authors: Rohan Jha, Charles Lovering, Ellie Pavlick
Source: https://arxiv.org/abs/2004.15012
PDF: 数据扩充在NLP中什么时候有助于泛化_(CS.CL).pdf (posted by Tencent Cloud Community user 7236395)