MultiMWE: Building Multilingual Multi-Word Expression (MWE) Parallel Corpora (cs.CL)---刘子蔚--瞎采新闻

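Multi-word expressions (MWEs) are an active research topic in natural language processing (NLP), covering MWE detection, MWE decomposition, and the use of MWEs in other NLP areas such as machine translation (MT). However, bilingual and multilingual MWE corpora are scarce. The only bilingual MWE corpus we are aware of comes from the PARSEME (PARSing and Multi-word Expressions) EU project, and it is a small collection of just 871 English-German MWE pairs. In this paper, we present multilingual and bilingual MWE corpora that we extracted from root parallel corpora. After filtering, our collections contain 3,159,226 bilingual MWE pairs for German-English and 143,042 for Chinese-English. We examine the quality of the extracted bilingual MWEs in MT experiments. Our initial experiments applying MWEs in MT show improved translation of MWE terms in qualitative analysis and better general evaluation scores in quantitative analysis, on both the German-English and Chinese-English language pairs. We follow a standard experimental pipeline to create the MultiMWE corpora, which are available online. Researchers can use these free corpora in their own models or use them in a knowledge base as model features.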

Original title: MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora

Original abstract: Multi-word expressions (MWEs) are a hot topic in research in natural language processing (NLP), including topics such as MWE detection, MWE decomposition, and research investigating the exploitation of MWEs in other NLP fields such as Machine Translation. However, the availability of bilingual or multi-lingual MWE corpora is very limited. The only bilingual MWE corpora that we are aware of is from the PARSEME (PARSing and Multi-word Expressions) EU Project. This is a small collection of only 871 pairs of English-German MWEs. In this paper, we present multi-lingual and bilingual MWE corpora that we have extracted from root parallel corpora. Our collections are 3,159,226 and 143,042 bilingual MWE pairs for German-English and Chinese-English respectively after filtering. We examine the quality of these extracted bilingual MWEs in MT experiments. Our initial experiments applying MWEs in MT show improved translation performances on MWE terms in qualitative analysis and better general evaluation scores in quantitative analysis, on both German-English and Chinese-English language pairs. We follow a standard experimental pipeline to create our MultiMWE corpora which are available online. Researchers can use this free corpus for their own models or use them in a knowledge base as model features.

Authors: Lifeng Han, Gareth J.F. Jones, Alan F. Smeaton

Source: https://arxiv.org/abs/2005.10583

MultiMWE Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora.pdf ---from the Tencent Cloud Community---刘子蔚
