Facebook FAIR's WMT19 News Translation Task Submission

Abstract

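Language pairs: English-German and English-Russian.

Baseline: large BPE-based Transformer models that rely on sampled back-translations.

The authors tested different bitext filtering schemes and added filtered back-translated data. Models are ensembled and fine-tuned on in-domain (IND) data. Decoding uses noisy channel model reranking.

The 2019 system improves on the 2018 submission by 4.5 BLEU points.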

Introduction

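Back-translation that exploits high-quality monolingual data proved to be very important.

The Newscrawl dataset was back-translated; the authors also experimented with back-translating part of the noisier Commoncrawl dataset.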

> Intelligent selection of language model training data

The gain over last year is 4.5 BLEU points. Some of these gains can be attributed to differences in dataset quality, but we believe most of the improvement comes from larger models, larger scale back-translation, and noisy channel model reranking with strong channel and language models.

Data

Data preprocessing

en-de
normalize punctuation
Moses tokenizer
joint BPE (byte pair encoding), 32k

en-ru
BPE 24k for each language
For English-Russian, separate BPE encodings worked better than joint BPE.

Data filtering

Datasets crawled from the internet are very noisy.

# bitext
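1. langid: although langid is not the most accurate language identifier, it filters out very noisy sentences that contain many junk symbols.
2. Remove sentences longer than 250 tokens, as well as sentence pairs whose source/target length ratio exceeds 1.5.

Together these two filters removed about 30% of the data.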
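The two bitext filters can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes whitespace tokenization, a symmetric length ratio (longer side divided by shorter side), and a hypothetical `detect_lang` callable that returns a language code for a sentence.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

MAX_TOKENS = 250      # from the notes: drop sentences longer than 250 tokens
MAX_LEN_RATIO = 1.5   # from the notes: drop pairs whose length ratio exceeds 1.5


def keep_pair(
    src: str,
    tgt: str,
    src_lang: str,
    tgt_lang: str,
    detect_lang: Optional[Callable[[str], str]] = None,
) -> bool:
    """Return True if a sentence pair passes the length, ratio and language checks."""
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False
    if src_len > MAX_TOKENS or tgt_len > MAX_TOKENS:
        return False
    # Assumption: the ratio is symmetric (longer side / shorter side).
    if max(src_len, tgt_len) / min(src_len, tgt_len) > MAX_LEN_RATIO:
        return False
    # Optional language-ID check; detect_lang is a hypothetical hook.
    if detect_lang is not None:
        if detect_lang(src) != src_lang or detect_lang(tgt) != tgt_lang:
            return False
    return True


def filter_bitext(
    pairs: Iterable[Tuple[str, str]],
    src_lang: str,
    tgt_lang: str,
    detect_lang: Optional[Callable[[str], str]] = None,
) -> Iterator[Tuple[str, str]]:
    """Yield only the sentence pairs that pass keep_pair."""
    for src, tgt in pairs:
        if keep_pair(src, tgt, src_lang, tgt_lang, detect_lang):
            yield src, tgt
```

If the `langid` package is installed, one plausible hook is `lambda s: langid.classify(s)[0]`, since `langid.classify` returns a `(language, score)` tuple.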

# Monolingual
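Monolingual data is also filtered with langid.
For English-Russian, the Russian monolingual data is much smaller than the English or German data. To select high-quality in-domain data from the large monolingual Commoncrawl corpus, the authors used the method from "Intelligent selection of language model training data".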

System Overview

# base system
FAIRSEQ big Transformer
The authors experimented with increasing the embed dimension, FFN size, number of heads and number of layers. A larger FFN size (8192) gave some improvement while maintaining a manageable network size, so all subsequent models use the larger-FFN Transformer (FFN = 8192).

# large-scale back-translation
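Train a target-to-source system to translate monolingual target-side data, then train on the resulting synthetic pairs together with the human-translated data.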
> Understanding Back-Translation at Scale

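We used back-translations obtained by sampling from an ensemble of three target-to-source models.
Back-translating with the ensemble worked better than with a single model.
A 1:1 ratio of real (bitext) to synthetic (back-translated) data performed best.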
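One way to reach a 1:1 mix of real and synthetic data is to upsample the smaller corpus by repetition. The notes do not say how the ratio was obtained, so the sketch below is an assumption; `mix_one_to_one` is a hypothetical helper name.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def mix_one_to_one(bitext: Sequence[Pair], synthetic: Sequence[Pair]) -> List[Pair]:
    """Repeat the smaller corpus so both contribute the same number of pairs."""
    if not bitext or not synthetic:
        raise ValueError("both corpora must be non-empty")
    target = max(len(bitext), len(synthetic))

    def upsample(data: Sequence[Pair]) -> List[Pair]:
        # Repeat the data enough times to cover `target`, then truncate.
        reps = math.ceil(target / len(data))
        return (list(data) * reps)[:target]

    return upsample(bitext) + upsample(synthetic)
```

The output would normally be shuffled before training; that step is left out to keep the sketch deterministic.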

# back-translating commoncrawl
Russian news-domain data is scarce, so a domain-data filtering method is used to extract sentences similar to Newscrawl from the large open Commoncrawl dataset. We select a cutoff of 0.01, and use all sentences that score higher than this value for back-translation, or about 5% of the entire dataset.
> Intelligent selection of language model training data

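Selection method

Let $I$ be the in-domain data and $N$ the non-domain-specific data. The goal is to find the subset $N_I \subset N$ whose distribution is similar to that of $I$. For a given sentence $s$, Bayes' rule gives the probability that $s$ is drawn from $N_I$:

$$P(N_I \mid s, N) = \frac{P(s \mid N_I, N)\, P(N_I \mid N)}{P(s \mid N)}$$

$P(N_I \mid N)$ is a constant and can be ignored. Taking logs and replacing $P(s \mid N_I, N)$ with $P(s \mid I)$, on the assumption that $N_I$ and $I$ follow the same distribution, gives the score

$$\log P(s \mid I) - \log P(s \mid N)$$

or, after normalizing for length (sentence), $H_I(s) - H_N(s)$. Language models $L_I$ and $L_N$ are trained on $I$ and $N$ respectively, and $H_I(s)$, $H_N(s)$ are the word-normalized cross-entropy scores of sentence $s$ under them.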
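The selection score can be illustrated with two toy language models. This is only a sketch: it assumes add-one-smoothed unigram models in place of real language models, whitespace tokenization, and a sign convention in which the score is the word-normalized log-probability under the in-domain model minus that under the general model, so higher means more in-domain. `UnigramLM`, `domain_score` and `select_in_domain` are hypothetical names, and the 0.01 default simply mirrors the cutoff quoted in the notes.

```python
import math
from collections import Counter
from typing import Iterable, List, Set


class UnigramLM:
    """Add-one-smoothed unigram LM: a toy stand-in for a real language model."""

    def __init__(self, sentences: Iterable[str], vocab: Set[str]) -> None:
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab) + 1  # one extra slot for unseen tokens

    def avg_logprob(self, sentence: str) -> float:
        """Word-normalized log-probability of a sentence."""
        tokens = sentence.split()
        denom = self.total + self.vocab_size
        logprob = sum(math.log((self.counts[t] + 1) / denom) for t in tokens)
        return logprob / max(len(tokens), 1)


def domain_score(sentence: str, lm_in: UnigramLM, lm_gen: UnigramLM) -> float:
    """Higher score means the sentence looks more like the in-domain data."""
    if not sentence.split():
        return float("-inf")
    return lm_in.avg_logprob(sentence) - lm_gen.avg_logprob(sentence)


def select_in_domain(
    candidates: Iterable[str],
    in_domain: List[str],
    general: List[str],
    cutoff: float = 0.01,
) -> List[str]:
    """Keep the candidates whose score is above the cutoff."""
    vocab = {tok for s in in_domain + general for tok in s.split()}
    lm_in = UnigramLM(in_domain, vocab)
    lm_gen = UnigramLM(general, vocab)
    return [s for s in candidates if domain_score(s, lm_in, lm_gen) > cutoff]
```

With real neural or n-gram language models only `avg_logprob` would change; the selection logic stays the same.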

# fine-tuning
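After training on bitext and back-translated data, the models were further trained on a small amount of in-domain data. (Test sets from previous years were used; the number of fine-tuning steps is not stated.)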

# noisy channel model rerank
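Given a target sequence $y$ and a source sequence $x$, Bayes' rule gives

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$$

$P(x)$ is a constant for a given source and can be ignored. The remaining terms $P(y \mid x)$, $P(x \mid y)$ and $P(y)$ are the forward model, the channel model and the language model respectively. Each hypothesis is scored with

$$\log P(y \mid x) + \lambda_1 \log P(x \mid y) + \lambda_2 \log P(y)$$

The weights $\lambda_1, \lambda_2$ are searched in $[0, 2)$ and the length penalty in $[0, 1)$ (the upper bounds 2 and 1 are excluded).

Decoding uses beam size 50 to produce an n-best list. The channel model and language model then rescore the list, and the best hypothesis under the chosen weights and length penalty is selected.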
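The reranking step can be sketched as below. It assumes each n-best hypothesis already carries three log-probabilities (forward, channel, language model) and combines them as log P(y|x) + λ1 log P(x|y) + λ2 log P(y). Two things are assumptions, not taken from the notes: the length penalty is applied by dividing the combined score by `length ** lenpen`, and the weights are tuned by random search against a user-supplied `metric` callable (for example BLEU on a validation set). `Hypothesis`, `rerank` and `random_search` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Hypothesis:
    text: str
    forward: float  # log P(y|x), forward model
    channel: float  # log P(x|y), channel model
    lm: float       # log P(y), language model


def combined_score(h: Hypothesis, lam1: float, lam2: float, lenpen: float) -> float:
    """Weighted sum of the three log-probabilities with a length penalty."""
    length = max(len(h.text.split()), 1)
    total = h.forward + lam1 * h.channel + lam2 * h.lm
    # Assumption: the penalty divides the score by length ** lenpen.
    return total / (length ** lenpen)


def rerank(nbest: Sequence[Hypothesis], lam1: float, lam2: float, lenpen: float) -> Hypothesis:
    """Pick the hypothesis with the highest combined score."""
    return max(nbest, key=lambda h: combined_score(h, lam1, lam2, lenpen))


def random_search(
    dev_nbest: Sequence[Sequence[Hypothesis]],
    metric: Callable[[List[str]], float],
    trials: int = 100,
    seed: int = 0,
) -> Tuple[float, float, float, float]:
    """Sample weights in [0, 2) and a length penalty in [0, 1); keep the best."""
    rng = random.Random(seed)
    best = (0.0, 0.0, 0.0, float("-inf"))
    for _ in range(trials):
        # rng.random() is in [0.0, 1.0), so the upper bounds are excluded.
        lam1, lam2, lenpen = 2 * rng.random(), 2 * rng.random(), rng.random()
        outputs = [rerank(nb, lam1, lam2, lenpen).text for nb in dev_nbest]
        score = metric(outputs)
        if score > best[3]:
            best = (lam1, lam2, lenpen, score)
    return best
```

Setting both weights to zero reduces the reranker to the forward model alone, which gives a simple baseline for checking that the extra models help.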

# postprocessing
Postprocessing replaces a fixed set of punctuation marks: standard English quotation marks (“ ... ”) are changed to German-style quotation marks („ ... “).
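The quotation-mark replacement can be written as a two-step string substitution. The sketch assumes the German-style pair is „ ... “ (U+201E to open, U+201C to close); `germanize_quotes` is a hypothetical helper name.

```python
def germanize_quotes(text: str) -> str:
    """Convert English curly quotes to German-style quotes."""
    # Order matters: map the English opening quote first, so the closing
    # quote produced in the second step is not converted again.
    text = text.replace("\u201c", "\u201e")  # English opening -> German opening
    text = text.replace("\u201d", "\u201c")  # English closing -> German closing
    return text
```

Straight ASCII quotes are left untouched, because opening and closing positions cannot be told apart without extra logic.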