Machine translation transformer output - "unknown" tokens?


When decoding / translating a test dataset after training the base Transformer model (Vaswani et al.), I sometimes see the token "<unk>" in the output.

"unk" here refers to an unknown token, but my question is what is the reasoning behind that? Based on https://nlp.stanford.edu/pubs/acl15_nmt.pdf, does it mean that the vocab I built for the training set does not contain the word present in the test set?

For reference, I built the vocab using the spaCy en_core_web_sm and de_core_news_sm tokenizers for a German-to-English translation task.
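
To make the setup concrete, here is a minimal sketch of the kind of vocab lookup I mean (the token list and helper names are just illustrative, not my exact code):

```python
# Minimal sketch: a vocab built from the training tokens maps any
# unseen test token to <unk>.
UNK = "<unk>"

train_tokens = ["a", "girl", "in", "a", "dress", "is", "walking"]
vocab = {tok: idx for idx, tok in enumerate([UNK] + sorted(set(train_tokens)))}

def numericalize(tokens):
    # Tokens never seen during training fall back to the <unk> index.
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]

print(numericalize(["a", "jean", "dress"]))  # "jean" -> index 0, i.e. <unk>
```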

Example output:

ground truth = ['a', 'girl', 'in', 'a', 'jean', 'dress', 'is', 'walking', 'along', 'a', 'raised', 'balance', 'beam', '.']

predicted = ['a', 'girl', 'in', 'a', '<unk>', 'costume', 'is', 'jumping', 'on', 'a', 'clothesline', '.', '<eos>']

As you can see, "jean" comes out as "<unk>" here.

Best answer:

Neural machine translation models have a limited vocabulary. The reason is that you get the distribution over the target-vocabulary tokens by multiplying the decoder hidden state by a matrix that has one row per vocabulary token. The paper that you mention uses hidden states of 1000 dimensions. If you wanted to cover English reasonably, you would need a vocabulary of at least 200k tokens, which means 200,000 × 1,000 × 4 bytes ≈ 800 MB for this matrix alone.
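
To make the size argument concrete, here is a minimal PyTorch sketch (the numbers come from the paragraph above; everything else is illustrative, not the actual model):

```python
import torch
import torch.nn as nn

hidden_size = 1000     # hidden-state size used in the 2015 paper
vocab_size = 200_000   # rough vocabulary size needed to cover English well

# Output projection: one row of weights per target-vocabulary token.
output_projection = nn.Linear(hidden_size, vocab_size, bias=False)

# One decoder hidden state (batch of 1); projecting it gives one logit per
# token, and softmax turns the logits into a distribution over the vocabulary.
decoder_state = torch.randn(1, hidden_size)
probs = torch.softmax(output_projection(decoder_state), dim=-1)  # (1, 200_000)

# The projection matrix alone, in float32 (4 bytes per parameter):
print(vocab_size * hidden_size * 4 / 1e6, "MB")  # 800.0 MB
```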

The paper that you mention describes a now-outdated solution from 2015 that tries to make the vocabulary as large as possible. However, increasing the vocabulary size did not turn out to be the best solution: as the vocabulary grows, you add rarer and rarer words, there is less and less training signal for the embeddings of those words, and the model never learns to use them properly.

State-of-the-art machine translation uses segmentation into subwords, introduced in 2016 with the BPE algorithm. In parallel, Google came up with an alternative solution called WordPiece for its first production neural machine translation system. Later, in 2018, Google introduced an improved segmentation approach, SentencePiece.

The main principle of a subword vocabulary is that frequent words remain intact, whereas rarer words get segmented into smaller units. Rare words are often proper names that do not really get translated. For languages with complex morphology, subword segmentation also lets the model learn how to create different forms of the same word.
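
As an illustration, here is a minimal sketch using the sentencepiece library (the file names, vocabulary size, and the example split are placeholders, not recommendations):

```python
import sentencepiece as spm

# Train a small BPE model on a plain-text corpus, one sentence per line
# ("corpus.en" and the hyperparameters are placeholders).
spm.SentencePieceTrainer.train(
    input="corpus.en",
    model_prefix="en_bpe",
    vocab_size=8000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="en_bpe.model")

# Frequent words tend to stay whole, rare words get split into smaller pieces,
# so no token ever has to be replaced by <unk>.
print(sp.encode("a girl in a jean dress", out_type=str))
# e.g. ['▁a', '▁girl', '▁in', '▁a', '▁je', 'an', '▁dress'] -- the actual split
# depends on the training data
```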