NMT of Rare Words with Subword Units
Trying to understand the basic byte-pair encoding work
- Translation is an open-vocabulary problem: any word that is not in your vocabulary can show up.
- Two existing workarounds: ignore unknown words, or back off to a dictionary lookup.
- New solution: subword models. Various classes of words are more easily translatable via units smaller than words.
- Compares word segmentation methods: character n-gram models and byte-pair encoding (BPE).
- An encoder-decoder RNN (with attention) is used as the translation model.
- Hypothesis: segmenting rare words into appropriate subword units is sufficient for the NMT model to learn transparent translations and to generalize to produce unseen words.
- A large portion of unknown words are names, which can either be copied directly or need to be transliterated into the target language.
- NMT differs from phrase-based SMT in that there are strong incentives to keep the vocabulary small, for both time and space efficiency.
- Byte pair encoding (BPE) is a data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single unused byte.
- For NMT, instead of bytes, they merge characters or character sequences.
- Pairs that cross word boundaries are not considered.
- Versus Huffman encoding: the resulting symbol sequences are still interpretable as subword units.
- Two ways of applying BPE are evaluated: two separate encodings learned on source and target, and one joint encoding learned on both source and target together.
- BPE is open-vocabulary: the learned merge operations can be applied to the test set to obtain a segmentation with no unknown symbols (see the BPE sketch at the end of these notes).
- n-gram segmentation: each word is broken into tokens of length n; e.g. for bigrams, every two characters are paired together (see the bigram sketch at the end of these notes).
- Results: state-of-the-art results among NMT models, though still slightly worse than SMT models.
- Takeaway: a new method showing that NMT can handle the open-vocabulary problem by using subword units, and that this works better than the back-off dictionary approach.
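
To make the BPE bullets above concrete, here is a minimal, self-contained sketch of the merge loop in the spirit of the paper's algorithm; the toy word-frequency dictionary, the number of merge operations, and the `segment` helper are illustrative assumptions rather than the paper's exact code:

```python
import collections
import re

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs; pairs never cross word boundaries."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the chosen pair with its merged symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

def segment(word, merges):
    """Greedily apply the learned merges, in learned order, to a new word."""
    symbols = list(word) + ['</w>']
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Toy training vocabulary: words as space-separated characters plus an
# end-of-word marker '</w>', mapped to their corpus frequencies.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

merges = []
for _ in range(10):                     # number of merge operations (a hyperparameter)
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)    # most frequent adjacent symbol pair
    vocab = merge_vocab(best, vocab)
    merges.append(best)

print(merges[:4])                  # e.g. [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o')]
print(segment('lowest', merges))   # e.g. ['low', 'est</w>'] -- 'lowest' never appeared in training
```

Applying the learned merges in the order they were learned is what makes BPE open-vocabulary: an unseen word like 'lowest' falls apart into subwords the model has already seen during training.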
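
And a tiny sketch of the character bigram segmentation mentioned above; the end-of-word marker and the non-overlapping split are assumptions for illustration:

```python
# Non-overlapping character n-gram segmentation (here n = 2, i.e. bigrams).
# The '</w>' end-of-word marker is an assumption for this sketch.
def char_ngrams(word, n=2):
    chars = list(word) + ['</w>']
    return [''.join(chars[i:i + n]) for i in range(0, len(chars), n)]

print(char_ngrams('lower'))   # ['lo', 'we', 'r</w>']
```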