Trying to understand the basic byte-pair encoding work

  • Translation is an open-vocabulary problem: words outside the fixed training vocabulary can show up at test time.
    • two prior solutions: ignore the unknown word, or back off to a dictionary lookup
  • New solution: subword models. Many word classes (e.g. names, compounds, loanwords) are more transparently translatable via units smaller than words.
  • Two word segmentation methods are compared: character n-grams and byte-pair encoding (BPE).
  • The translation model is an encoder-decoder RNN with attention.
  • Hypothesis: segmenting rare words into appropriate subword units is sufficient for the NMT network to learn transparent translations and to generalize to produce unseen words.
  • A large portion of unknown words are names, which can either be copied directly into the target text or have to be transliterated.
  • NMT differs from phrase-based SMT in that it has strong incentives to minimize vocabulary size, since time and space costs (notably the output softmax) grow with it.
  • Byte pair encoding (BPE) is a data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single unused byte (see the sketch after this list).
    • for NMT: instead of merging bytes, they merge characters or character sequences
    • do not consider pairs that cross word boundaries
    • vs. Huffman encoding: the resulting symbol sequences are still interpretable as subword units
    • two ways of applying BPE are evaluated: two independent encodings learned separately for source and target, or one joint encoding learned over both (joint BPE), which keeps source and target segmentations consistent
    • BPE is open-vocabulary and learned merge operations can be applied to the test set to obtain a segmentation with no unknown symbols
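
A minimal sketch of BPE merge learning, following the pseudocode in the paper: words are represented as character sequences plus an end-of-word marker, and the most frequent adjacent symbol pair is merged repeatedly. The toy vocabulary, the `num_merges` value, and the `apply_bpe` helper for test-time segmentation are illustrative assumptions, not the paper's released code:

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count corpus frequency of each adjacent symbol pair in the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` (two space-separated symbols)
    with their concatenation, in every word of the vocabulary."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Greedily learn `num_merges` merge operations, most frequent pair first."""
    merges = []
    for _ in range(num_merges):
        stats = get_pair_stats(vocab)
        if not stats:
            break
        best = max(stats, key=stats.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

def apply_bpe(word, merges):
    """Segment an unseen word by replaying the learned merges in order
    (illustrative helper, an assumption rather than the paper's code)."""
    symbols = list(word) + ['</w>']
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Words are pre-split into characters plus an end-of-word marker '</w>',
# so learned merges never cross word boundaries.
toy_vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
             'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
merges = learn_bpe(toy_vocab, num_merges=10)
print(merges)                       # e.g. [('e', 's'), ('es', 't'), ...]
print(apply_bpe('lowest', merges))  # unseen word, no unknown symbols
```

Joint BPE would learn the same merge table on the concatenation of the source and target training text. Because the merge operations, rather than the merged vocabulary, are what get reused, any string over the initial character alphabet can be segmented at test time, which is what makes the encoding open-vocabulary.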
  • n-gram segmentation: each word is broken into consecutive tokens of length n, e.g. for bigrams every two characters are paired together (see the sketch below).
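
For contrast, a sketch of the fixed-length bigram segmentation described above (the helper name is hypothetical):

```python
def char_ngrams(word, n=2):
    """Split a word into consecutive, non-overlapping character n-grams."""
    return [word[i:i + n] for i in range(0, len(word), n)]

print(char_ngrams('lowest'))  # ['lo', 'we', 'st']
```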
  • Gets state-of-the-art results among NMT models, though still slightly worse than the best SMT systems.
  • Result: a new method showing NMT can handle the open-vocabulary problem with subword units, and it beats the back-off dictionary baseline.