Trying to understand the basic byte-pair encoding work

  • Translation is an open-vocabulary problem: words outside the fixed training vocabulary can show up at test time.
    • two prior solutions: ignore the unknown word, or back off to a dictionary lookup
  • New solution: subword models. Many word classes (e.g. names, compounds, loanwords) are more transparently translatable via units smaller than words.
  • Two word segmentation methods are compared: character n-grams and byte-pair encoding (BPE).
  • The translation model is an encoder-decoder RNN with attention.
  • Hypothesis: segmenting rare words into appropriate subword units is sufficient for the NMT network to learn transparent translations and to generalize to produce unseen words.
  • A large portion of unknown words are names, which can either be copied directly into the target text or have to be transliterated.
  • NMT differs from phrase-based SMT in that it has strong incentives to minimize vocabulary size, since time and space costs (notably the output softmax) grow with it.
  • Byte pair encoding (BPE) is a data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single unused byte (see the sketch after this list).
    • for NMT: instead of merging bytes, they merge characters or character sequences
    • do not consider pairs that cross word boundaries
    • vs. Huffman encoding: the resulting symbol sequences are still interpretable as subword units
    • two ways of applying BPE are evaluated: two independent encodings learned separately for source and target, or one joint encoding learned over both (joint BPE), which keeps source and target segmentations consistent
    • BPE is open-vocabulary and learned merge operations can be applied to the test set to obtain a segmentation with no unknown symbols
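
A minimal sketch of BPE merge learning, following the pseudocode in the paper: words are represented as character sequences plus an end-of-word marker, and the most frequent adjacent symbol pair is merged repeatedly. The toy vocabulary, the `num_merges` value, and the `apply_bpe` helper for test-time segmentation are illustrative assumptions, not the paper's released code:

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count corpus frequency of each adjacent symbol pair in the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` (two space-separated symbols)
    with their concatenation, in every word of the vocabulary."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Greedily learn `num_merges` merge operations, most frequent pair first."""
    merges = []
    for _ in range(num_merges):
        stats = get_pair_stats(vocab)
        if not stats:
            break
        best = max(stats, key=stats.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

def apply_bpe(word, merges):
    """Segment an unseen word by replaying the learned merges in order
    (illustrative helper, an assumption rather than the paper's code)."""
    symbols = list(word) + ['</w>']
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

# Words are pre-split into characters plus an end-of-word marker '</w>',
# so learned merges never cross word boundaries.
toy_vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
             'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
merges = learn_bpe(toy_vocab, num_merges=10)
print(merges)                       # e.g. [('e', 's'), ('es', 't'), ...]
print(apply_bpe('lowest', merges))  # unseen word, no unknown symbols
```

Joint BPE would learn the same merge table on the concatenation of the source and target training text. Because the merge operations, rather than the merged vocabulary, are what get reused, any string over the initial character alphabet can be segmented at test time, which is what makes the encoding open-vocabulary.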
  • n-gram segmentation: each word is broken into consecutive tokens of length n, e.g. for bigrams every two characters are paired together (see the sketch below).
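
For contrast, a sketch of the fixed-length bigram segmentation described above (the helper name is hypothetical):

```python
def char_ngrams(word, n=2):
    """Split a word into consecutive, non-overlapping character n-grams."""
    return [word[i:i + n] for i in range(0, len(word), n)]

print(char_ngrams('lowest'))  # ['lo', 'we', 'st']
```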
  • Gets state-of-the-art results among NMT models, though still slightly worse than the best SMT systems.
  • Result: a new method showing NMT can handle the open-vocabulary problem with subword units, and it beats the back-off dictionary baseline.