The precursor to GPT-2

  • Addresses a range of NLU tasks; strong results come from generative pre-training of a language model on a large, diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task.
  • A semi-supervised approach for language understanding tasks that combines unsupervised pre-training with supervised fine-tuning.
  • Goal: learn a universal representation that transfers with little adaptation to a wide range of tasks.
  • Use an LM objective on unlabeled data to learn the initial parameters of the network, then adapt these parameters to a target task using the corresponding supervised objective (both objectives are written out below).
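For reference, the two objectives, written in the note's own notation rather than the paper's $L_1$/$L_2$ symbols (the exact subscripts are my paraphrase):

```latex
% Unsupervised pre-training: left-to-right LM likelihood over unlabeled tokens u_1, ..., u_n,
% with context window k and model parameters \Theta.
L_{LM}(D_{unlabelled}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)

% Supervised fine-tuning: predict the label y from the final transformer activation of input x^1 ... x^m.
L_{finetune}(D_{labelled}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)
```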
  • Use a Transformer model: more structured memory for handling long-term dependencies than an LSTM. Specifically, a multi-layer Transformer decoder (one block is sketched below).
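A minimal PyTorch sketch of one such decoder block (causally masked self-attention followed by an MLP); the 768-dim / 12-head sizes match the paper's reported configuration, but the module itself is an illustration of the architecture family, not the paper's code:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One block of a decoder-only Transformer: masked self-attention + position-wise MLP."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        # Causal mask: True entries are positions a query may NOT attend to (the future).
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.ln1(x + attn_out)
        x = self.ln2(x + self.mlp(x))
        return x
```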
  • Use an auxiliary training objective when training on the supervised task.
  • Final fine-tuning objective $L_{final}(D_{labelled}) = L_{finetune}(D_{labelled}) + \lambda \cdot L_{LM}(D_{labelled})$ (code sketch below)
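A minimal sketch of that combined loss during fine-tuning, assuming a PyTorch-style `model` that returns both task logits and per-position next-token logits (the model interface and tensor shapes are assumptions, not the paper's code; λ = 0.5 is the value the paper reports):

```python
import torch.nn.functional as F

def finetune_loss(model, tokens, labels, lm_lambda=0.5):
    """L_final = L_finetune + lambda * L_LM, computed on labelled data.

    tokens: (batch, seq_len) token ids of the transformed input sequence
    labels: (batch,) task labels
    """
    # Hypothetical model interface: (batch, n_classes) task logits and
    # (batch, seq_len, vocab) language-modeling logits.
    task_logits, lm_logits = model(tokens)

    # Supervised fine-tuning loss on the task labels.
    loss_finetune = F.cross_entropy(task_logits, labels)

    # Auxiliary LM loss: predict token t+1 from positions <= t (shift targets by one).
    loss_lm = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
    return loss_finetune + lm_lambda * loss_lm
```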
  • To transfer, they transform the inputs for different tasks: structured inputs are converted into a single ordered token sequence with special start, delimiter, and extract tokens (example below).
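For example, the entailment and similarity transformations might look like the sketch below; the special-token strings and function names are placeholders I'm assuming (the paper uses learned embeddings for its start, delimiter, and extract tokens):

```python
# Hypothetical special tokens marking sequence structure.
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"

def transform_entailment(premise_tokens, hypothesis_tokens):
    """Turn a structured (premise, hypothesis) pair into one ordered token sequence."""
    return [START] + premise_tokens + [DELIM] + hypothesis_tokens + [EXTRACT]

def transform_similarity(text_a_tokens, text_b_tokens):
    """Similarity has no inherent ordering, so both orderings are processed and their
    final sequence representations are added before the classifier."""
    return (
        [START] + text_a_tokens + [DELIM] + text_b_tokens + [EXTRACT],
        [START] + text_b_tokens + [DELIM] + text_a_tokens + [EXTRACT],
    )
```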
  • Technical details:
    • Byte-pair encodings
    • GELU activation
    • Cosine annealing of the learning rate (this and GELU are sketched after this list)
    • BookCorpus dataset: contains long stretches of contiguous text.
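Sketches of two of these details, GELU and the learning-rate schedule (byte-pair encoding, a subword scheme that iteratively merges the most frequent symbol pairs, is omitted here). The max learning rate of 2.5e-4 and 2000 warmup updates are the paper's reported hyperparameters; the exact schedule function below is my reconstruction:

```python
import math

def gelu(x):
    """tanh approximation of the Gaussian Error Linear Unit; the exact form is x * Phi(x)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def cosine_annealed_lr(step, total_steps, max_lr=2.5e-4, warmup_steps=2000):
    """Learning rate with linear warmup followed by cosine annealing to zero."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))
```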
  • Impact of transferring more layers: each additional transferred transformer layer provides further benefits, up to 9% for full transfer (on MultiNLI).
  • Zero-shot behavior: the underlying generative model learns to perform many of the tasks in order to improve its language-modeling capability, and the Transformer's attentional memory assists transfer relative to LSTMs (the sentiment heuristic is sketched below).
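As a concrete instance of this zero-shot behavior, the paper scores sentiment by appending the token "very" to each example and comparing the LM's probabilities for the continuations "positive" vs. "negative"; a rough sketch under an assumed `lm_log_prob` scoring interface:

```python
def zero_shot_sentiment(lm_log_prob, review_tokens):
    """Zero-shot sentiment using only the pre-trained LM (no fine-tuning).

    lm_log_prob(context_tokens, next_token) -> log P(next_token | context) is a
    hypothetical interface to the language model.
    """
    context = review_tokens + ["very"]
    return "positive" if lm_log_prob(context, "positive") > lm_log_prob(context, "negative") else "negative"
```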
  • Ablation studies:
    • the auxiliary LM objective helps on larger datasets, but not on smaller ones
    • the Transformer architecture enables better performance than an LSTM
    • without pre-training, performance is much worse (a 14.8% decrease)
  • Questions
    • What’s GLUE?
    • Matthews correlation and Pearson correlation?
    • Byte-pair encodings?