The precursor to GPT-2

  • Addresses a range of NLU tasks; strong results come from generative pre-training of a language model on a large, diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task.
  • A semi-supervised approach for language understanding tasks that combines unsupervised pre-training with supervised fine-tuning.
  • Goal: learn a universal representation that transfers with little adaptation to a wide range of tasks.
  • Use an LM objective on unlabeled data to learn the initial parameters of the network, then adapt these parameters to a target task using the corresponding supervised objective (both objectives are written out below).
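For reference, the two objectives, written in the note's own notation rather than the paper's $L_1$/$L_2$ symbols (the exact subscripts are my paraphrase):

```latex
% Unsupervised pre-training: left-to-right LM likelihood over unlabeled tokens u_1, ..., u_n,
% with context window k and model parameters \Theta.
L_{LM}(D_{unlabelled}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)

% Supervised fine-tuning: predict the label y from the final transformer activation of input x^1 ... x^m.
L_{finetune}(D_{labelled}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)
```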
  • Use a Transformer model: more structured memory for handling long-term dependencies than an LSTM. Specifically, a multi-layer Transformer decoder (one block is sketched below).
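A minimal PyTorch sketch of one such decoder block (causally masked self-attention followed by an MLP); the 768-dim / 12-head sizes match the paper's reported configuration, but the module itself is an illustration of the architecture family, not the paper's code:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One block of a decoder-only Transformer: masked self-attention + position-wise MLP."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        seq_len = x.size(1)
        # Causal mask: True entries are positions a query may NOT attend to (the future).
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.ln1(x + attn_out)
        x = self.ln2(x + self.mlp(x))
        return x
```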
  • Use an auxiliary training objective when training on the supervised task.
  • Final fine-tuning objective $L_{final}(D_{labelled}) = L_{finetune}(D_{labelled}) + \lambda \cdot L_{LM}(D_{labelled})$ (code sketch below)
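A minimal sketch of that combined loss during fine-tuning, assuming a PyTorch-style `model` that returns both task logits and per-position next-token logits (the model interface and tensor shapes are assumptions, not the paper's code; λ = 0.5 is the value the paper reports):

```python
import torch.nn.functional as F

def finetune_loss(model, tokens, labels, lm_lambda=0.5):
    """L_final = L_finetune + lambda * L_LM, computed on labelled data.

    tokens: (batch, seq_len) token ids of the transformed input sequence
    labels: (batch,) task labels
    """
    # Hypothetical model interface: (batch, n_classes) task logits and
    # (batch, seq_len, vocab) language-modeling logits.
    task_logits, lm_logits = model(tokens)

    # Supervised fine-tuning loss on the task labels.
    loss_finetune = F.cross_entropy(task_logits, labels)

    # Auxiliary LM loss: predict token t+1 from positions <= t (shift targets by one).
    loss_lm = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
    return loss_finetune + lm_lambda * loss_lm
```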
  • To transfer, they transform the inputs for different tasks: structured inputs are converted into a single ordered token sequence with special start, delimiter, and extract tokens (example below).
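For example, the entailment and similarity transformations might look like the sketch below; the special-token strings and function names are placeholders I'm assuming (the paper uses learned embeddings for its start, delimiter, and extract tokens):

```python
# Hypothetical special tokens marking sequence structure.
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"

def transform_entailment(premise_tokens, hypothesis_tokens):
    """Turn a structured (premise, hypothesis) pair into one ordered token sequence."""
    return [START] + premise_tokens + [DELIM] + hypothesis_tokens + [EXTRACT]

def transform_similarity(text_a_tokens, text_b_tokens):
    """Similarity has no inherent ordering, so both orderings are processed and their
    final sequence representations are added before the classifier."""
    return (
        [START] + text_a_tokens + [DELIM] + text_b_tokens + [EXTRACT],
        [START] + text_b_tokens + [DELIM] + text_a_tokens + [EXTRACT],
    )
```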
  • Technical details:
    • Byte-pair encodings
    • GELU activation
    • Cosine annealing of the learning rate (this and GELU are sketched after this list)
    • BookCorpus dataset: contains long stretches of contiguous text.
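Sketches of two of these details, GELU and the learning-rate schedule (byte-pair encoding, a subword scheme that iteratively merges the most frequent symbol pairs, is omitted here). The max learning rate of 2.5e-4 and 2000 warmup updates are the paper's reported hyperparameters; the exact schedule function below is my reconstruction:

```python
import math

def gelu(x):
    """tanh approximation of the Gaussian Error Linear Unit; the exact form is x * Phi(x)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def cosine_annealed_lr(step, total_steps, max_lr=2.5e-4, warmup_steps=2000):
    """Learning rate with linear warmup followed by cosine annealing to zero."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))
```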
  • Impact of transferring more layers: each additional transferred transformer layer provides further benefits, up to 9% for full transfer (on MultiNLI).
  • Zero-shot behavior: the underlying generative model learns to perform many of the tasks in order to improve its language-modeling capability, and the Transformer's attentional memory assists transfer relative to LSTMs (the sentiment heuristic is sketched below).
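As a concrete instance of this zero-shot behavior, the paper scores sentiment by appending the token "very" to each example and comparing the LM's probabilities for the continuations "positive" vs. "negative"; a rough sketch under an assumed `lm_log_prob` scoring interface:

```python
def zero_shot_sentiment(lm_log_prob, review_tokens):
    """Zero-shot sentiment using only the pre-trained LM (no fine-tuning).

    lm_log_prob(context_tokens, next_token) -> log P(next_token | context) is a
    hypothetical interface to the language model.
    """
    context = review_tokens + ["very"]
    return "positive" if lm_log_prob(context, "positive") > lm_log_prob(context, "negative") else "negative"
```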
  • Ablation studies:
    • the auxiliary LM objective helps on larger datasets, but not on smaller ones
    • the Transformer architecture enables better performance than an LSTM
    • without pre-training, performance is much worse (a 14.8% decrease)
  • Questions
    • What’s GLUE?
    • Matthews correlation and Pearson correlation?
    • Byte-pair encodings?