GPT-1
The precursor to GPT-2
- Targets a range of NLU tasks; good results come from generative pre-training of a language model on a large, diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task.
- Semi-supervised approach for language understanding tasks that combines unsupervised pre-training with supervised fine-tuning.
- Goal: learn a universal representation that transfers with little adaptation to a wide range of tasks.
- Use an LM objective on unlabeled data to learn the initial parameters of the network, then adapt these parameters to a target task using the corresponding supervised objective.
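In the paper's notation (with its $L_1$ written here as $L_{LM}$ to match the fine-tuning formula below), the unsupervised objective is standard left-to-right language modeling over an unlabeled corpus $\mathcal{U} = \{u_1, \dots, u_n\}$ with context window $k$:

$$L_{LM}(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)$$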
- Use a Transformer model: it provides more structured memory for handling long-term dependencies than an LSTM. Specifically, a multi-layer Transformer decoder.
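A minimal sketch of one such decoder block (masked self-attention plus a position-wise feed-forward with GELU, per the technical details below); the sizes and layer-norm placement are illustrative assumptions, not necessarily the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Masked self-attention + position-wise feed-forward (GELU).
    Illustrative sketch; sizes/norm placement are assumptions, not the paper's exact config."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),                      # GELU activation (see technical details)
            nn.Linear(d_ff, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # Causal mask: position i may only attend to positions <= i.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.ln1(x + self.drop(attn_out))
        x = self.ln2(x + self.drop(self.ff(x)))
        return x
```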
- Use an auxiliary language-modeling objective when training on the supervised task.
- Final fine-tuning objective: $L_{final}(D_{labelled}) = L_{finetune}(D_{labelled}) + \lambda \cdot L_{LM}(D_{labelled})$
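A sketch of what this combined objective could look like in training code; `model.transformer`, `model.task_head`, `model.lm_head`, and the batch fields are hypothetical placeholders, while the default weight follows the paper's reported $\lambda = 0.5$:

```python
import torch.nn.functional as F

def finetune_loss(model, batch, lam=0.5):
    """L_final = L_finetune + lambda * L_LM on a labelled batch (sketch).
    The model attributes and batch fields are hypothetical placeholders."""
    hidden = model.transformer(batch["tokens"])            # (B, T, d_model)
    # Supervised loss: task head on the final position's representation.
    logits = model.task_head(hidden[:, -1, :])
    l_finetune = F.cross_entropy(logits, batch["labels"])
    # Auxiliary LM loss: predict the next token at every position of the
    # same labelled sequences.
    lm_logits = model.lm_head(hidden[:, :-1, :])
    l_lm = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                           batch["tokens"][:, 1:].reshape(-1))
    return l_finetune + lam * l_lm
```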
- To transfer, they transform the inputs for different tasks: structured inputs are converted into an ordered sequence of tokens that the pre-trained model can process.
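A sketch of these traversal-style transformations; the special-token strings below are placeholder names for the learned start, delimiter, and extract tokens:

```python
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"  # placeholder names for the learned special tokens

def format_classification(text):
    return f"{START} {text} {EXTRACT}"

def format_entailment(premise, hypothesis):
    # Premise and hypothesis concatenated with a delimiter token in between.
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

def format_multiple_choice(context, answers):
    # One sequence per candidate answer; each is scored independently and the
    # per-candidate scores are normalized with a softmax.
    return [f"{START} {context} {DELIM} {a} {EXTRACT}" for a in answers]
```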
- Technical details:
- Byte-pair encodings (see the toy merge sketch below)
- GELU activation
- Cosine annealing of the learning rate
- BookCorpus dataset: contains long stretches of contiguous text, which lets the generative model learn to condition on long-range information.
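A toy sketch of the byte-pair-encoding merge loop mentioned in the technical details (and in the questions below): repeatedly merge the most frequent adjacent symbol pair. This is the generic BPE learner, not GPT-1's exact tokenizer:

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges on a toy corpus. `word_freqs` maps a word, given as a
    tuple of symbols (characters plus an end-of-word marker), to its frequency.
    Generic BPE learner for illustration, not GPT-1's exact tokenizer."""
    vocab = dict(word_freqs)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with its merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# e.g. learn_bpe({("l", "o", "w", "</w>"): 5, ("l", "o", "w", "e", "r", "</w>"): 2}, 3)
```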
- Impact of transferring more layers: each additional transformer layer transferred provides further benefits, up to 9% for full transfer.
- Zero-shot behavior: the underlying generative model learns to perform many of the tasks in order to improve its language-modeling capability, and the Transformer's attentional memory assists transfer.
- Ablation studies:
- larger datasets benefit from the auxiliary objective, while smaller ones do not
- transformer architecture enables better performance compared to an LSTM
- without pre-training, performance is much worse (a 14.8% decrease)
- Questions
- What’s GLUE?
- Matthews correlation and Pearson correlation?
- Byte-pair encodings?