What is all the hype about?

  • A large language model trained on a large, diverse dataset performs well across many different domains and tasks
  • Motivates multi-task learning and explains why narrow, single-task training can produce brittle systems that generalise poorly
  • Language modelling is presented as a general method of transfer: downstream tasks are handled zero-shot, with no parameter or architecture changes
  • A language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement
  • In effect, the LM performs unsupervised multi-task learning; a prompting sketch of this idea follows the questions below
  • Even at 1.5B parameters, GPT-2 still underfits the WebText training data
  • Zero-shot, GPT-2 achieves state of the art on 7 out of 8 language modelling datasets
  • Zero-shot, it also performs reasonably well on a range of other tasks, with results spanning from near-baseline to state of the art
  • Questions
    • Perplexity? (see the worked example after this list)
    • Byte Pair Encoding? (see the BPE sketch after this list)
    • Transformer model?
    • Bloom filters?
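
The core claim above (tasks are "demonstrated in natural language sequences") cashes out as zero-shot prompting: the task is specified purely in the input text, e.g. appending "TL;DR:" for summarisation, and the model's continuation is read off as the answer. A minimal sketch, assuming the Hugging Face transformers library and the public `gpt2` checkpoint (not the paper's own code or its full 1.5B-parameter model):

```python
# Minimal sketch: zero-shot task specification via a natural-language prompt.
# Assumes the Hugging Face `transformers` library and the public "gpt2" checkpoint;
# the paper itself used its own 1.5B-parameter model, not this exact API.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# The "task" is conveyed only by the prompt text, as in the paper's
# translation ("english sentence = french sentence") and summarisation
# ("TL;DR:") setups.
article = "Some long news article text ..."
prompt = article + "\nTL;DR:"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=40, do_sample=False)

# Decode only the continuation, i.e. the model's "answer" to the prompted task.
print(tokenizer.decode(output_ids[0][input_ids.shape[1]:]))
```

Greedy decoding is used here only to keep the sketch deterministic; it is not meant to reproduce the paper's per-task decoding settings.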
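On the perplexity question: perplexity is the exponential of the average per-token negative log-likelihood the model assigns to held-out text, so lower is better; the language-modelling results above are largely reported in these terms (some datasets use accuracy or bits-per-character instead). A minimal worked example with made-up log-probabilities:

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
# `token_log_probs` stands in for the model's log p(token | context) on a
# held-out text; the values below are made up for illustration.
token_log_probs = [-2.3, -0.7, -1.5, -3.1, -0.2]

nll = -sum(token_log_probs) / len(token_log_probs)
perplexity = math.exp(nll)
print(f"perplexity = {perplexity:.2f}")
```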
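On the Byte Pair Encoding question: GPT-2's input representation is a byte-level variant of BPE, which builds a subword vocabulary by repeatedly merging the most frequent adjacent pair of symbols. A minimal sketch of a single merge step at the character level (GPT-2's byte-level details and merge rules are omitted):

```python
from collections import Counter

# One BPE merge step on a toy corpus: count adjacent symbol pairs,
# pick the most frequent pair, and fuse it into a single new symbol.
# Character-level for readability; GPT-2 runs the same idea over raw bytes.
words = [list("lower"), list("lowest"), list("newer"), list("wider")]

pair_counts = Counter()
for word in words:
    for a, b in zip(word, word[1:]):
        pair_counts[(a, b)] += 1

best = pair_counts.most_common(1)[0][0]  # most frequent adjacent pair

def merge(word, pair):
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])  # fuse the pair into one symbol
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

words = [merge(w, best) for w in words]
print(best, words)
```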