Transformer: Attention Is All You Need
Let’s get transforming!
- New sequence transduction model solely based on attention mechanisms
- Goal: superior translation quality + less training time; make training more parallelizable
- Idea: use the attention mechanism to draw global dependencies between input and output; replace recurrent layers with multi-headed self-attention.
- Self-attention relates different positions of a single sequence in order to compute a representation of the sequence.
- Number of operations required to relate two arbitrary input and output positions is constant, unlike recurrent or convolutional architectures, where it grows with the distance between positions
- An attention function maps a query and a set of key-value pairs to an output. The query, keys, values, and output are all vectors.
- The output is computed as a weighted sum of the values, where each value’s weight is given by a compatibility function of the query with the corresponding key.
- $\mathrm{Attention}(q, K, V) = w^\top V$ with $w_i = \mathrm{compat}(q, k_i)$, where the rows of $K$ and $V$ are the key and value vectors
- Rewritten as $\mathrm{Attention}(q, K, V) = \mathrm{compat}(q, K)^\top V$
- This paper uses a softmax over scaled dot products as the compat function: the query–key dot products are divided by $\sqrt{d_k}$ before the softmax (see the sketch below)
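
A minimal NumPy sketch of that scaled dot-product attention (toy shapes, no batching or masking; the function and variable names here are mine, not the paper’s code):

```python
# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # compatibility of each query with each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys; rows sum to 1
    return weights @ V                              # weighted sum of the value vectors

# Toy usage: 3 queries attending over 5 key-value pairs.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 16))
out = scaled_dot_product_attention(Q, K, V)         # shape (3, 16)
```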
- Architecture involves positional encodings, attention sublayers, feed-forward sublayers, residual connections, and layer norms.
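
A quick sketch of the sinusoidal positional encodings (the “sinusoid” from the questions below): $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$ and $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$. The shapes and $d_{model} = 8$ are toy choices of mine:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions get cosine
    return pe

print(positional_encoding(max_len=4, d_model=8).shape)  # (4, 8)
```

These encodings get added to the token embeddings before the first layer; each attention/feed-forward sublayer is then wrapped with the residual connection and layer norm as $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$.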
- Claim: their model reaches state-of-the-art translation quality (BLEU on WMT 2014 English→German and English→French) at substantially less training cost.
- Side Claim: Self-attention could also yield more interpretable results
- Questions
- What’s a sequence transduction model?
- How did they get the scaling factor?
- What’s a BLEU?
- What’s a sinusoid?
- What’s byte-pair encoding?
- Interesting learning-rate scheduler
- What’s multi-headed self attention?
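
A rough stab at the last question, as I understand it: multi-headed self-attention projects the same sequence into several lower-dimensional query/key/value spaces, runs scaled dot-product attention in each head in parallel, then concatenates the head outputs and projects back. In the sketch below, random matrices stand in for the learned projections $W^Q_i, W^K_i, W^V_i, W^O$, and $h = 2$ heads with $d_{model} = 8$ are toy sizes, not the paper’s settings:

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention, same as the earlier sketch.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ V

def multi_head_self_attention(X, h, rng):
    """X: (seq_len, d_model). Self-attention: queries, keys, values all come from X."""
    d_model = X.shape[-1]
    d_k = d_model // h
    heads = []
    for _ in range(h):
        # Each head projects X into its own lower-dimensional query/key/value space.
        W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))
    W_O = rng.normal(size=(h * d_k, d_model))
    # Concatenate the heads and project back to d_model.
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                        # 5 tokens, d_model = 8
out = multi_head_self_attention(X, h=2, rng=rng)   # (5, 8)
```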