Attention is all you need
I also referred to this implementation to understand some of the details.
This is the paper describing the Transformer, a sequence-to-sequence model based entirely on attention. I think it’s best described with pictures.
model overview permalink From this picture, I think the following things need explaining:
embeddings these are learned embeddings that convert the input and output tokens to vectors of the model dimension. In this paper, they actually used the same weight matrix for input embedding, output embedding, and the final linear layer before the final softmax. positional encoding: since there’s no concept of a hidden state or convolution that encodes the order of the inputs, we have to add some information about the position of the tokens. They used a sinusoidal positional encoding that was a function of the position and the dimension. The wavelength for each dimension forms a geometric progression from \(2\pi\) to 10000 times that. the outputs are “shifted right” multi-head attention: see below for a description of multi-head attention. In the encoder-decoder attention layers, \(Q\) comes from the previous masked attention layer and \(K\) and \(V\) come from the output of the encoder. Everywhere else uses self-attention, meaning that \(Q\), \(K\), and \(V\) are all the same. masked multi-head attention: in the self-attention layers in the decoder, we can’t allow positions to attend to positions ahead of themselves, so we set all right-connecting values in the input of the softmax (right after scaling; see the image below) to negative infinity. feed-forward blocks these are two linear transformation with ReLU in between. The transformations are the same across each position, but they are different transformations from layer to layer, as you might expect. add & norm: these are residual connections followed by layer normalization. multi-head attention permalink The “Mask (opt.)” can be ignored because that’s for masked attention, described above.
Read more