I also referred to this implementation to understand some of the details.
This is the paper describing the Transformer, a sequence-to-sequence model based entirely on attention. I think it’s best described with pictures.
model overview

From this picture, I think the following things need explaining:
- embeddings: these are learned embeddings that convert the input and output tokens to vectors of the model dimension. In this paper, they actually used the same weight matrix for the input embedding, the output embedding, and the final linear layer before the final softmax.
- positional encoding: since there's no concept of a hidden state or convolution that encodes the order of the inputs, we have to add some information about the position of the tokens. They used a sinusoidal positional encoding that is a function of the position and the dimension. The wavelength for each dimension forms a geometric progression from \(2\pi\) to \(10000 \cdot 2\pi\).
- the outputs are "shifted right"
- multi-head attention: see below for a description of multi-head attention. In the encoder-decoder attention layers, \(Q\) comes from the previous masked attention layer and \(K\) and \(V\) come from the output of the encoder. Everywhere else uses self-attention, meaning that \(Q\), \(K\), and \(V\) are all the same.
- masked multi-head attention: in the self-attention layers in the decoder, we can't allow positions to attend to positions ahead of themselves, so we set all right-connecting values in the input of the softmax (right after scaling; see the image below) to negative infinity. There's a code sketch of this below.
- feed-forward blocks: these are two linear transformations with a ReLU in between. The transformation is the same across each position, but it differs from layer to layer, as you might expect.
- add & norm: these are residual connections followed by layer normalization.

multi-head attention

The "Mask (opt.)" can be ignored because that's for masked attention, described above.
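To make two of those pieces concrete, here's a small NumPy sketch (my own, not code from the paper) of the sinusoidal positional encoding and of scaled dot-product attention with the optional causal mask; the shapes and names (seq_len, d_model) are my choices.

```python
# A minimal sketch of sinusoidal positional encoding and masked scaled
# dot-product attention. Illustrative only; not the paper's implementation.
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # (1, d_model/2), even dimensions
    angles = pos / np.power(10000, i / d_model)  # wavelengths from 2*pi up to 10000*2*pi
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V, with disallowed positions set to -inf before the softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (len_q, len_k)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf) # mask == False -> cannot attend there
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Decoder-style self-attention: position i may only attend to positions j <= i.
seq_len, d_model = 5, 16
x = np.random.randn(seq_len, d_model) + positional_encoding(seq_len, d_model)
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
out = scaled_dot_product_attention(x, x, x, mask=causal_mask)  # self-attention: Q = K = V
print(out.shape)  # (5, 16)
```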
The B is for bidirectional, and that's a big deal. It makes it possible to do well on both sentence-level tasks (NLI, question answering) and token-level tasks (NER, POS tagging). In a unidirectional model, the word "bank" in a sentence like "I made a bank deposit." has only "I made a" as its context, withholding useful information from the model.
Another cool thing is masked language model training (MLM). They train the model by blanking certain words in the sentence and asking the model to guess the missing word.
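As an illustration of the masking step, here's a minimal sketch (my own, not BERT's preprocessing code). The 15% masking rate and the [MASK] token come from the BERT paper; everything else here is a placeholder.

```python
# A rough sketch of masked-LM data preparation: blank out some tokens and
# keep the originals as labels so the model can be asked to guess them.
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Replace a random subset of tokens with [MASK]; return (inputs, labels).

    labels[i] holds the original token where a mask was applied, else None,
    so the loss is only computed on the blanked-out positions.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK_TOKEN)
            labels.append(tok)      # the model must guess this word
        else:
            inputs.append(tok)
            labels.append(None)     # position not scored
    return inputs, labels

print(mask_tokens("I made a bank deposit .".split(), seed=0))
```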
This model was superseded by this one.
They did some careful things with residual connections to make the model highly parallelizable, and they put each LSTM layer on a separate GPU. They trained using full floating-point computations with a couple of restrictions, and then converted the trained models to quantized versions for inference.
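As a rough illustration of that convert-after-training idea (a generic sketch, not the exact quantization scheme from the paper), one can clip the trained float weights to a fixed range and map them linearly onto 8-bit integers:

```python
# Generic post-training quantization sketch: train in floating point, then
# clip weights to a fixed range and map them to int8 for inference.
import numpy as np

def quantize_int8(weights, clip=1.0):
    """Clip to [-clip, clip] and map linearly onto int8 values in [-127, 127]."""
    clipped = np.clip(weights, -clip, clip)
    scale = clip / 127.0
    q = np.round(clipped / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)      # trained float weights
q, scale = quantize_int8(w)
print(np.max(np.abs(dequantize(q, scale) - np.clip(w, -1, 1))))  # small rounding error
```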
They use the word-piece model from “Japanese and Korean Voice Search”, with 32,000 word pieces. (This is a lot less than the 200,000 used in that paper.) They state in the paper that the shared word-piece model is very similar to Byte-Pair-Encoding, which was used for NMT in this paper by researchers at U of Edinburgh.
The model and training process are exactly as in Google’s earlier paper. It takes 3 weeks on 100 GPUs to train, even after increasing batch size and learning rate.
This was mentioned in the paper on Google's Multilingual Neural Machine Translation System. It's regarded as the original paper to use the word-piece model, which is the focus of my notes here.
the WordPieceModel

Here's the WordPieceModel algorithm:
```
func WordPieceModel(D, chars, n, threshold) -> inventory:
    # D: training data
    # n: user-specified number of word units (often 200k)
    # chars: unicode characters used in the language
    #        (e.g. Kanji, Hiragana, Katakana, ASCII for Japanese)
    # threshold: stopping criterion for likelihood increase
    # inventory: the set of word units created by the model
    inventory := chars
    likelihood := +INF
    while len(inventory) < n && likelihood >= threshold:
        lm := LM(inventory, D)
        inventory += argmax_{combined word unit}(lm.likelihood_{inventory + combined word unit}(D))
        likelihood = lm.likelihood_{inventory}(D)
    return inventory
```

The algorithm can be optimized by
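For a runnable toy version of that loop, here's a sketch in Python. The real selection step chooses the combined word unit that most increases a language model's likelihood of the training data; since that requires evaluating an LM per candidate, this sketch substitutes BPE's pair-frequency criterion, which (as noted above) is very similar. Names like word_piece_like and min_pair_count are my own.

```python
# Toy greedy merge loop in the spirit of the WordPieceModel pseudocode above.
# Selection criterion is pair frequency (as in BPE) rather than LM likelihood.
from collections import Counter

def word_piece_like(corpus_words, n_units, min_pair_count=2):
    # Start the inventory from the individual characters in the data.
    words = [list(w) for w in corpus_words]
    inventory = {c for w in words for c in w}
    while len(inventory) < n_units:
        # Count adjacent pairs of current word units.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < min_pair_count:   # stand-in for the likelihood threshold
            break
        merged = a + b
        inventory.add(merged)
        # Re-segment every word with the new combined unit.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return inventory

print(sorted(word_piece_like(["low", "lower", "lowest", "newer", "wider"], n_units=12)))
```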