papers

These are my notes from research papers I read. Each page’s title is also a link to the abstract or PDF.

The consciousness prior

System 1 cognitive abilities are about low-level perception and intuitive knowledge. System 2 cognitive abilities can be described verbally, and include things like reasoning, planning, and imagination. In cognitive neuroscience, the “Global Workspace Theory” says that at each moment specific pieces of information become a part of working memory and become globally available to other unconscious computational processes. Relative to the unconscious state, the conscious state is low-dimensional, focusing on a few things. The paper proposes we use an attention mechanism (in the sense of the Bahdanau 2015 paper) to produce the conscious state, and then a VAE or conditional GAN to produce the output from the conscious state.
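Here is a minimal numpy sketch (my own illustration, not code from the paper) of the core idea: a Bahdanau-style soft attention mechanism picks out a low-dimensional "conscious" summary from a high-dimensional "unconscious" representation. All names and dimensions are made up.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical dimensions: the "unconscious" representation h has many
# elements; the "conscious" state c attends to only a few of them.
rng = np.random.default_rng(0)
n_elements, d_model = 128, 64
h = rng.standard_normal((n_elements, d_model))

# Bahdanau-style additive attention scores (illustrative parameters).
W = rng.standard_normal((d_model, d_model)) * 0.1
v = rng.standard_normal(d_model) * 0.1
scores = np.tanh(h @ W) @ v      # one score per element
weights = softmax(scores)        # concentrates mass on a few elements

# The conscious state: a small weighted summary of the full representation,
# which the paper would then feed into e.g. a VAE or conditional GAN.
c = weights @ h                  # shape (d_model,)
print(c.shape, weights.argmax())
```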
Read more

Troubling trends in machine learning scholarship

The authors discuss four trends in AI research that have negative consequences for the community. The first is failing to distinguish explanation from speculation. It's important to allow researchers to include speculation, because speculation is what allows ideas to form, but a paper has to carefully couch speculation inside a "Motivations" section or similar verbiage to ensure the reader understands its place. It's also extremely important to define concepts before using them: terms like "internal covariate shift" or "coverage" sound like definitions without actually being such.
Read more

Attention is all you need

This is the paper describing the Transformer, a sequence-to-sequence model based entirely on attention. (I also referred to this implementation to understand some of the details.) I think it's best described with pictures.

model overview

From the architecture diagram, I think the following things need explaining:

- embeddings: these are learned embeddings that convert the input and output tokens to vectors of the model dimension. In this paper, they actually used the same weight matrix for the input embedding, the output embedding, and the final linear layer before the final softmax.
- positional encoding: since there's no hidden state or convolution that encodes the order of the inputs, we have to add some information about the position of the tokens. They used a sinusoidal positional encoding that is a function of the position and the dimension; the wavelength for each dimension forms a geometric progression from \(2\pi\) to 10000 times that.
- the outputs are "shifted right".
- multi-head attention: see below for a description of multi-head attention. In the encoder-decoder attention layers, \(Q\) comes from the previous masked attention layer and \(K\) and \(V\) come from the output of the encoder. Everywhere else uses self-attention, meaning that \(Q\), \(K\), and \(V\) are all the same.
- masked multi-head attention: in the self-attention layers in the decoder, we can't allow positions to attend to positions ahead of themselves, so we set all right-connecting values in the input of the softmax (right after scaling) to negative infinity.
- feed-forward blocks: these are two linear transformations with a ReLU in between. The transformations are the same across each position, but they are different from layer to layer, as you might expect.
- add & norm: these are residual connections followed by layer normalization.

multi-head attention

The "Mask (opt.)" step in the attention diagram can be ignored except in masked attention, described above.
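To make the attention step concrete, here is a minimal numpy sketch of scaled dot-product attention with the decoder-style causal mask, plus the sinusoidal positional encoding. It's a toy sketch with shapes and names of my own choosing (and it omits the per-head projections of real multi-head attention), not the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, len_q, len_k)
    if mask is not None:
        # Masked positions are set to -inf right after scaling,
        # so the softmax gives them zero weight.
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ V

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); odd dimensions use cos."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Decoder self-attention: a lower-triangular mask keeps each position
# from attending to positions to its right.
batch, seq_len, d_model = 2, 5, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((batch, seq_len, d_model)) + sinusoidal_positional_encoding(seq_len, d_model)
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))[None, :, :]
out = scaled_dot_product_attention(x, x, x, mask=causal_mask)  # self-attention: Q = K = V
print(out.shape)  # (2, 5, 8)
```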
Read more

BERT: pre-training of deep bidirectional transformers for language understanding

The B is for bidirectional, and that's a big deal. It makes it possible to do well on both sentence-level tasks (NLI, question answering) and token-level tasks (NER, POS tagging). In a unidirectional model, the word "bank" in a sentence like "I made a bank deposit." has only "I made a" as its context, which withholds useful information from the model. Another cool thing is masked language model (MLM) pre-training: they train the model by blanking out certain words in the sentence and asking the model to guess the missing words.
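A rough sketch of the masking step (my own toy code, not the released BERT implementation). It follows the paper's recipe of selecting 15% of tokens for prediction and, of those, replacing 80% with [MASK], 10% with a random token, and leaving 10% unchanged.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, targets); targets is None where no prediction is made."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)                      # the model must guess this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(mask_token)         # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the original token
        else:
            corrupted.append(tok)
            targets.append(None)                     # no loss computed here
    return corrupted, targets

tokens = "i made a bank deposit".split()
vocab = ["i", "made", "a", "bank", "deposit", "river", "money"]
print(mask_tokens(tokens, vocab))
```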
Read more

Compositional generalization by factorizing alignment and translation

They used a biRNN with attention to encode the alignment, and a separate single linear function of each one-hot encoded word to encode that word on its own. Their reasoning was that by separating the alignment from the meanings of individual words, the model could more easily generalize to unseen words.
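A minimal sketch of the factorization as I understand it (names and shapes are mine, not the authors'): the attention weights come from the alignment pathway, and each output is an attention-weighted mix of per-word translation vectors, where each translation vector is just a linear function of that word's one-hot encoding and never sees context.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
src_vocab, tgt_vocab, src_len, tgt_len = 50, 40, 6, 4

# Translation pathway: a single linear map of each one-hot source word,
# i.e. one row of W_trans per source word, independent of context.
W_trans = rng.standard_normal((src_vocab, tgt_vocab)) * 0.1
src_ids = rng.integers(0, src_vocab, size=src_len)
word_translations = W_trans[src_ids]               # (src_len, tgt_vocab)

# Alignment pathway: stand-in for the biRNN-with-attention alignment scores
# (random here, since only the factorization itself is being illustrated).
align_scores = rng.standard_normal((tgt_len, src_len))
align_weights = softmax(align_scores, axis=-1)     # one distribution per output position

# Each output distribution mixes the per-word translations by alignment weight,
# so alignment and word meaning interact only through this weighted sum.
output_logits = align_weights @ word_translations  # (tgt_len, tgt_vocab)
print(softmax(output_logits).shape)
```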