These are my notes from research papers I read. Each page’s title is also a link to the abstract or PDF.
To read:
Google AI: optimizing multiple loss functions
Google AI: reducing gender bias in Google Translate
Zoom In: An Introduction to Circuits
Google AI: Neural Tangents
Google AI: TensorFlow Quantum
SLIDE (fast CPU training)
Google AI: Reformer
lottery ticket initialization
Google AI: out-of-distribution detection
Large-Scale Multilingual Speech Recognition with E2E model
E2E ASR from raw waveform
Machine Theory of Mind
Normalizing Flows
Glow networks
A Theory of Local Learning, the Learning Channel, and the Optimality of Backpropagation
Why and When Deep Networks Avoid the Curse of Dimensionality
Diversity is All You Need (Learning Skills without a Reward Function)
World Models
Relational inductive biases, deep learning, and graph networks
Loss Surfaces of Multilayer Networks
Visualizing the Loss Landscape of Neural Nets
The Matrix Calculus You Need for Deep Learning
Group Normalization
Layer Normalization
Artificial Intelligence Meets Natural Stupidity
Qualitatively characterizing neural network optimization problems
Strong Inference
A learning algorithm for continually running fully recurrent neural networks
Adaptive multi-level hyper-gradient descent
Rotate your networks: better weight consolidation and less catastrophic forgetting
Attention is not all you need
When BERT plays the lottery, all tickets are winning
Recent contextual word embeddings (e.g. ELMo) have been shown to be much better than “static” embeddings (where there’s a one-to-one mapping from token to representation). This paper is exciting because the authors were able to create a multilingual embedding space built on contextual word embeddings.
Each token will have a “point cloud” of embedding values, one point for each context containing the token. They define the embedding anchor as the average of all those points for a particular token.
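To make the anchor definition concrete for myself, here is a minimal sketch that averages the point cloud for each token. The function name `embedding_anchors` and the toy vectors are my own illustration, not something taken from the paper.

```python
def embedding_anchors(tokens, vectors):
    """Average the contextual vectors of every occurrence of each token.

    tokens:  list of token strings, one entry per occurrence in the corpus
    vectors: list of equal-length float lists, the contextual embedding
             of the corresponding occurrence
    """
    sums, counts = {}, {}
    for tok, vec in zip(tokens, vectors):
        if tok not in sums:
            sums[tok] = [0.0] * len(vec)
            counts[tok] = 0
        # Accumulate this occurrence into the token's running sum.
        sums[tok] = [s + v for s, v in zip(sums[tok], vec)]
        counts[tok] += 1
    # The anchor is the mean of the token's point cloud.
    return {tok: [s / counts[tok] for s in sums[tok]] for tok in sums}


# Toy data: "bank" appears in two contexts, "river" in one.
tokens = ["bank", "river", "bank"]
vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]]
anchors = embedding_anchors(tokens, vectors)
print(anchors["bank"])   # [2.0, 2.0]
print(anchors["river"])  # [0.0, 2.0]
```

A token that appears in only one context has an anchor equal to its single contextual embedding.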
This is a long paper, so a lot of my writing here is an attempt to condense the discussion. I’ve taken the liberty of pulling exact phrases and structure from the paper without explicitly using quotes.
Our main hypothesis is that deep learning succeeded in part because of a set of inductive biases, but that additional ones should be added in order to go from good in-distribution generalization in highly supervised learning tasks (or where strong and dense rewards are available), such as object recognition in images, to strong out-of-distribution generalization and transfer learning to new tasks with low sample complexity.
BERT optimizes the Masked Language Model (MLM) objective by masking word pieces uniformly at random in its training data and attempting to predict the masked values. With SpanBERT, spans of tokens are masked and the model is expected to predict the text in the spans from the representations of the words on the boundary. Span lengths follow a geometric distribution, and span start points are uniformly random.
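A rough sketch of how I picture the span sampling: draw a geometric span length, pick a uniformly random start, and repeat until a masking budget is used up. The defaults here (`mask_budget=0.15`, `p=0.2`, `max_span=10`) are my assumptions for illustration, as is the simplification that spans may overlap or touch; the function name `sample_span_mask` is hypothetical.

```python
import random


def sample_span_mask(seq_len, mask_budget=0.15, p=0.2, max_span=10, rng=None):
    """Return sorted token positions to mask, chosen as contiguous spans."""
    rng = rng or random.Random()
    target = max(1, round(seq_len * mask_budget))
    masked = set()
    while len(masked) < target:
        # Geometric length: keep extending with probability (1 - p),
        # clipped at max_span (assumed clipping value).
        length = 1
        while rng.random() > p and length < max_span:
            length += 1
        # Never exceed the remaining budget.
        length = min(length, target - len(masked))
        # Span start is uniformly random over valid positions.
        start = rng.randrange(0, seq_len - length + 1)
        masked.update(range(start, start + length))
    return sorted(masked)


positions = sample_span_mask(100, rng=random.Random(0))
print(len(positions))  # 15
```

Because spans are capped by the remaining budget, the number of masked positions always equals the target exactly.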
To predict each individual masked token, a two-layer feedforward network was provided with the boundary token representations plus the position embedding of the target token, and the output vector representation was used to predict the masked token and compute cross-entropy loss exactly as in standard MLM.
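Here is a minimal sketch of that prediction head as I understand it: concatenate the two boundary representations with the target's position embedding, pass the result through two feedforward layers, then score against the vocabulary with cross-entropy. The GELU activation, the absence of layer normalization, the separate vocabulary projection `W_vocab`, the tiny dimensions, and all names are my assumptions for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def gelu(x):
    """Exact GELU applied elementwise (assumed activation)."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]


def sbo_loss(left, right, pos, W1, W2, W_vocab, target_id):
    """Cross-entropy of the target token given the span boundary inputs."""
    h0 = left + right + pos        # concatenate the three input vectors
    h1 = gelu(matvec(W1, h0))      # first feedforward layer
    y = gelu(matvec(W2, h1))       # second feedforward layer
    logits = matvec(W_vocab, y)    # score every vocabulary item
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_id]  # negative log softmax probability


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, hidden, vocab = 4, 8, 5
left = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # token before the span
right = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # token after the span
pos = [rng.gauss(0.0, 1.0) for _ in range(dim)]    # target position embedding
W1 = rand_matrix(hidden, 3 * dim, rng)
W2 = rand_matrix(dim, hidden, rng)
W_vocab = rand_matrix(vocab, dim, rng)
loss = sbo_loss(left, right, pos, W1, W2, W_vocab, target_id=2)
```

Note that nothing inside the span is visible to this head; only the boundary tokens and the relative position of the target are.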
This is the original paper introducing Embeddings from Language Models (ELMo).
Unlike most widely used word embeddings, ELMo word representations are functions of the entire input sentence.
That’s what makes ELMo great: they’re contextualized word representations, meaning that they can express multiple possible senses of the same word.
Specifically, ELMo representations are a learned linear combination of all layers of an LSTM encoding. The LSTM undergoes general semi-supervised pretraining, but the linear combination is learned specifically for each task.
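As a sketch of that task-specific combination for a single token: each layer's vector gets a scalar weight and the weighted vectors are summed. Normalizing the weights with a softmax and multiplying by a scale `gamma` are assumptions on my part beyond what these notes say, and `elmo_combine` is a hypothetical name.

```python
import math


def elmo_combine(layers, scalar_logits, gamma=1.0):
    """Weighted sum of per-layer vectors for one token.

    layers:        list of L vectors (one per LSTM layer), equal length
    scalar_logits: list of L task-specific scalars, learned in practice
    gamma:         task-specific scale (assumed)
    """
    # Softmax over the layer scalars so the weights sum to one (assumed).
    m = max(scalar_logits)
    exps = [math.exp(w - m) for w in scalar_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layers[0])
    return [
        gamma * sum(w * layer[i] for w, layer in zip(weights, layers))
        for i in range(dim)
    ]


# Three layers, equal logits: the result is the plain average of the layers.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
combined = elmo_combine(layers, [0.0, 0.0, 0.0])
```

In a real setup the scalars would be trained with the downstream task while the LSTM weights stay fixed.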
In the paper they use Bayes' rule to show that the contribution of the first of two tasks is contained in the posterior distribution of model parameters over the first dataset. This is important because it means we can estimate that posterior to try to get a sense for which model parameters were most important for that first task.
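To spell out the Bayes' rule step for myself, here is the decomposition in my own notation, where \theta are the model parameters and D_A, D_B are the datasets for the first and second task. It assumes the two datasets are conditionally independent given \theta.

```latex
\begin{align}
\log p(\theta \mid D_A, D_B)
  &= \log p(D_B \mid \theta, D_A) + \log p(\theta \mid D_A) - \log p(D_B \mid D_A) \\
  &= \log p(D_B \mid \theta) + \log p(\theta \mid D_A) - \log p(D_B \mid D_A)
\end{align}
```

The first term is the usual log-likelihood for the second task, the last term does not depend on \theta, and everything the first task contributes sits in the middle term, the posterior p(\theta \mid D_A).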
In this paper, they perform that estimation using a multivariate Gaussian distribution.
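Under a Gaussian approximation, the negative log posterior becomes a quadratic penalty that pulls parameters back toward their first-task values, more strongly where the precision is high. The sketch below assumes a diagonal precision (one value per parameter) for simplicity; that restriction, the `strength` factor, and the name `gaussian_penalty` are my assumptions, not details from these notes.

```python
def gaussian_penalty(theta, theta_star, precision_diag, strength=1.0):
    """Quadratic penalty from a diagonal Gaussian centred at theta_star.

    theta:          current parameters while training on the second task
    theta_star:     parameters found after training on the first task
    precision_diag: per-parameter precision (inverse variance), assumed diagonal
    strength:       how much the first task matters relative to the second
    """
    return 0.5 * strength * sum(
        f * (t - s) ** 2
        for t, s, f in zip(theta, theta_star, precision_diag)
    )


theta_star = [1.0, -2.0]
precision = [10.0, 0.1]  # first parameter mattered a lot, second barely
theta = [1.5, -1.0]
penalty = gaussian_penalty(theta, theta_star, precision)
```

Here the first parameter moved less than the second but dominates the penalty, because its precision is a hundred times larger.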