papers
These are my notes from research papers I read. Each page’s title is also a link to the abstract or PDF.
This is the original paper introducing Embeddings from Language Models (ELMo).
Unlike most widely used word embeddings, ELMo word representations are functions of the entire input sentence.
That’s what makes ELMo great: its representations are contextualized, meaning that they can express multiple possible senses of the same word.
Specifically, ELMo representations are a learned linear combination of all layers of an LSTM encoding. The LSTM undergoes general semi-supervised pretraining, but the linear combination is learned separately for each task. It’s been shown that initial layers in LSTM encoders are more representative of syntax, while later layers tend to represent semantics. Letting each task choose its own mix of layers is therefore a key advantage, and it allows ELMo to improve accuracy on tasks ranging from POS tagging to question answering.
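The layer mixing can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the per-layer weights are softmax-normalized and that a single task-specific scalar (`gamma` here) rescales the result. The function names and the tiny hand-written "layer states" are made up for the example.

```python
import math

def softmax(logits):
    """Normalize raw per-layer scores into weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_combine(layer_states, layer_logits, gamma):
    """Mix encoder layers into one vector per token.

    layer_states[j][k] is layer j's vector for token k.
    layer_logits holds one learned score per layer (task-specific).
    gamma is a learned task-specific scale (illustrative name).
    """
    weights = softmax(layer_logits)
    num_tokens = len(layer_states[0])
    dim = len(layer_states[0][0])
    return [
        [
            gamma * sum(w * layer[k][d] for w, layer in zip(weights, layer_states))
            for d in range(dim)
        ]
        for k in range(num_tokens)
    ]

# Toy input: two layers, two tokens, two dimensions per token.
layer_states = [
    [[1.0, 0.0], [0.0, 2.0]],  # layer 0
    [[3.0, 4.0], [2.0, 0.0]],  # layer 1
]
# Equal logits give equal weights, so the result is the plain layer average.
combined = elmo_combine(layer_states, layer_logits=[0.0, 0.0], gamma=1.0)
```

In a real model the logits and `gamma` would be trained with the downstream task while the pretrained encoder stays fixed; a syntax-heavy task could then learn to weight the early layers more.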
In the paper they use Bayes’ rule to show that everything the model learned from the first of two tasks is contained in the posterior distribution over model parameters given the first dataset. This is important because it means we can estimate that posterior to get a sense of which model parameters were most important for that first task.
In this paper, they perform that estimation using a multivariate Gaussian distribution. The means are the values of the model parameters after training on the first dataset, and the precisions (inverse variances) are the diagonal entries of the Fisher information matrix.
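To make the Gaussian approximation concrete: taking the negative log of a diagonal Gaussian with those means and precisions gives a quadratic penalty on how far each parameter moves from its first-task value. The sketch below is a hedged illustration rather than the paper's code. It assumes the Fisher diagonal is estimated as the mean squared per-example gradient of the log-likelihood, and `lam` is a hypothetical knob for how strongly the first task is protected.

```python
def diagonal_fisher(per_example_grads):
    """Estimate the Fisher diagonal as the mean squared gradient per parameter.

    per_example_grads[n][i] is the gradient of the log-likelihood of
    example n with respect to parameter i (assumed precomputed).
    """
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]

def gaussian_penalty(theta, theta_star, fisher_diag, lam=1.0):
    """Negative log of the diagonal Gaussian, up to an additive constant.

    theta_star: parameters after training on the first dataset (the means).
    fisher_diag: per-parameter precisions.
    lam: illustrative weighting factor, not taken from the notes above.
    """
    return 0.5 * lam * sum(
        f * (t - s) ** 2 for f, t, s in zip(fisher_diag, theta, theta_star)
    )

theta_star = [1.0, -2.0, 0.5]
grads = [[0.1, 2.0, 0.0], [-0.1, -2.0, 0.0]]
fisher = diagonal_fisher(grads)  # the second parameter gets the largest precision

# Move every parameter by the same amount: the penalty is dominated by the
# parameter with the largest Fisher value, i.e. the one the first task
# "cared about" most.
shifted = [t + 1.0 for t in theta_star]
penalty = gaussian_penalty(shifted, theta_star, fisher)
```

Adding this penalty to the second task's loss lets unimportant parameters (Fisher value near zero) move freely while keeping the important ones close to their first-task values.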
This is a follow-on to A meta-transfer objective for learning to disentangle causal mechanisms.
Here we describe an algorithm for predicting the causal graph structure of a set of visible random variables, each possibly causally dependent on any of the other variables.
the algorithm
There are two sets of parameters, the structural parameters and the functional parameters. The structural parameters compose a matrix where \(\sigma(\gamma_{ij})\) represents the belief that variable \(X_j\) is a direct cause of \(X_i\). The functional parameters are the parameters of the neural networks that model the conditional probability distribution of each random variable given its parent set.
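As a rough sketch of how the structural parameters can be read, the snippet below turns a matrix of \(\gamma_{ij}\) values into edge beliefs and samples one candidate graph from them. It assumes each edge is drawn as an independent Bernoulli variable and that self-loops are excluded; both choices, and the names `sample_graph` and `gamma`, are illustrative assumptions rather than details stated in the notes above.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_graph(gamma, rng):
    """Sample an adjacency matrix from the structural parameters.

    gamma[i][j] is the structural parameter for the edge X_j -> X_i.
    Entry [i][j] of the result is 1 when X_j is sampled as a direct
    cause (parent) of X_i. Self-loops are skipped (an assumption here).
    """
    n = len(gamma)
    return [
        [
            1 if i != j and rng.random() < sigmoid(gamma[i][j]) else 0
            for j in range(n)
        ]
        for i in range(n)
    ]

# Two variables: a strong belief that X_0 causes X_1, and a strong belief
# that X_1 does not cause X_0.
gamma = [
    [0.0, -50.0],
    [50.0, 0.0],
]
graph = sample_graph(gamma, random.Random(0))
parents_of_x1 = [j for j, edge in enumerate(graph[1]) if edge]
```

Each row of the sampled matrix is a parent set, which is what the functional networks would condition on when modeling that variable.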
Theoretically, models should be able to predict on out-of-distribution data if their understanding of causal relationships is correct. The toy problem they use in this paper is that of predicting temperature from altitude. If a model is trained on data from Switzerland, the model should ideally be able to correctly predict on data from the Netherlands, even though it hasn’t seen elevations that low before.
The main contribution of this paper is that they’ve found that models tend to transfer faster to a new distribution when they learn the correct causal relationships, and when those relationships are sparsely represented, meaning they are represented by relatively few nodes in the network. This allowed them to create a meta-learning objective that trains the model to represent the correct causal dependencies, allowing for improved generalization.
The theoretical value in talking about the parameter-function map is that this map lets us talk about sets of parameters that produce the same function. In this paper they used some recently proven results from algorithmic information theory (AIT) to show that for neural networks the parameter-function map is biased toward functions with low Kolmogorov complexity, meaning that simple functions are more likely to appear given a random choice of parameters. Since real-world problems are also biased toward simple functions, this could explain the generalization/memorization results found by Zhang et al.
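One way to get a feel for the parameter-function map is to sample it directly: draw random parameters for a tiny network, record which Boolean function it computes on every possible input, and count how often each function shows up. The sketch below is a toy experiment under assumptions of my own (a one-hidden-layer ReLU network, standard normal parameters, three binary inputs, thresholding the output at zero); it is not the setup or code from the paper.

```python
import itertools
import random
from collections import Counter

def sample_functions(n_inputs=3, hidden=8, n_samples=2000, seed=0):
    """Count the Boolean functions produced by randomly drawn parameters.

    Each function is identified by its truth table, written as a string
    of 0/1 outputs over all 2**n_inputs binary inputs.
    """
    rng = random.Random(seed)
    inputs = list(itertools.product([0.0, 1.0], repeat=n_inputs))
    counts = Counter()
    for _ in range(n_samples):
        # Draw one random parameter setting for the whole network.
        w1 = [[rng.gauss(0, 1) for _i in range(n_inputs)] for _h in range(hidden)]
        b1 = [rng.gauss(0, 1) for _h in range(hidden)]
        w2 = [rng.gauss(0, 1) for _h in range(hidden)]
        b2 = rng.gauss(0, 1)
        bits = []
        for x in inputs:
            # Hidden layer with ReLU, then a thresholded linear output.
            h = [
                max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(w1, b1)
            ]
            out = sum(w * hi for w, hi in zip(w2, h)) + b2
            bits.append("1" if out > 0 else "0")
        counts["".join(bits)] += 1
    return counts

counts = sample_functions()
most_common = counts.most_common(5)  # (truth table, frequency) pairs
```

If the bias described above holds in this toy setting, a handful of truth tables should account for a large share of the samples while most of the 256 possible functions appear rarely or not at all. Inspecting `most_common` is a quick way to check that for yourself.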