Generalization on Kyle Roth

"Low-resource" text classification: a parameter-free classification method with compressors

Mon, 24 Jul 2023 11:35:13 -0400

I presented this paper in Bang Liu’s research group meeting on 2023-07-24. You can view the slides I used here.

It seems like the authors made a mistake that inflated the scores for the multilingual experiments, according to Ken Schutte.

the effects of scale on worst-group performance

Mon, 18 Jul 2022 15:31:12 -0400

I think it’s valuable to be working in the open whenever possible, so I’m going to keep my research notes here. These notes will hopefully be full of good (and bad) ideas, so if someone borrows a good idea and publishes on it, that’s great!

This post contains my research notes as I try to understand how model scaling affects worst-group performance. This started as a group project in the neural scaling laws course at Mila in winter 2022. We presented about an existing paper and presented our preliminary results in class. The repository for this project is here.

The effect of model size on worst-group generalization

Thu, 17 Mar 2022 14:34:33 -0400

This was a paper we presented about in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. You can view the slides we used here, and the recording here.

Learning explanations that are hard to vary

Tue, 22 Feb 2022 12:29:17 -0500

The big idea here is to use the geometric mean instead of the arithmetic mean across samples in the batch when computing the gradient for SGD. This overcomes the situation where averaging produces optima that are not actually optimal for any individual samples, as demonstrated in their toy example below:

In practice, the method the authors test is not exactly the geometric mean for numerical and performance reasons, but effectively accomplishes the same thing by avoiding optima that are “inconsistent” (meaning that gradients from relatively few samples actually point in that direction).

In search of robust measures of generalization

Mon, 21 Feb 2022 15:33:22 -0500

These authors define robust error as the least upper bound on the expected loss over a family of environmental settings (including dataset, model architecture, learning algorithm, etc.):

\[\sup_{e\in\mathcal F}\mathbb E_{\omega\in P^e}\left[\ell(\phi,\omega)\right]\]

The fact that this is an upper bound and not an average is very important and is what makes this work unique from previous work in this direction. Indeed, what we should be concerned about is not how poorly a model performs on the average sample but on the worst-case sample.

Inductive biases for deep learning of higher-level cognition

Tue, 08 Dec 2020 06:40:48 -0700

This is a long paper, so a lot of my writing here is an attempt to condense the discussion. I’ve taken the liberty to pull exact phrases and structure from the paper without explicitly using quotes.

Our main hypothesis is that deep learning succeeded in part because of a set of inductive biases, but that additional ones should be added in order to go from good in-distribution generalization in highly supervised learning tasks (or where strong and dense rewards are available), such as object recognition in images, to strong out-of-distribution generalization and transfer learning to new tasks with low sample complexity.

Overcoming catastrophic forgetting in neural networks

Thu, 01 Oct 2020 10:47:28 -0600

In the paper they use Bayes’ rule to show that the contribution of the first of two tasks is contained in the posterior distribution of model parameters over the first dataset. This is important because it means we can estimate that posterior to try to get a sense for which model parameters were most important for that first task.

In this paper, they perform that estimation using a multivariate Gaussian distribution. The means are the values of the model parameters after training on the first dataset, and the precision (inverse of variance) is the values of the diagonals along the Fisher information matrix.

Learning neural causal models from unknown interventions

Tue, 22 Sep 2020 10:39:54 -0600

This is a follow-on to A meta-transfer objective for learning to disentangle causal mechanisms

Here we describe an algorithm for predicting the causal graph structure of a set of visible random variables, each possibly causally dependent on any of the other variables.

the algorithm

There are two sets of parameters, the structural parameters and the functional parameters. The structural parameters compose a matrix where \(\sigma(\gamma_{ij})\) represents the belief that variable \(X_j\) is a direct cause of \(X_i\). The functional parameters are the parameters of the neural networks that model the conditional probability distribution of each random variable given its parent set.

A meta-transfer objective for learning to disentangle causal mechanisms

Mon, 21 Sep 2020 08:46:30 -0600

Theoretically, models should be able to predict on out-of-distribution data if their understanding of causal relationships is correct. The toy problem they use in this paper is that of predicting temperature from altitude. If a model is trained on data from Switzerland, the model should ideally be able to correctly predict on data from the Netherlands, even though it hasn’t seen elevations that low before.

The main contribution of this paper is that they’ve found that models tend to transfer faster to a new distribution when they learn the correct causal relationships, and when those relationships are sparsely represented, meaning they are represented by relatively few nodes in the network. This allowed them to create a meta-learning objective that trains the model to represent the correct causal dependencies, allowing for improved generalization.

Deep learning generalizes because the parameter-function map is biased towards simple functions

Tue, 08 Sep 2020 07:29:09 -0600

The theoretical value in talking about the parameter-function map is that this map lets us talk about sets of parameters that produce the same function. In this paper they used some recently proven stuff from algorithmic information theory (AIT) to show that for neural networks the parameter-function map is biased toward functions with low Komolgorov complexity, meaning that simple functions are more likely to appear given random choice of parameters. Since real world problems are also biased toward simple functions, this could explain the generalization/memorization results found by Zhang et al.

A closer look at memorization in deep networks

Mon, 31 Aug 2020 11:52:35 -0600

This paper builds on what we learned in “Understanding deep learning requires rethinking generalization”. In that paper they showed that DNNs are able to fit pure noise in the same amount of time as it can fit real data, which means that our optimization algorithm (SGD, Adam, etc.) is not what’s keeping DNNs from overfitting.

experiments for detecting easy/hard samples

It looks like there are qualitative differences between a DNN that has memorized some data and a DNN that has seen real data. In experiments they found that real datasets contain “easy examples” that are more quickly learned than the hard examples. This is not the case for random data.

Why does unsupervised pre-training help deep learning?

Mon, 24 Aug 2020 11:40:00 -0600

They’re pretty sure that it performs regularization by starting off the supervised training in a good spot, instead of by somehow improving the optimization path.

The consciousness prior

Fri, 14 Aug 2020 09:05:56 -0700

System 1 cognitive abilities are about low-level perception and intuitive knowledge. System 2 cognitive abilities can be described verbally, and include things like reasoning, planning, and imagination. In cognitive neuroscience, the “Global Workspace Theory” says that at each moment specific pieces of information become a part of working memory and become globally available to other unconscious computational processes. Relative to the unconscious state, the conscious state is low-dimensional, focusing on a few things. The paper proposes we use an attention mechanism (in the sense of the Bahdanau 2015 paper) to produce the conscious state, and then a VAE or conditional GAN to produce the output from the conscious state.

Compositional generalization by factorizing alignment and translation

Mon, 27 Jul 2020 09:11:16 -0700

They had a biRNN with attention for alignment encoding, and then a single linear function of each one-hot encoded word for encoding that single word. Their reasoning was that by separating the alignment from the meaning of individual words the model could more easily generalize to unseen words.