generalization on Kyle Roth
https://kylrth.com/tags/generalization/
Recent content in generalization on Kyle Roth
Mon, 18 Jul 2022 15:31:12 -0400

the effects of scale on worst-group performance
https://kylrth.com/post/worst-group-scale/
Mon, 18 Jul 2022 15:31:12 -0400
I think it’s valuable to be working in the open whenever possible, so I’m going to keep my research notes here. These notes will hopefully be full of good (and bad) ideas, so if someone borrows a good idea and publishes on it, that’s great!
This post contains my research notes as I try to understand how model scaling affects worst-group performance. This started as a group project in the neural scaling laws course at Mila in winter 2022.

The effect of model size on worst-group generalization
https://kylrth.com/paper/effect-of-model-size-on-worst-group-generalization/
Thu, 17 Mar 2022 14:34:33 -0400
This is a paper we presented in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. You can view the slides we used here, and the recording here.

Learning explanations that are hard to vary
https://kylrth.com/paper/learning-explanations-hard-to-vary/
Tue, 22 Feb 2022 12:29:17 -0500
The big idea here is to use the geometric mean instead of the arithmetic mean across samples in the batch when computing the gradient for SGD. This overcomes the situation where averaging produces optima that are not actually optimal for any individual sample, as the toy example in their paper demonstrates.
In practice, the method the authors test is not exactly the geometric mean, for numerical and performance reasons, but it effectively accomplishes the same thing by avoiding optima that are “inconsistent” (meaning that gradients from relatively few samples actually point in that direction).

In search of robust measures of generalization
https://kylrth.com/paper/robust-measures-of-generalization/
Mon, 21 Feb 2022 15:33:22 -0500
These authors define robust error as the least upper bound on the expected loss over a family of environmental settings (including dataset, model architecture, learning algorithm, etc.):
\[\sup_{e\in\mathcal F}\mathbb E_{\omega\in P^e}\left[\ell(\phi,\omega)\right]\]
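As a rough illustration of why the supremum matters, the sketch below computes an empirical version of this quantity over a small, finite family of environments and compares it with the plain average. The environment names and loss values are made up for illustration, and taking the max of a few per-environment sample means is only a stand-in for the supremum over \(\mathcal F\) of the true expectations.

```python
import numpy as np


def robust_error(losses_by_env):
    """Largest per-environment mean loss (empirical stand-in for the sup)."""
    return max(float(np.mean(losses)) for losses in losses_by_env.values())


def average_error(losses_by_env):
    """Mean of the per-environment mean losses, for comparison."""
    return float(np.mean([np.mean(losses) for losses in losses_by_env.values()]))


# Hypothetical per-sample losses for three environments (not real data).
losses_by_env = {
    "env_a": np.array([0.1, 0.2, 0.3]),
    "env_b": np.array([0.2, 0.2, 0.2]),
    "env_c": np.array([0.9, 1.1, 1.0]),
}

print("robust error: ", robust_error(losses_by_env))
print("average error:", average_error(losses_by_env))
```

In this toy setting the average hides the one environment where the model does badly, while the max reports it directly.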
The fact that this is an upper bound and not an average is very important, and it is what distinguishes this work from previous work in this direction. Indeed, what we should be concerned about is not how poorly a model performs on the average sample but how poorly it performs on the worst-case sample.

Inductive biases for deep learning of higher-level cognition
https://kylrth.com/paper/inductive-biases-higher-cognition/
Tue, 08 Dec 2020 06:40:48 -0700
This is a long paper, so a lot of my writing here is an attempt to condense the discussion. I’ve taken the liberty of pulling exact phrases and structure from the paper without explicitly using quotes.
Our main hypothesis is that deep learning succeeded in part because of a set of inductive biases, but that additional ones should be added in order to go from good in-distribution generalization in highly supervised learning tasks (or where strong and dense rewards are available), such as object recognition in images, to strong out-of-distribution generalization and transfer learning to new tasks with low sample complexity.

Overcoming catastrophic forgetting in neural networks
https://kylrth.com/paper/overcoming-catastrophic-forgetting/
Thu, 01 Oct 2020 10:47:28 -0600
In the paper they use Bayes’ rule to show that the contribution of the first of two tasks is contained in the posterior distribution of model parameters over the first dataset. This is important because it means we can estimate that posterior to try to get a sense for which model parameters were most important for that first task.
In this paper, they perform that estimation using a multivariate Gaussian distribution.

Learning neural causal models from unknown interventions
https://kylrth.com/paper/neural-causal-models/
Tue, 22 Sep 2020 10:39:54 -0600
This is a follow-on to “A meta-transfer objective for learning to disentangle causal mechanisms”.
Here we describe an algorithm for predicting the causal graph structure of a set of visible random variables, each possibly causally dependent on any of the other variables.
the algorithm
There are two sets of parameters, the structural parameters and the functional parameters. The structural parameters compose a matrix where \(\sigma(\gamma_{ij})\) represents the belief that variable \(X_j\) is a direct cause of \(X_i\).

A meta-transfer objective for learning to disentangle causal mechanisms
https://kylrth.com/paper/meta-transfer-objective-for-causal-mechanisms/
Mon, 21 Sep 2020 08:46:30 -0600
Theoretically, models should be able to predict on out-of-distribution data if their understanding of causal relationships is correct. The toy problem they use in this paper is that of predicting temperature from altitude. If a model is trained on data from Switzerland, the model should ideally be able to correctly predict on data from the Netherlands, even though it hasn’t seen elevations that low before.
The main contribution of this paper is the finding that models tend to transfer faster to a new distribution when they learn the correct causal relationships, and when those relationships are sparsely represented, meaning they are represented by relatively few nodes in the network.

Deep learning generalizes because the parameter-function map is biased towards simple functions
https://kylrth.com/paper/parameter-function-map-biased-to-simple/
Tue, 08 Sep 2020 07:29:09 -0600
The theoretical value in talking about the parameter-function map is that this map lets us talk about sets of parameters that produce the same function. In this paper they used some recently proven results from algorithmic information theory (AIT) to show that for neural networks the parameter-function map is biased toward functions with low Kolmogorov complexity, meaning that simple functions are more likely to appear given a random choice of parameters. Since real-world problems are also biased toward simple functions, this could explain the generalization/memorization results found by Zhang et al.

A closer look at memorization in deep networks
https://kylrth.com/paper/closer-look-at-memorization/
Mon, 31 Aug 2020 11:52:35 -0600
This paper builds on what we learned in “Understanding deep learning requires rethinking generalization”. In that paper they showed that DNNs are able to fit pure noise in the same amount of time as they can fit real data, which means that our optimization algorithm (SGD, Adam, etc.) is not what’s keeping DNNs from overfitting.
experiments for detecting easy/hard samples
It looks like there are qualitative differences between a DNN that has memorized some data and a DNN that has seen real data.

Why does unsupervised pre-training help deep learning?
https://kylrth.com/paper/why-unsupervised-helps/
Mon, 24 Aug 2020 11:40:00 -0600
They’re pretty sure that it performs regularization by starting off the supervised training in a good spot, instead of by somehow improving the optimization path.

The consciousness prior
https://kylrth.com/paper/consciousness-prior/
Fri, 14 Aug 2020 09:05:56 -0700
System 1 cognitive abilities are about low-level perception and intuitive knowledge. System 2 cognitive abilities can be described verbally, and include things like reasoning, planning, and imagination. In cognitive neuroscience, the “Global Workspace Theory” says that at each moment specific pieces of information become a part of working memory and become globally available to other unconscious computational processes. Relative to the unconscious state, the conscious state is low-dimensional, focusing on a few things.

Compositional generalization by factorizing alignment and translation
https://kylrth.com/paper/factorizing-alignment-and-translation/
Mon, 27 Jul 2020 09:11:16 -0700
They had a biRNN with attention for alignment encoding, and then a single linear function of each one-hot encoded word for encoding that single word. Their reasoning was that by separating the alignment from the meaning of individual words the model could more easily generalize to unseen words.
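
To make the “single linear function of each one-hot encoded word” concrete, here is a minimal NumPy sketch. It only illustrates the general idea that such an encoding is a per-word lookup that ignores context; it is not the authors’ implementation, and the vocabulary size, embedding size, weight matrix `W`, and helper names are all assumptions made for the example.

```python
import numpy as np

# Assumed toy sizes and random weights, for illustration only.
vocab_size, embed_dim = 5, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(vocab_size, embed_dim))


def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at `index`."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec


def encode_word(index):
    """Linear function of the one-hot vector: selects row `index` of W."""
    return one_hot(index, vocab_size) @ W


def encode_sentence(indices):
    """Encode each word independently of its neighbours."""
    return np.stack([encode_word(i) for i in indices])


# Word 2 gets the same encoding no matter which words surround it.
first = encode_sentence([0, 2, 4])
second = encode_sentence([3, 2, 1])
print(np.allclose(first[1], second[1]))
```

Because each word’s encoding does not depend on its position or neighbours, anything that depends on context has to come from the separate alignment pathway, which is the separation the summary above describes.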