## generalization

This is a long paper, so much of my writing here is an attempt to condense the discussion. I’ve taken the liberty of pulling exact phrases and structure from the paper without explicitly using quotes.
Our main hypothesis is that deep learning succeeded in part because of a set of inductive biases, but that additional ones should be added in order to go from good in-distribution generalization in highly supervised learning tasks (or where strong and dense rewards are available), such as object recognition in images, to strong out-of-distribution generalization and transfer learning to new tasks with low sample complexity.

In the paper they use Bayes’ rule to show that the contribution of the first of two tasks is contained in the posterior distribution of model parameters over the first dataset. This is important because it means we can estimate that posterior to try to get a sense for which model parameters were most important for that first task.
In this paper, they perform that estimation using a multivariate Gaussian distribution.
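As a sketch of the reasoning, here is how the Bayes’ rule step might be written out. The symbols are my own assumptions, not taken from the excerpt: \(\theta\) for the model parameters, \(\mathcal{D}_A\) and \(\mathcal{D}_B\) for the first and second datasets, and the two datasets are assumed to be conditionally independent given \(\theta\).

\[
\log p(\theta \mid \mathcal{D}_A, \mathcal{D}_B) = \log p(\mathcal{D}_B \mid \theta) + \log p(\theta \mid \mathcal{D}_A) - \log p(\mathcal{D}_B \mid \mathcal{D}_A)
\]

Everything the first task contributes sits in the middle term, \(\log p(\theta \mid \mathcal{D}_A)\). If that posterior is approximated by a multivariate Gaussian with a hypothetical mean \(\theta_A^{*}\) (the parameters found on the first task) and covariance \(\Sigma\), then up to an additive constant

\[
\log p(\theta \mid \mathcal{D}_A) \approx -\tfrac{1}{2} (\theta - \theta_A^{*})^{\top} \Sigma^{-1} (\theta - \theta_A^{*})
\]

so directions in which \(\Sigma^{-1}\) is large are the ones where moving away from \(\theta_A^{*}\) costs the most, which is one way to read "most important for the first task".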

This is a follow-on to “A meta-transfer objective for learning to disentangle causal mechanisms”.
Here we describe an algorithm for predicting the causal graph structure of a set of visible random variables, each possibly causally dependent on any of the other variables.
The algorithm uses two sets of parameters: the structural parameters and the functional parameters. The structural parameters form a matrix in which \(\sigma(\gamma_{ij})\) represents the belief that variable \(X_j\) is a direct cause of \(X_i\).
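To make the structural parameters concrete, here is a minimal sketch of how a matrix of \(\gamma_{ij}\) values could be turned into edge beliefs and then into one sampled candidate graph. This is not the paper's code: the function names, the example `gamma` values, the choice to rule out self-causation, and the independent Bernoulli draw per edge are all my own assumptions for illustration.

```python
import math
import random


def edge_beliefs(gamma):
    """Return sigma(gamma[i][j]): the belief that X_j is a direct cause of X_i.

    Assumption: a variable is never its own cause, so the diagonal is zero.
    """
    n = len(gamma)
    return [
        [0.0 if i == j else 1.0 / (1.0 + math.exp(-gamma[i][j])) for j in range(n)]
        for i in range(n)
    ]


def sample_graph(gamma, rng):
    """Draw one adjacency matrix; entry [i][j] == 1 means an edge X_j -> X_i.

    Assumption: each edge is an independent Bernoulli draw from its belief.
    """
    return [[1 if rng.random() < p else 0 for p in row] for row in edge_beliefs(gamma)]


# Hypothetical structural parameters for three variables.
gamma = [
    [0.0, 4.0, -4.0],   # X_0: probably caused by X_1, probably not by X_2
    [-4.0, 0.0, 0.0],   # X_1: probably not caused by X_0, undecided about X_2
    [0.0, 0.0, 0.0],    # X_2: undecided about both
]

beliefs = edge_beliefs(gamma)
graph = sample_graph(gamma, random.Random(0))
```

A \(\gamma_{ij}\) of zero corresponds to a belief of exactly 0.5, so an all-zero matrix is a natural "no opinion yet" starting point in this sketch.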

Theoretically, models should be able to predict on out-of-distribution data if their understanding of causal relationships is correct. The toy problem they use in this paper is that of predicting temperature from altitude. If a model is trained on data from Switzerland, the model should ideally be able to correctly predict on data from the Netherlands, even though it hasn’t seen elevations that low before.
The main contribution of this paper is that they’ve found that models tend to transfer faster to a new distribution when they learn the correct causal relationships, and when those relationships are sparsely represented, meaning they are represented by relatively few nodes in the network.

The theoretical value in talking about the parameter-function map is that this map lets us talk about sets of parameters that produce the same function. In this paper they used recently proven results from algorithmic information theory (AIT) to show that for neural networks the parameter-function map is biased toward functions with low Kolmogorov complexity, meaning that simple functions are more likely to appear given a random choice of parameters. Since real-world problems are also biased toward simple functions, this could explain the generalization/memorization results found by Zhang et al.
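One way to get a feel for the parameter-function map is to sample parameters at random and count how often each function shows up. The sketch below is my own toy setup, not the paper's experiment: a tiny network with 3 Boolean inputs and 4 tanh hidden units (both sizes are arbitrary), with every weight and bias drawn from a standard normal as an assumed prior. Each sampled network is reduced to its 8-bit truth table, which identifies the function it computes. Note that this only measures how often functions appear; it does not compute Kolmogorov complexity, which is uncomputable.

```python
import itertools
import math
import random
from collections import Counter

# All 8 possible inputs for 3 Boolean variables.
INPUTS = list(itertools.product([0, 1], repeat=3))


def random_network(rng, n_in=3, n_hidden=4):
    """Draw every weight and bias independently from N(0, 1) (an assumed prior)."""
    hidden = [
        ([rng.gauss(0, 1) for _ in range(n_in)], rng.gauss(0, 1))
        for _ in range(n_hidden)
    ]
    output = ([rng.gauss(0, 1) for _ in range(n_hidden)], rng.gauss(0, 1))
    return hidden, output


def truth_table(network):
    """Map parameters to a function: the thresholded output on every input."""
    hidden, (out_w, out_b) = network
    bits = []
    for x in INPUTS:
        h = [math.tanh(sum(w * xi for w, xi in zip(ws, x)) + b) for ws, b in hidden]
        y = sum(w * hi for w, hi in zip(out_w, h)) + out_b
        bits.append("1" if y > 0 else "0")
    return "".join(bits)


def function_frequencies(n_samples, seed=0):
    """Count how many sampled parameter settings land on each truth table."""
    rng = random.Random(seed)
    return Counter(truth_table(random_network(rng)) for _ in range(n_samples))


counts = function_frequencies(2000)
top_five = counts.most_common(5)
```

There are 256 possible truth tables on 3 inputs, so a uniform map would give each one roughly 0.4% of the samples. If the bias described above carries over to this toy setup, a handful of truth tables in `top_five` should account for a much larger share than that.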
