## generalization

I think it’s valuable to be working in the open whenever possible, so I’m going to keep my research notes here. These notes will hopefully be full of good (and bad) ideas, so if someone borrows a good idea and publishes on it, that’s great!
This post contains my research notes as I try to understand how model scaling affects worst-group performance. This started as a group project in the neural scaling laws course at Mila in winter 2022.

Read moreThis was a paper we presented about in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. You can view the slides we used here, and the recording here.

The big idea here is to use the geometric mean instead of the arithmetic mean across samples in the batch when computing the gradient for SGD. This overcomes the situation where averaging produces optima that are not actually optimal for any individual samples, as demonstrated in their toy example below:
In practice, the method the authors test is not exactly the geometric mean for numerical and performance reasons, but effectively accomplishes the same thing by avoiding optima that are “inconsistent” (meaning that gradients from relatively few samples actually point in that direction).

Read moreThese authors define robust error as the least upper bound on the expected loss over a family of environmental settings (including dataset, model architecture, learning algorithm, etc.):
\[\sup_{e\in\mathcal F}\mathbb E_{\omega\in P^e}\left[\ell(\phi,\omega)\right]\]
The fact that this is an upper bound and not an average is very important and is what makes this work unique from previous work in this direction. Indeed, what we should be concerned about is not how poorly a model performs on the average sample but on the worst-case sample.

Read moreThis is a long paper, so a lot of my writing here is an attempt to condense the discussion. I’ve taken the liberty to pull exact phrases and structure from the paper without explicitly using quotes.
Our main hypothesis is that deep learning succeeded in part because of a set of inductive biases, but that additional ones should be added in order to go from good in-distribution generalization in highly supervised learning tasks (or where strong and dense rewards are available), such as object recognition in images, to strong out-of-distribution generalization and transfer learning to new tasks with low sample complexity.

Read more