## papers

These are my notes from research papers I read. Each page’s title is also a link to the abstract or PDF.

The unsurprising result here is that few-shot performance scales predictably with pre-training dataset size under traditional fine-tuning, matching network, and prototypical network approaches.
The interesting result is that the exponents of these three approaches were substantially different (see Table 1 in the paper), which says to me that the few-shot inference approach matters a lot.
The surprising result was that while more training on the “non-natural” Omniglot dataset did not improve few-shot accuracy on other datasets, training on “natural” datasets did improve accuracy on few-shot Omniglot.

Read moreThe big idea here is to use the geometric mean instead of the arithmetic mean across samples in the batch when computing the gradient for SGD. This overcomes the situation where averaging produces optima that are not actually optimal for any individual samples, as demonstrated in their toy example below:
In practice, the method the authors test is not exactly the geometric mean for numerical and performance reasons, but effectively accomplishes the same thing by avoiding optima that are “inconsistent” (meaning that gradients from relatively few samples actually point in that direction).

Read moreThese authors define robust error as the least upper bound on the expected loss over a family of environmental settings (including dataset, model architecture, learning algorithm, etc.):
\[\sup_{e\in\mathcal F}\mathbb E_{\omega\in P^e}\left[\ell(\phi,\omega)\right]\]
The fact that this is an upper bound and not an average is very important and is what makes this work unique from previous work in this direction. Indeed, what we should be concerned about is not how poorly a model performs on the average sample but on the worst-case sample.

Read moreWe presented this paper as a mini-lecture in Bang Liu’s IFT6289 course in winter 2022. You can view the slides we used here.

This post was created as an assignment in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts.
Sometimes these scaling laws can feel like pseudoscience because they’re a post hoc attempt to place a trend line on data. How can we be confident that the trends we observe actually reflect the scaling laws that we’re after? In the limitations section they mention that they didn’t tune hyperparameters for fine-tuning or for the code data distribution.

Read more