Selective annotation makes language models better few-shot learners

Selective annotation chooses a pool of samples to annotate from a large set of unlabeled data. The paper's main result is that combining this with item-specific prompt retrieval improves performance substantially (>10% relative gain, with lower performance variance). Interestingly, selective annotation does not help for finetuning, or when the prompts are randomly selected. They call their selective annotation method “vote-\(k\)”. Vote-\(k\) essentially creates a network of similar unlabeled instances (similarity is measured with Sentence-BERT), and then selects from them with a network importance score that is discounted to promote diversity. The discounting is performed by iteratively adding to the selection set, each time penalizing new nodes for being close to nodes that are already in the selection set.
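To make the discounting idea concrete, here is a minimal sketch of a vote-\(k\)-style selection loop. It is my own simplification, not the paper's implementation: random vectors stand in for Sentence-BERT embeddings, and the function name `vote_k_sketch` and the parameters `k` and `rho` are illustrative assumptions.

```python
import numpy as np

def vote_k_sketch(embeddings, n_select, k=3, rho=10.0):
    """Greedy, diversity-discounted selection on a k-nearest-neighbour graph.

    Simplified illustration only. `k` (neighbours per node) and `rho`
    (discount base, > 1) are hypothetical parameter names.
    Assumes n_select <= number of rows and no zero-length embeddings.
    """
    # Cosine similarity between all pairs of instances.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)  # a node is not its own neighbour

    n = len(x)
    # adj[v, u] == 1 means u is one of v's k most similar instances,
    # so v casts a "vote" for u.
    neighbours = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros((n, n))
    adj[np.arange(n)[:, None], neighbours] = 1.0

    selected = []
    is_selected = np.zeros(n, dtype=bool)
    overlap = np.zeros(n)  # how many selected nodes each voter already points to

    for _ in range(n_select):
        # A voter's weight shrinks for every selected node it is already close to,
        # which penalizes candidates in regions that are already covered.
        weight = np.where(is_selected, 0.0, rho ** (-overlap))
        scores = weight @ adj
        scores[is_selected] = -np.inf  # never pick the same node twice
        u = int(np.argmax(scores))
        selected.append(u)
        is_selected[u] = True
        overlap += adj[:, u]
    return selected

# Stand-in for Sentence-BERT embeddings of 50 unlabeled instances.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(50, 8))
to_annotate = vote_k_sketch(fake_embeddings, n_select=5)
```

With `rho` close to 1 the discount has little effect and the loop mostly picks the most "voted-for" nodes; a larger `rho` pushes the selection toward regions of the graph that are not yet covered.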

Trivial or impossible—dichotomous data difficulty masks model differences (on ImageNet and beyond)

We observe that 48.2% [of] images [in ImageNet] are learned by all models regardless of their inductive bias; 14.3% [of] images are consistently misclassified by all models; only roughly a third (37.5%) of images are responsible for the differences between two models’ decisions. We call this phenomenon dichotomous data difficulty (DDD). The authors varied hyperparameters, optimizers, architectures, supervision modes, and sampling methods, finding that models only varied in performance on about a third of the images in the dataset.
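As a rough illustration of how such a split could be computed, the sketch below takes a boolean matrix of per-image correctness for several models and reports the share of images every model gets right, the share every model gets wrong, and the remainder. The function name `ddd_split` and the toy matrix are assumptions for illustration; the paper's own analysis is more detailed than this.

```python
import numpy as np

def ddd_split(correct):
    """Split images by whether all models agree on them.

    `correct` is a boolean array of shape (n_models, n_images), where
    correct[m, i] is True if model m classifies image i correctly.
    Hypothetical helper, not taken from the paper's code.
    """
    correct = np.asarray(correct, dtype=bool)
    trivial = correct.all(axis=0)        # every model is right
    impossible = (~correct).all(axis=0)  # every model is wrong
    mixed = ~(trivial | impossible)      # models disagree
    return {
        "trivial": float(trivial.mean()),
        "impossible": float(impossible.mean()),
        "mixed": float(mixed.mean()),
    }

# Toy example: 3 models, 6 images (made-up values).
toy = np.array([
    [1, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 1],
], dtype=bool)
split = ddd_split(toy)
```

Only the "mixed" images can contribute to an accuracy gap between any two of the models, which is why the headline numbers leave so little room for model differences to show up.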

Beyond neural scaling laws: beating power law scaling via data pruning

In this paper they show that we can achieve exponential performance scaling over dataset size, when the samples added are pruned to be only the best examples. This beats power law scaling in a big way. There is still no free lunch, in some sense, because in most cases it will become progressively harder to add new useful samples as the dataset gets bigger. But this is a big deal for computation, because it means that the number of samples in the dataset is not nearly as important as the coverage and quality that the dataset provides.
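One schematic way to write the contrast, where \(\varepsilon\) is test error, \(P\) is the number of training samples, and \(\nu, c > 0\) are placeholder constants I am introducing for illustration (not values from the paper):

\[
\varepsilon_{\text{unpruned}}(P) \propto P^{-\nu}
\qquad \text{versus} \qquad
\varepsilon_{\text{pruned}}(P) \propto e^{-cP}
\]

Under the power law, halving the error requires multiplying \(P\) by \(2^{1/\nu}\); under the exponential form it requires only adding \(\ln 2 / c\) more well-chosen samples.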

LocoProp: enhancing backprop via local loss optimization

I presented this paper at Bang Liu’s research group meeting on 2022-08-05. You can view the slides I used here.

The effects of scale on worst-group performance

I think it’s valuable to be working in the open whenever possible, so I’m going to keep my research notes here. These notes will hopefully be full of good (and bad) ideas, so if someone borrows a good idea and publishes on it, that’s great! This post contains my research notes as I try to understand how model scaling affects worst-group performance. This started as a group project in the neural scaling laws course at Mila in winter 2022.