data-pruning

Trivial or impossible—dichotomous data difficulty masks model differences (on ImageNet and beyond)

The authors observe that 48.2% of ImageNet images are learned by all models regardless of their inductive bias, 14.3% are consistently misclassified by every model, and only roughly a third (37.5%) are responsible for the differences between two models’ decisions. They call this phenomenon dichotomous data difficulty (DDD). Varying hyperparameters, optimizers, architectures, supervision modes, and sampling methods, they find that models differ in performance on only about a third of the images in the dataset. And this isn’t specific to ImageNet: they found similar results for CIFAR-100 and a synthetic Gaussian dataset. Based on this per-image agreement across models, they divide the dataset into “trivials”, “impossibles”, and “in-betweens”.
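As a rough illustration of that three-way split (a minimal sketch, not the paper’s code), the snippet below partitions image indices given a boolean correctness matrix collected from several trained models; the matrix shape, function name, and toy data are assumptions for demonstration.

```python
import numpy as np

def split_by_difficulty(correct: np.ndarray) -> dict:
    """Split image indices into trivials, impossibles, and in-betweens.

    correct[m, i] is True if model m classifies image i correctly.
    """
    all_right = correct.all(axis=0)      # every model correct  -> trivial
    all_wrong = (~correct).all(axis=0)   # every model wrong    -> impossible
    disagree = ~(all_right | all_wrong)  # models disagree      -> in-between

    return {
        "trivials": np.flatnonzero(all_right),
        "impossibles": np.flatnonzero(all_wrong),
        "in_betweens": np.flatnonzero(disagree),
    }

# Toy example: 3 models x 6 images.
correct = np.array([
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 0],
], dtype=bool)

for name, idx in split_by_difficulty(correct).items():
    print(name, idx, f"{len(idx) / correct.shape[1]:.1%}")
```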
Read more

Beyond neural scaling laws: beating power law scaling via data pruning

In this paper they show that performance can scale exponentially with dataset size when the samples that are added are pruned down to only the best examples, beating power law scaling in a big way. There is still no free lunch, in some sense: in most cases it becomes progressively harder to find new useful samples as the dataset grows. But this is a big deal for computation, because it means that the number of samples in the dataset matters far less than the coverage and quality the dataset provides. It also means that compute scaling laws (usually expressed as a function of dataset and model size) are dataset-specific rather than universal, because sample quality has such a large effect on data scaling.
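The paper’s self-supervised pruning metric for ImageNet clusters embeddings with k-means and treats distance to the nearest cluster prototype as difficulty, keeping hard examples when data is abundant and easy ones when it is scarce. Below is a minimal sketch of that idea, assuming you already have an (n_samples, d) embedding matrix from a self-supervised encoder; the cluster count, keep fraction, simplified Euclidean distance, and use of scikit-learn’s KMeans are illustrative assumptions, not the paper’s exact setup.

```python
import numpy as np
from sklearn.cluster import KMeans

def prune_by_prototype_distance(embeddings: np.ndarray,
                                keep_fraction: float = 0.7,
                                n_clusters: int = 100,
                                keep_hard: bool = True,
                                seed: int = 0) -> np.ndarray:
    """Return indices of the examples to keep after pruning.

    Difficulty is the distance to the nearest k-means centroid:
    points close to a prototype are "easy", far points are "hard".
    """
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    km.fit(embeddings)

    # Distance of each sample to its assigned (nearest) centroid.
    dists = np.linalg.norm(
        embeddings - km.cluster_centers_[km.labels_], axis=1)

    order = np.argsort(dists)  # sorted easy (small distance) -> hard
    n_keep = int(len(embeddings) * keep_fraction)
    return order[-n_keep:] if keep_hard else order[:n_keep]

# Usage on random stand-in embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(5000, 128)).astype(np.float32)
kept = prune_by_prototype_distance(emb, keep_fraction=0.5, n_clusters=20)
print(f"kept {len(kept)} of {len(emb)} samples")
```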
Read more