papers

These are my notes from research papers I read. Each page’s title is also a link to the abstract or PDF.

Beyond neural scaling laws: beating power law scaling via data pruning

In this paper, the authors show that test error can scale exponentially with dataset size, beating power-law scaling in a big way, when the samples added are pruned so that only the best examples are kept. There is still no free lunch, in some sense, because in most cases it becomes progressively harder to add new useful samples as the dataset gets bigger. But this is a big deal for computation, because it means that the number of samples in the dataset is not nearly as important as the coverage and quality the dataset provides. It also means that scaling laws for compute (usually expressed as a function of dataset and model size) are dataset-specific and not generalizable, because sample quality strongly affects how performance scales with data.
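To make the contrast concrete, here is a toy sketch (my own illustration with made-up constants, not numbers from the paper) of how power-law and exponential error curves compare as the dataset grows:

```python
import math

def power_law_error(n, a=5.0, nu=0.5):
    # Power-law scaling: error falls as a * n^(-nu).
    return a * n ** (-nu)

def exponential_error(n, b=0.5, c=1e-4):
    # Exponential scaling: error falls as b * exp(-c * n).
    return b * math.exp(-c * n)

# An exponential decay eventually drops below any power law, which is why
# beating power-law scaling with pruned, high-quality data is such a big deal.
for n in [1_000, 10_000, 100_000, 1_000_000]:
    print(f"n={n:>9,}  power-law err={power_law_error(n):.5f}  "
          f"exponential err={exponential_error(n):.2e}")
```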
Read more

LocoProp: enhancing backprop via local loss optimization

This was a paper I presented at Bang Liu’s research group meeting on 2022-08-05. You can view the slides I used here.

Continual-T0: progressively instructing 50+ tasks to language models without forgetting

This was a paper I presented at Bang Liu’s research group meeting on 2022-06-06. You can view the slides I used here. Continual-T0 (CT0) extends T0 by progressively training it on 8 unseen language generation tasks, while retaining a replay buffer of 1% of the original training data to preserve performance. The result is a model that retains nearly all of its performance on previous tasks while learning the new ones. In addition, CT0 maintains the original T0’s performance on unseen tasks (which is a big deal, because those tasks could not have appeared in the replay buffer), and it extends T0’s compositionality to even more unseen tasks.
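As a rough sketch of the replay idea (my own simplification, with made-up names; CT0's actual training setup is more involved), mixing a 1% rehearsal buffer into the new task's training data looks something like this:

```python
import random

def build_training_stream(new_task_examples, old_task_examples,
                          replay_fraction=0.01, seed=0):
    # Keep a small rehearsal buffer (~1% of the original training data).
    rng = random.Random(seed)
    buffer_size = max(1, int(replay_fraction * len(old_task_examples)))
    replay_buffer = rng.sample(old_task_examples, buffer_size)

    # Interleave the replay buffer with the new task so the model keeps
    # revisiting a little of what it already knows while learning the new task.
    stream = list(new_task_examples) + replay_buffer
    rng.shuffle(stream)
    return stream

# Usage sketch: examples could be (prompt, target) string pairs.
old = [("old prompt %d" % i, "old target") for i in range(10_000)]
new = [("new prompt %d" % i, "new target") for i in range(2_000)]
mixed = build_training_stream(new, old)
print(len(mixed))  # 2,000 new examples plus 100 replayed old ones
```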
Read more

Multitask prompted training enables zero-shot task generalization (T0)

T0 builds on T5 by fine-tuning on more natural prompts and testing the model’s generalization to held-out tasks. Comparing the training-format diagrams for T5 and T0, the T0 prompts are intuitively more likely to resemble the implicit/explicit prompting that is already present in the pretraining data. The authors created several prompts for each dataset.
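To illustrate what "more natural prompts" means (the exact prompt wording below is mine, not copied from either paper), compare a T5-style task-prefix input with a T0-style templated prompt for an NLI example:

```python
# Illustrative formats only: the prompt text is made up, not taken verbatim
# from the T5 or T0 papers.
example = {
    "premise": "A soccer game with multiple males playing.",
    "hypothesis": "Some men are playing a sport.",
    "label": "entailment",
}

# T5-style input: a terse task prefix plus named fields.
t5_input = (
    f"mnli premise: {example['premise']} hypothesis: {example['hypothesis']}"
)
t5_target = example["label"]

# T0-style input: a natural-language template, closer to the implicit/explicit
# prompting that already shows up in pretraining text. T0 uses several such
# templates per dataset.
t0_input = (
    f"{example['premise']} Based on the previous passage, is it true that "
    f"\"{example['hypothesis']}\"? Yes, no, or maybe?"
)
t0_target = "Yes"

print(t5_input, "->", t5_target)
print(t0_input, "->", t0_target)
```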
Read more

PaLM

This was a paper I presented at Bang Liu’s research group meeting on 2022-04-11. You can view the slides I used here.