neural-scaling

It's not just size that matters: small language models are also few-shot learners

We presented this paper as a mini-lecture in Bang Liu’s IFT6289 course in winter 2022. You can view the slides we used here.

Scaling laws for transfer

This post was created as an assignment in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts. Sometimes these scaling laws can feel like pseudoscience because they’re a post hoc attempt to place a trend line on data. How can we be confident that the trends we observe actually reflect the scaling laws that we’re after? In the limitations section they mention that they didn’t tune hyperparameters for fine-tuning or for the code data distribution. How can we know that a confounding hyperparameter is not responsible for the trend we see? I wonder if we aren’t really being statistically rigorous until we can predict generalization error on an unseen training setup, rather than just an unseen model size/dataset size.
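To make that last worry concrete, here is a minimal sketch of what "predicting an unseen setup" could look like: fit a saturating power law on a few observed fine-tuning set sizes, then check the prediction at a held-out point. All numbers below are synthetic, and the functional form L(D) = a·D^(−b) + c is my assumption, not the paper's exact fit.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical check of a scaling "law": fit on a few observed points,
# then see whether the fit extrapolates to a held-out point. The data is
# synthetic and the saturating power-law form is an assumption.
def power_law(D, a, b, c):
    return a * D**(-b) + c

D_seen = np.array([1e5, 3e5, 1e6, 3e6])        # fine-tuning set sizes we fit on
L_seen = power_law(D_seen, 50.0, 0.25, 2.0)    # stand-in losses, exactly on the law

(a, b, c), _ = curve_fit(power_law, D_seen, L_seen, p0=[10.0, 0.3, 1.0])

D_held_out = 1e7                               # a size we never fit on
print("extrapolated loss:", power_law(D_held_out, a, b, c))
# The harder (and more convincing) test is extrapolating to a setup that
# differs in more than size: new data distribution, new hyperparameters.
```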
Read more

Deep learning scaling is predictable, empirically

This is a paper we presented on in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. You can view the slides we used here. It’s important to note that in the NMT results (Figure 1) we would expect the lines in the left-hand graph to curve as the capacity of the individual models is exhausted; that’s why the authors fit the curves with an extra constant added. Meanwhile, the results in the right-hand graph are curved because, as the data size grows, the optimal model size also grows, and it becomes increasingly difficult to find the right hyperparameters to train the model down to the optimal generalization error. (See the last paragraph in Section 4.1.)
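As a quick illustration of the point about the extra constant (using my own notation, not necessarily the paper's exact fitting form): a pure power law is a straight line on a log-log plot, while adding a constant irreducible-loss term makes the curve bend and flatten at large data sizes.

```python
import numpy as np

# Illustration only, with made-up coefficients: compare a pure power law
# to one with an added constant. On a log-log plot the first is a straight
# line; the second bends and flattens, which is the curvature expected once
# model capacity (or inherent noise) becomes the bottleneck.
def pure_power_law(D, a, b):
    return a * D**(-b)

def power_law_with_constant(D, a, b, c):
    return a * D**(-b) + c      # c is the "extra constant" in the fit

D = np.logspace(4, 9, 50)                        # dataset sizes
straight = pure_power_law(D, 100.0, 0.35)
bending = power_law_with_constant(D, 100.0, 0.35, 0.5)
# `bending` asymptotes to 0.5 as D grows, while `straight` keeps decreasing.
```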
Read more

Data scaling laws in NMT: the effect of noise and architecture

This paper is all about trying a bunch of different changes to the training setup to see what affects the power-law exponent over dataset size. Here are some of the answers:

- encoder-decoder size asymmetry: exponent not affected, but effective model capacity affected
- architecture (LSTM vs. Transformer): exponent not affected, but effective model capacity affected
- dataset quality (filtered vs. not): exponent and effective model capacity not affected, losses on smaller datasets affected
- dataset source (ParaCrawl vs. in-house dataset): exponent not affected
- adding independent noise: exponent not affected, but effective model capacity affected

Here are some other things to test that I thought of while I read this:
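Before getting to those, it may help to pin down what "exponent" versus "effective model capacity" means in the list above. The sketch below uses an assumed parameterization, L(D) = α·D^(−p) + L∞, which may not match the paper's exact fitting form: changes to p alter how fast the reducible loss decays with data, while changes to α shift the whole curve without changing that decay rate.

```python
import numpy as np

# Assumed parameterization (mine, not necessarily the paper's):
#   L(D) = alpha * D**(-p) + L_inf
# "Exponent not affected, effective capacity affected" then means p stays
# the same while alpha changes: the reducible part of the loss keeps the
# same log-log slope but the curve is shifted.
def loss(D, alpha, p, L_inf):
    return alpha * D**(-p) + L_inf

D = np.logspace(5, 8, 30)                               # dataset sizes (e.g. sentence pairs)
baseline = loss(D, alpha=1e3, p=0.30, L_inf=1.0)        # reference setup
capacity_shift = loss(D, alpha=3e3, p=0.30, L_inf=1.0)  # same decay rate, shifted curve
exponent_shift = loss(D, alpha=1e3, p=0.35, L_inf=1.0)  # different decay rate
```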
Read more

Parallel training of deep networks with local updates

This post was created as an assignment in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts. Once I learned how the loss functions worked for each chunk, my first question was whether the earlier chunks were going to be able to learn the low-level features that later chunks would need. Figure 7 seems to show that they do, although their quality apparently decreases with increasingly local updates.
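For context on what "a loss for each chunk" refers to, here is a toy reconstruction (my own sketch, not the paper's code; the layer sizes and optimizer are arbitrary): the network is split into chunks, each chunk gets its own auxiliary head and loss, and gradients are stopped at chunk boundaries so each chunk is updated only through its local loss.

```python
import torch
import torch.nn as nn

# Toy sketch of local updates (not the paper's code; sizes are arbitrary):
# each chunk has its own auxiliary classifier and loss, and the input to
# each chunk is detached so no gradient crosses chunk boundaries.
chunks = nn.ModuleList([
    nn.Sequential(nn.Linear(32, 64), nn.ReLU()),
    nn.Sequential(nn.Linear(64, 64), nn.ReLU()),
])
heads = nn.ModuleList([nn.Linear(64, 10), nn.Linear(64, 10)])  # auxiliary heads
opts = [torch.optim.SGD(list(c.parameters()) + list(h.parameters()), lr=0.1)
        for c, h in zip(chunks, heads)]

x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
h = x
for chunk, head, opt in zip(chunks, heads, opts):
    h = chunk(h.detach())                       # stop gradients at the boundary
    loss = nn.functional.cross_entropy(head(h), y)
    opt.zero_grad()
    loss.backward()                             # only reaches this chunk and its head
    opt.step()
# The question above is whether the first chunk, trained only through its
# own auxiliary loss, still learns the features the second chunk needs.
```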
Read more