These authors define robust error as the least upper bound on the expected loss over a family of environmental settings (including dataset, model architecture, learning algorithm, etc.):
\[\sup_{e\in\mathcal F}\mathbb E_{\omega\sim P^e}\left[\ell(\phi,\omega)\right]\]
The fact that this is a supremum rather than an average is crucial, and it is what distinguishes this work from previous work in this direction. Indeed, what we should be concerned about is not how poorly a model performs in the average environment but in the worst-case environment.
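To make the distinction concrete, here is a minimal sketch contrasting the two objectives. The loss function and the environment family are made up for illustration, not taken from the paper, and over a finite family the supremum reduces to a max:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(phi, samples):
    """Per-sample squared-error loss; a stand-in for ell(phi, omega)."""
    return (samples - phi) ** 2

# Each environment e supplies samples omega drawn from its own P^e.
# Here: three Gaussian environments with different means (made up).
environments = [rng.normal(mu, 1.0, size=1_000) for mu in (0.0, 0.5, 3.0)]

phi = 0.5  # a fixed model

expected_losses = [loss(phi, env).mean() for env in environments]

average_error = np.mean(expected_losses)  # what average-case work optimizes
robust_error = np.max(expected_losses)    # sup over the (finite) family

print(f"average: {average_error:.2f}, robust: {robust_error:.2f}")
# A model can look fine on average while failing badly in one environment.
```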
We presented this paper as a mini-lecture in Bang Liu’s IFT6289 course in winter 2022. You can view the slides we used here.
This post was created as an assignment in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts.
Sometimes these scaling laws can feel like pseudoscience because they’re a post hoc attempt to place a trend line on data. How can we be confident that the trends we observe actually reflect the scaling laws that we’re after? In the limitations section they mention that they didn’t tune hyperparameters for fine-tuning or for the code data distribution. How can we know that a confounding hyperparameter is not responsible for the trend we see? I wonder if we aren’t really being statistically rigorous until we can predict generalization error on an unseen training setup, rather than just an unseen model size/dataset size.
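To make the "post hoc trend line" worry concrete, here is a hedged sketch of how such a curve is typically fit; the data and constants are synthetic, not the paper's:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """L(n) = a * n^(-b) + c: loss vs. scale with an irreducible constant."""
    return a * n ** (-b) + c

# Synthetic losses at four model sizes (in millions of parameters).
sizes = np.array([1.0, 10.0, 100.0, 1000.0])
rng = np.random.default_rng(0)
losses = power_law(sizes, a=10.0, b=0.3, c=1.5) + rng.normal(0.0, 0.01, 4)

params, _ = curve_fit(power_law, sizes, losses, p0=(1.0, 0.5, 1.0))
print("fitted (a, b, c):", params)

# Three free parameters fit four points almost by construction, which is
# exactly why a good fit here says little about an unseen training setup.
```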
This was a paper we presented in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. You can view the slides we used here.
It’s important to note that in the NMT results (Figure 1) we would expect the lines in the graph on the left to curve as the capacity of the individual models is exhausted. That is why the authors fit the curves with an extra constant added. Meanwhile, the results in the graph on the right are curved because, as the data size grows, the optimal model size also grows, and it becomes increasingly difficult to find the right hyperparameters to train the model down to the optimal generalization error. (See the last paragraph of Section 4.1.)
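For reference, a fit with an extra constant has a saturating form along these lines (I am reconstructing the notation, not quoting the paper):

\[\hat L(D)=\alpha D^{-\beta}+L_\infty\]

where \(L_\infty\) is the added constant: the loss floor that a fixed-capacity model bends toward no matter how much data it sees.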
This post was created as an assignment in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts.
In this paper they mention that the mask vector is learned, and it sounds like the positional embeddings are also learned. I remember that in Attention Is All You Need the sinusoidal positional embeddings performed about as well as learned ones, and the authors chose the sinusoidal version because it might extrapolate to sequences longer than those seen during training. But now it seems like most papers use learned embeddings. If anyone knows why, send me an email.
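For context, the fixed embeddings from that paper are just sinusoids at geometrically spaced frequencies; a minimal NumPy sketch (my own variable names):

```python
import numpy as np

def sinusoidal_embeddings(max_len, d_model):
    """Fixed positional embeddings from Attention Is All You Need:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    positions = np.arange(max_len)[:, None]                # (max_len, 1)
    freqs = 10000 ** (np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions / freqs)
    pe[:, 1::2] = np.cos(positions / freqs)
    return pe

# Because the frequencies are fixed, the same formula extends to positions
# beyond any sequence length seen in training; a learned table cannot.
pe = sinusoidal_embeddings(max_len=512, d_model=64)
print(pe.shape)  # (512, 64)
```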