papers

These are my notes from research papers I read. Each page’s title is also a link to the abstract or PDF.

A closer look at memorization in deep networks

This paper builds on what we learned in “Understanding deep learning requires rethinking generalization”. In that paper they showed that DNNs are able to fit pure noise in the same amount of time as they can fit real data, which means that our optimization algorithm (SGD, Adam, etc.) is not what’s keeping DNNs from overfitting.

Experiments for detecting easy/hard samples

It looks like there are qualitative differences between a DNN that has memorized noise and a DNN that has learned from real data. In their experiments they found that real datasets contain “easy examples” that are learned more quickly than the hard examples. This is not the case for random data.
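The bookkeeping behind that kind of easy/hard experiment is simple to sketch. Below is a minimal NumPy toy of my own (the function `first_learned_epoch` and the fake correctness history are mine, not the paper's code): given a per-epoch record of which training examples the network classifies correctly, it reports how early each example is learned for good. On real data you would expect a cluster of small values; on random labels you would not.

```python
import numpy as np

def first_learned_epoch(correct):
    """correct: bool array of shape (epochs, n_examples), where
    correct[t, i] is True if example i was classified correctly at epoch t.
    Returns, per example, the first epoch after which it is never wrong
    again (epochs if it is still wrong at the end of training)."""
    epochs, n = correct.shape
    learned = np.full(n, epochs)
    for i in range(n):
        wrong = np.flatnonzero(~correct[:, i])
        learned[i] = 0 if wrong.size == 0 else wrong[-1] + 1
    return learned

# Fake correctness history: 3 epochs, 4 examples, learned one at a time.
history = np.array([
    [True,  False, False, False],   # epoch 0
    [True,  True,  False, False],   # epoch 1
    [True,  True,  True,  False],   # epoch 2
])
print(first_learned_epoch(history))  # -> [0 1 2 3]
```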
Read more

A disciplined approach to neural network hyperparameters: part 1

The goal of hyperparameter tuning is to reach the point where the test loss curve goes horizontal, i.e., the balance point between underfitting and overfitting as model complexity grows. Underfitting shows up when the learning rate is too small, the architecture is too simple, or the data distribution is too complex for the model. You can see underfitting decreasing when the test loss drops steeply at the start of training and then flattens into a more horizontal line later on. You can use the LR range test to find a good learning rate range, and then use a cyclical learning rate to move up and down within that range, as in the sketch below.
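Here is a minimal, framework-agnostic sketch of those two ideas (the function names and default values are my own, not from the paper): the first function generates the exponentially growing learning rates you would sweep in an LR range test while watching the training loss, and the second is a triangular cyclical schedule that then bounces between the bounds you picked from that test.

```python
import numpy as np

def lr_range_test_lrs(min_lr=1e-6, max_lr=1.0, num_steps=100):
    """Learning rates for an LR range test: one short run where the LR
    grows each step from min_lr to max_lr while you record the loss.
    Pick the range where the loss falls fastest, before it blows up."""
    return np.geomspace(min_lr, max_lr, num_steps)

def triangular_clr(step, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate: ramps linearly from base_lr up
    to max_lr over `step_size` steps, then back down, and repeats."""
    cycle = np.floor(1 + step / (2 * step_size))
    x = np.abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# Example: bounds chosen from a range test, 1000-step half-cycle.
lrs = [triangular_clr(t, base_lr=1e-4, max_lr=1e-2, step_size=1000)
       for t in range(4000)]
print(min(lrs), max(lrs))  # ~1e-4 at the troughs, ~1e-2 at the peaks
```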
Read more

Forward and reverse gradient-based hyperparameter optimization

In the area of hyperparameter optimization (HO), the goal is to optimize a response function of the hyperparameters. The response function is usually the average loss on a validation set. Gradient-based HO refers to iteratively finding the optimal hyperparameters using gradient updates, just as we do with neural network training itself. The gradient of the response function with respect to the hyperparameters is called the hypergradient. One of the great things about this work is that their framework allows for all kinds of hyperparameters. The response function can be based on evaluation over the training set, the validation set, or both. The hyperparameters can be part of the loss function, part of regularization, or part of the model architecture.
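To make the hypergradient concrete, here is a toy example of my own (not the paper's forward- or reverse-mode algorithm): the only hyperparameter is the learning rate, "training" is a single SGD step on a 1-D quadratic loss, and the hypergradient of the validation loss with respect to the learning rate is computed by hand and checked against finite differences.

```python
import numpy as np

a, b = 2.0, 3.0      # minima of the train and validation quadratics
w0 = 0.0             # initial weight

def val_loss_after_one_step(eta):
    """Run 'training' (one SGD step on 0.5*(w - a)^2) and return the
    validation loss 0.5*(w1 - b)^2, i.e. the response function of eta."""
    w1 = w0 - eta * (w0 - a)          # SGD step on the training loss
    return 0.5 * (w1 - b) ** 2

eta = 0.1
# Analytic hypergradient: chain rule through the training step,
# dL_val/deta = (w1 - b) * dw1/deta, with dw1/deta = -(w0 - a).
w1 = w0 - eta * (w0 - a)
hypergrad = (w1 - b) * (-(w0 - a))

# Finite-difference check of the same quantity.
eps = 1e-6
fd = (val_loss_after_one_step(eta + eps)
      - val_loss_after_one_step(eta - eps)) / (2 * eps)

print(hypergrad, fd)   # both approximately -5.6 for these numbers
```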
Read more

Understanding deep learning requires rethinking generalization

It turns out that neural networks can reach a training loss of 0 even on randomly labeled data, and even when the inputs themselves are random noise. It was previously thought that some implicit bias in the model architecture prevented (or regularized the model away from) overfitting to specific training examples, but that’s obviously not true. They showed this empirically as just described, and also theoretically, by constructing a two-layer ReLU network with \(p=2n+d\) parameters that can express any labeling of any sample of size \(n\) in \(d\) dimensions. The proof was actually relatively easy to follow; a quick numerical check of the construction is sketched below.
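This is my own NumPy rendering of that construction, assuming I'm remembering the interleaving argument correctly: project the \(n\) inputs onto a random direction (that's the \(d\) parameters), place the \(n\) biases so they interleave the sorted projections, and the hidden-activation matrix comes out lower triangular and invertible, so you can solve exactly for the \(n\) output weights.

```python
import numpy as np

# Two-layer ReLU net c(x) = sum_j w_j * relu(a.x - b_j), with a in R^d and
# b, w in R^n (d + 2n parameters), fitting n arbitrary labels exactly.
rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.normal(size=(n, d))          # n samples in d dimensions
y = rng.normal(size=n)               # arbitrary real-valued labels

a = rng.normal(size=d)               # random projection; z_i distinct a.s.
z = X @ a
order = np.argsort(z)
z_sorted = z[order]

# Biases interleave the sorted projections: b_1 < z_1 < b_2 < z_2 < ...
eps = 0.5 * np.min(np.diff(z_sorted))
b = z_sorted - eps

# Hidden activations form a lower-triangular matrix with positive diagonal,
# so it is invertible and the output weights w can be solved for exactly.
A = np.maximum(z_sorted[:, None] - b[None, :], 0.0)
w = np.linalg.solve(A, y[order])

pred = np.maximum(z[:, None] - b[None, :], 0.0) @ w   # network output on X
print(np.max(np.abs(pred - y)))   # near zero: exact fit up to float error
```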
Read more

Why does unsupervised pre-training help deep learning?

They’re pretty sure that unsupervised pre-training acts as a regularizer by starting the supervised training off in a good region of parameter space, rather than by somehow improving the optimization path itself.