# A closer look at memorization in deep networks

deep-learning generalization

This paper builds on what we learned in “Understanding deep learning requires rethinking generalization”. That paper showed that DNNs are able to fit pure noise in about the same amount of time as they can fit real data, which means that our optimization algorithm (SGD, Adam, etc.) is not what’s keeping DNNs from overfitting.

## experiments for detecting easy/hard samples

It looks like there are qualitative differences between a DNN that has memorized random data and one that has learned from real data. In their experiments, they found that real datasets contain “easy examples” that are learned more quickly than the hard ones; random data shows no such split.
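As a rough sketch of how you might measure this, here is a toy logistic-regression setup (my own simplification, not the paper's code) that records the first SGD step at which each training example is classified correctly — easy examples should get a low number, hard ones a high number or never:

```python
import numpy as np

def steps_to_learn(X, y, steps=200, lr=0.5, seed=0):
    """For each example, record the first SGD step at which it is
    classified correctly -- a rough proxy for how 'easy' it is."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])
    first = np.full(len(X), -1)
    for t in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))      # predicted P(y=1)
        correct = (p > 0.5) == (y == 1)
        newly = (first == -1) & correct
        first[newly] = t                         # first step it was right
        w -= lr * X.T @ (p - y) / len(X)         # full-batch SGD step
    return first  # -1 means never learned within `steps`
```

With random labels you'd expect the distribution of these first-learned steps to be much more uniform than with real labels.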

The second way they detected this pattern was by measuring something called loss sensitivity: the magnitude of the gradient of the loss with respect to each training example, tracked over a number of SGD updates. An example with high loss sensitivity has a greater effect on future values of the loss. For random data, all samples have high loss sensitivity, while in real data only a few examples do.
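The paper measures this on full DNNs, but the idea can be sketched with a toy logistic-regression model, where the gradient of the per-example loss with respect to the input has a closed form (again my own simplification, not the authors' setup):

```python
import numpy as np

def loss_sensitivity(X, y, steps=50, lr=0.5, seed=0):
    """Per-example loss sensitivity: |dL/dx| averaged over SGD updates.
    For logistic regression, dL/dx = (p - y) * w, so its L1 norm is
    |p - y| * ||w||_1."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])
    sens = np.zeros(len(X))
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))       # predicted P(y=1)
        sens += np.abs(p - y) * np.abs(w).sum()  # L1 norm of dL/dx
        w -= lr * X.T @ (p - y) / len(X)         # full-batch SGD step
    return sens / steps
```

Examples the model keeps getting wrong (|p - y| large) stay sensitive throughout training, which is what you'd expect memorized noise to look like.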

I wonder what would happen if you used loss sensitivity to throw out training examples. Would the model generalize better?

The Gini coefficient is a measure of the inequality among values in a frequency distribution: 0 means all values are equal, while values approaching 1 mean a few values account for nearly all of the total. They found that the Gini coefficient of loss sensitivity ended up much higher for real data than for random data, i.e. sensitivity was concentrated in a few examples. Surprisingly, this was even true when the model was tasked with giving each example a unique class, which is essentially the task of memorization.
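For reference, the Gini coefficient of a set of nonnegative values can be computed from the sorted cumulative sums (this is the standard formula, not anything specific to the paper):

```python
import numpy as np

def gini(x):
    """Gini coefficient of a 1-D array of nonnegative values.
    Returns 0 for a perfectly equal distribution and approaches
    (n - 1) / n when one value holds the entire total."""
    x = np.sort(np.asarray(x, dtype=float))  # ascending order
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n
```

Feeding the per-example loss sensitivities into `gini` would give you the paper's concentration measure: high for real data, low for random data.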

In another experiment, they collected per-class loss sensitivity: how sensitive the class-$$i$$ loss was to class-$$j$$ examples. For random data, this value was only high when $$i = j$$, but for real data the model apparently learned some useful cross-class features, resulting in high per-class loss sensitivity even when $$i \neq j$$.

In another experiment, they found that as they increased the fraction of random data in the training set, a higher-capacity model was required in order to generalize to the validation data. This opposes the classical understanding that a smaller model, being less expressive, would be forced to focus on the patterns in the true data. The authors theorize that larger models are able to memorize the noise in a way that still lets them capture the patterns in the real data.
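The setup for this kind of experiment is simple to reproduce in spirit: corrupt a fraction of the labels and sweep model capacity. A minimal label-randomization helper (my own hypothetical version, mirroring the paper's setup) might look like:

```python
import numpy as np

def randomize_labels(y, frac, num_classes, seed=0):
    """Return a copy of `y` with a fraction `frac` of labels replaced
    by uniformly random classes (the replacements may coincide with
    the original label)."""
    rng = np.random.default_rng(seed)
    y = np.array(y)                               # copy, don't mutate input
    idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    y[idx] = rng.integers(0, num_classes, size=len(idx))
    return y
```

Training the same architecture at several widths on `randomize_labels(y, frac, k)` for increasing `frac` is the sweep the authors describe.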

In another experiment, they found that time to convergence (the time it takes to reach 100% accuracy on the training set) is affected much more drastically by model size and training set size when the data is random than when it is real. This suggests that for real data the model is not simply memorizing.