Generalization

generalization

A closer look at memorization in deep networks

Posted on 2020-08-31 at 11:52:35 UTC-0600

This paper builds on what we learned in “Understanding deep learning requires rethinking generalization”. In that paper they showed that DNNs are able to fit pure noise in the same amount of time as it can fit real data, which means that our optimization algorithm (SGD, Adam, etc.) is not what’s keeping DNNs from overfitting. experiments for detecting easy/hard samples permalink It looks like there are qualitative differences between a DNN that has memorized some data and a DNN that has seen real data. In experiments they found that real datasets contain “easy examples” that are more quickly learned than the hard examples. This is not the case for random data.

Why does unsupervised pre-training help deep learning?

Posted on 2020-08-24 at 11:40:00 UTC-0600

They’re pretty sure that it performs regularization by starting off the supervised training in a good spot, instead of by somehow improving the optimization path.

The consciousness prior

Posted on 2020-08-14 at 09:05:56 UTC-0700

System 1 cognitive abilities are about low-level perception and intuitive knowledge. System 2 cognitive abilities can be described verbally, and include things like reasoning, planning, and imagination. In cognitive neuroscience, the “Global Workspace Theory” says that at each moment specific pieces of information become a part of working memory and become globally available to other unconscious computational processes. Relative to the unconscious state, the conscious state is low-dimensional, focusing on a few things. The paper proposes we use an attention mechanism (in the sense of the Bahdanau 2015 paper) to produce the conscious state, and then a VAE or conditional GAN to produce the output from the conscious state.

Compositional generalization by factorizing alignment and translation

Posted on 2020-07-27 at 09:11:16 UTC-0700

They had a biRNN with attention for alignment encoding, and then a single linear function of each one-hot encoded word for encoding that single word. Their reasoning was that by separating the alignment from the meaning of individual words the model could more easily generalize to unseen words.