Masked autoencoders are scalable vision learners

This post was created as an assignment in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts. The paper mentions that the mask token is a learned vector, and it sounds like the positional embeddings are also learned. If I remember correctly, Attention Is All You Need found that fixed sinusoidal positional encodings performed about as well as learned ones, and the authors chose the sinusoidal version because it might extrapolate to sequences longer than those seen in training.
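To make the comparison concrete, here is a minimal sketch of the fixed sine/cosine encoding from Attention Is All You Need, in its original 1D form. The function name and the toy sizes are my own choices for illustration, not anything taken from the masked autoencoder paper or its code. Nothing in the table is trained: each row depends only on the position index.

```python
import math


def sinusoidal_positional_encoding(num_positions, dim):
    """Build a fixed (non-learned) table of positional encodings.

    PE[pos][2i]   = sin(pos / 10000^(2i / dim))
    PE[pos][2i+1] = cos(pos / 10000^(2i / dim))
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even so sin/cos come in pairs")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim // 2):
            # Each sin/cos pair uses a different wavelength.
            angle = pos / (10000 ** (2 * i / dim))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table


# Hypothetical toy sizes, chosen only for illustration.
pe = sinusoidal_positional_encoding(num_positions=8, dim=4)
longer = sinusoidal_positional_encoding(num_positions=16, dim=4)
```

Because the encoding is a closed-form function of the position, `longer` agrees with `pe` on the first 8 positions and simply continues past them. A learned embedding table has no rows for positions it was never trained on, which is the intuition behind the longer-sequence argument.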