Vision

Masked autoencoders are scalable vision learners

Posted on 2022-02-11 at 14:18:30 UTC-0500

This post was created as an assignment in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. It contains no summary of the paper, only questions and thoughts.

The paper mentions that the mask token is a learned vector, and it sounds like the positional embeddings are also learned. As I recall, "Attention Is All You Need" found that fixed sinusoidal (sine/cosine) positional encodings performed about as well as learned ones, and the authors chose the sinusoidal version because it might extrapolate to sequences longer than those seen in training. But now it seems like most papers use learned embeddings. If anyone knows why, send me an email.
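To make the comparison concrete, here is a minimal sketch of the two options in plain Python. The sinusoidal table follows the formula from "Attention Is All You Need"; the function names, the table sizes, and the 0.02 initialization scale for the learned table are my own assumptions for illustration, not details taken from either paper.

```python
import math
import random


def sinusoidal_positions(num_positions, dim):
    """Fixed encodings: PE(pos, 2i) = sin(pos / 10000^(2i/dim)), PE(pos, 2i+1) = cos(same angle)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(0, dim, 2):  # i plays the role of 2i in the formula
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))  # even dimension
            row.append(math.cos(angle))  # odd dimension
        table.append(row)
    return table


def init_learned_positions(num_positions, dim, seed=0):
    """Learned embeddings: a random table that training would update (scale 0.02 is an assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_positions)]


fixed = sinusoidal_positions(num_positions=8, dim=4)
learned = init_learned_positions(num_positions=8, dim=4)

# The fixed table is a pure function of the position, so it can be
# evaluated for positions beyond any training length without new parameters.
longer = sinusoidal_positions(num_positions=16, dim=4)
```

The learned table only has rows for the positions it was created with, whereas the sinusoidal table can be recomputed for any length, which is the extrapolation argument mentioned above.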