
Learning transferable visual models from natural language supervision (CLIP)

This post was created as an assignment in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts. The concept of wide vs. narrow supervision (rather than a binary split into “supervised” and “unsupervised”) is an interesting and flexible way to think about how these training schemes leverage data. Zero-shot CLIP matches the performance of a 4-shot linear classifier trained on CLIP’s own features, which is a surprising result. What do the authors mean when they make this guess about zero-shot’s advantage:
Read more

Distributed representations of words and phrases and their compositionality

This post was created as an assignment in Bang Liu’s IFT6289 course in winter 2022. The structure of the post follows the structure of the assignment: summarization followed by my own comments.

Paper summarization

This paper describes multiple improvements made to the original Skip-gram model:

- Decreasing the rate of exposure to common words improves the training speed and increases the model’s accuracy on infrequent words.
- A new training target they call “negative sampling” improves the training speed and the model’s accuracy on frequent words.
- Allowing the model to use phrase vectors improves the expressivity of the model.

Negative sampling

The original Skip-gram model computed probabilities using a hierarchical softmax, which allowed the model to compute only \(O(\log_2(|V|))\) probabilities when estimating the probability of a particular word, rather than \(O(|V|)\). Negative sampling, on the other hand, deals directly with the generated vector representations. The negative sampling loss function basically tries to maximize the similarity (a dot product passed through a sigmoid) between the input representation of the input word and the output representation of the neighboring word, while decreasing the similarity between the input word and the output representations of a few randomly sampled words. They find that the required number of negative examples decreases as the dataset size increases.
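The negative sampling objective described above can be sketched for a single training pair. This is a minimal illustration, not the paper's implementation: the function and argument names are my own, the vectors are plain Python lists, and the score is a dot product passed through a sigmoid (which is what the paper uses, rather than a length-normalized cosine similarity).

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(v_in, v_out, negatives):
    """Loss for one (input word, neighboring word) pair.

    v_in:      input vector of the center word
    v_out:     output vector of the observed neighbor
    negatives: output vectors of k words drawn from a noise distribution
               (hypothetical inputs; how they are sampled is not shown here)
    """
    # Reward a high score for the observed pair ...
    loss = -math.log(sigmoid(dot(v_out, v_in)))
    # ... and a low score for each randomly sampled pair.
    for v_neg in negatives:
        loss -= math.log(sigmoid(-dot(v_neg, v_in)))
    return loss
```

The loss is small when the neighbor's output vector points the same way as the input vector and the sampled vectors point away from it, and it grows when that pattern is reversed.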
Read more

Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing

Recent contextual word embeddings (e.g. ELMo) have been shown to be much better than “static” embeddings (where there’s a one-to-one mapping from token to representation). This paper is exciting because the authors were able to create a multi-lingual embedding space that uses contextual word embeddings. Each token has a “point cloud” of embedding values, one point for each context containing the token. They define the embedding anchor as the average of all those points for a particular token. Here’s a figure from the paper that displays a two-dimensional PCA of the contextual representations for four Spanish words, along with their anchors:
Read more
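The anchor definition in the excerpt above (the average of a token's contextual embeddings) can be sketched in a few lines. The function name and the list-of-lists representation are assumptions made for illustration; a real pipeline would collect the points by running a contextual encoder over a corpus.

```python
def embedding_anchor(points):
    """Average a token's contextual embeddings into a single anchor vector.

    points: one embedding (a list of floats) per context containing the token
    """
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]
```

For example, a token seen in two contexts with embeddings `[0.0, 0.0]` and `[2.0, 4.0]` would get the anchor `[1.0, 2.0]`.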

SpanBERT: improving pre-training by representing and predicting spans

BERT optimizes the Masked Language Model (MLM) objective by masking word pieces uniformly at random in its training data and attempting to predict the masked values. With SpanBERT, spans of tokens are masked and the model is expected to predict the text in the spans from the representations of the words on the boundary. Span lengths follow a geometric distribution, and span start points are uniformly random. To predict each individual masked token, a two-layer feedforward network was provided with the boundary token representations plus the position embedding of the target token, and the output vector representation was used to predict the masked token and compute cross-entropy loss exactly as in standard MLM.
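The span selection described above (geometric span lengths, uniformly random start points) can be sketched roughly as follows. The function names are my own. The defaults `p=0.2`, `max_len=10` and a 15% masking budget are the values I recall from the SpanBERT paper, so treat them as assumptions. The sketch also ignores the paper's restriction to whole words and shortens the last span to fit the budget exactly.

```python
import random


def sample_span_length(rng, p=0.2, max_len=10):
    # Geometric draw: keep extending the span until a "success" with
    # probability p, capped at max_len.
    length = 1
    while rng.random() >= p and length < max_len:
        length += 1
    return length


def sample_masked_positions(num_tokens, rng, budget=0.15, p=0.2, max_len=10):
    """Pick token positions to mask by sampling whole spans."""
    target = max(1, int(num_tokens * budget))
    masked = set()
    while len(masked) < target:
        # Never mask more than the remaining budget (simplification).
        length = min(sample_span_length(rng, p, max_len), target - len(masked))
        # Start point is uniform over all positions where the span fits.
        start = rng.randrange(num_tokens - length + 1)
        masked.update(range(start, start + length))
    return sorted(masked)
```

Calling `sample_masked_positions(100, random.Random(0))` returns 15 positions, mostly grouped into contiguous runs rather than scattered individually as in BERT's original masking.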
Read more

Deep contextualized word representations

This is the original paper introducing Embeddings from Language Models (ELMo). Unlike most widely used word embeddings, ELMo word representations are functions of the entire input sentence. That’s what makes ELMo great: they’re contextualized word representations, meaning that they can express multiple possible senses of the same word. Specifically, ELMo representations are a learned linear combination of all layers of a bidirectional LSTM language model. The LSTM undergoes general semi-supervised pretraining, but the linear combination is learned specifically for each task. It’s been shown that initial layers in LSTM encoders are more representative of syntax, while later layers tend to represent semantics, so this linear combination is a key advantage that allows ELMo to improve accuracy on tasks ranging from POS tagging to question answering.
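The task-specific linear combination of layers can be sketched for a single token as below. In the paper, as I understand it, the per-layer weights are softmax-normalized and the result is multiplied by a learned scalar. Here the names `raw_weights` and `gamma` are illustrative, and both are passed in as fixed numbers rather than learned.

```python
import math


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def combine_layers(layer_vectors, raw_weights, gamma=1.0):
    """Weighted sum of one token's per-layer representations.

    layer_vectors: one vector per LSTM layer for the same token
    raw_weights:   one unnormalized scalar per layer (learned per task)
    gamma:         overall scale factor (also learned per task)
    """
    weights = softmax(raw_weights)
    dim = len(layer_vectors[0])
    return [
        gamma * sum(w * layer[d] for w, layer in zip(weights, layer_vectors))
        for d in range(dim)
    ]
```

With equal raw weights the result is simply the average of the layers. A syntax-heavy task such as POS tagging could instead learn to put more weight on the lower layers.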
Read more