papers
These are my notes from research papers I read. Each page’s title is also a link to the abstract or PDF.
This post was created as an assignment in Bang Liu’s IFT6289 course in winter 2022. The structure of the post follows the structure of the assignment: summarization followed by my own comments.
paper summarization permalink This paper describes multiple improvements that are made to the original Skip-gram model:
Decreasing the rate of exposure to common words improves the training speed and increases the model’s accuracy on infrequent words. A new training target they call “negative sampling” improves the training speed and the model’s accuracy on frequent words. Allowing the model to use phrase vectors improves the expressivity of the model. negative sampling permalink The original Skip-gram model computed probabilities using a hierarchical softmax, which allowed the model to compute only \(O(\log_2(|V|))\) probabilities when estimating the probability of a particular word, rather than \(O(|V|)\). Negative sampling, on the other hand, deals directly with the generated vector representations. The negative sampling loss function basically tries to maximize cosine similarity between the input representation of the input word with the output representation of the neighboring word, while decreasing cosine similarity between the input word and a few random vectors. They find that the required number of negative examples decreases as the dataset size increases.
Read moreThis post was created as an assignment in Bang Liu’s IFT6289 course in winter 2022. The structure of the post follows the structure of the assignment: summarization followed by my own comments.
paper summarization permalink The authors use the example of distinguishing between a Samoyed and a white wolf to talk about the importance of learning to rely on very small variations while ignoring others. While shallow classifiers must rely on human-crafted features which are expensive to build and always imperfect, deep classifiers are expected to learn their own features by applying a “general-purpose learning procedure” to learn the features and the classification layer from the data simultaneously.
Read moreRecent contextual word embeddings (e.g. ELMo) have shown to be much better than “static” embeddings (where there’s a one-to-one mapping from token to representation). This paper is exciting because they were able to create a multi-lingual embedding space that used contextual word embeddings.
Each token will have a “point cloud” of embedding values, one point for each context containing the token. They define the embedding anchor as the average of all those points for a particular token. Here’s a figure from the paper that displays a two-dimensional PCA of the contextual representations for four Spanish words, along with their anchors:
Read moreThis is a long paper, so a lot of my writing here is an attempt to condense the discussion. I’ve taken the liberty to pull exact phrases and structure from the paper without explicitly using quotes.
Our main hypothesis is that deep learning succeeded in part because of a set of inductive biases, but that additional ones should be added in order to go from good in-distribution generalization in highly supervised learning tasks (or where strong and dense rewards are available), such as object recognition in images, to strong out-of-distribution generalization and transfer learning to new tasks with low sample complexity.
Read moreBERT optimizes the Masked Language Model (MLM) objective by masking word pieces uniformly at random in its training data and attempting to predict the masked values. With SpanBERT, spans of tokens are masked and the model is expected to predict the text in the spans from the representations of the words on the boundary. Span lengths follow a geometric distribution, and span start points are uniformly random.
To predict each individual masked token, a two-layer feedforward network was provided with the boundary token representations plus the position embedding of the target token, and the output vector representation was used to predict the masked token and compute cross-entropy loss exactly as in standard MLM.
Read more