
Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing

Recent contextual word embeddings (e.g. ELMo) have been shown to outperform “static” embeddings, where each token maps one-to-one to a single representation. This paper is exciting because the authors build a multilingual embedding space out of contextual word embeddings. Each token has a “point cloud” of embedding values, one point for each context in which the token appears, and a token’s embedding anchor is defined as the average of all the points in its cloud. A figure in the paper shows a two-dimensional PCA of the contextual representations for four Spanish words, along with their anchors; a rough sketch of computing such anchors is below.
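Here is a minimal Python sketch of that anchor computation, assuming contextual vectors for each token have already been collected into a dictionary (the function name and input format are hypothetical, not taken from the paper’s code):

```python
import numpy as np
from collections import defaultdict

def embedding_anchors(token_contextual_vectors):
    """Average each token's contextual embeddings ("point cloud") into one anchor.

    `token_contextual_vectors` maps a token string to a list of vectors, one per
    context in which the token occurred (an assumed input format, not the paper's).
    """
    anchors = {}
    for token, vectors in token_contextual_vectors.items():
        anchors[token] = np.mean(np.stack(vectors), axis=0)
    return anchors

# Toy usage: two contexts for "banco", one for "casa", with 4-dimensional vectors.
cloud = defaultdict(list)
cloud["banco"].extend([np.random.randn(4), np.random.randn(4)])
cloud["casa"].append(np.random.randn(4))
anchors = embedding_anchors(cloud)
print(anchors["banco"].shape)  # (4,)
```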
Read more

Deep contextualized word representations

This is the original paper introducing Embeddings from Language Models (ELMo). Unlike most widely used word embeddings, ELMo word representations are functions of the entire input sentence. That’s what makes ELMo great: they’re contextualized word representations, so they can express multiple possible senses of the same word depending on its context. Specifically, ELMo representations are a learned linear combination of all the layers of a deep bidirectional LSTM language model. The language model is pretrained on a large unlabeled corpus in a general, task-agnostic way, while the linear combination weights are learned separately for each downstream task. It’s been shown that lower layers of LSTM encoders tend to capture syntax while later layers tend to capture semantics, so this task-specific mixing is a key advantage that helps ELMo improve accuracy on tasks ranging from POS tagging to question answering. A sketch of the layer-mixing step is shown below.
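As a rough illustration of that task-specific layer mixing, here is a small PyTorch sketch assuming a three-layer biLM with 1024-dimensional states; the class name and tensor shapes are illustrative assumptions, not the paper’s implementation:

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific weighted sum of biLM layers, in the spirit of ELMo."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable scalar per layer, plus a global scale gamma.
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_reps: torch.Tensor) -> torch.Tensor:
        # layer_reps: (num_layers, seq_len, dim) -- all biLM layers for one sentence.
        norm_weights = torch.softmax(self.scalar_weights, dim=0)
        # Weighted sum over the layer dimension, then scale by gamma.
        mixed = (norm_weights[:, None, None] * layer_reps).sum(dim=0)
        return self.gamma * mixed

# Usage: three biLM layers, a 5-token sentence, 1024-dimensional representations.
mix = ScalarMix(num_layers=3)
elmo_vectors = mix(torch.randn(3, 5, 1024))  # shape: (5, 1024)
```

Because the mixing weights are trained with the downstream model, each task can emphasize whichever layers (more syntactic or more semantic) help it most.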
Read more