Deep-Learning

Data scaling laws in NMT: the effect of noise and architecture

Posted on 2022-02-09 at 20:47:59 UTC-0500

This paper tries a bunch of different changes to the training setup to see which ones affect the power law exponent over dataset size. Here are some of the answers:

- encoder-decoder size asymmetry: exponent not affected, but effective model capacity affected
- architecture (LSTM vs. Transformer): exponent not affected, but effective model capacity affected
- dataset quality (filtered vs. not): exponent and effective model capacity not affected, losses on smaller datasets affected
- dataset source (ParaCrawl vs. in-house dataset): exponent not affected
- adding independent noise: exponent not affected, but effective model capacity affected

Here are some other things to test that I thought of while I read this:

Parallel training of deep networks with local updates

Posted on 2022-02-09 at 10:50:21 UTC-0500

This post was created as an assignment in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts. Once I learned how the loss functions worked for each chunk, my first question was whether the earlier chunks were going to be able to learn the low-level features that later chunks would need. Figure 7 seems to show that they do, although their quality apparently decreases with increasingly local updates.

A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification

Posted on 2022-02-02 at 15:35:00 UTC-0500

This post was created as an assignment in Bang Liu’s IFT6289 course in winter 2022. The structure of the post follows the structure of the assignment: summarization followed by my own comments.

Paper summarization: Word embeddings have gotten so good that state-of-the-art sentence classification can often be achieved with just a one-layer convolutional network on top of those embeddings. This paper dials in on the specifics of training that convolutional layer for this downstream sentence classification task.
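
To make the "one-layer convolutional network on top of embeddings" idea concrete, here is a minimal pure-Python sketch of the forward pass up to the sentence feature vector. The function names, the ReLU activation, and the max-over-time pooling are assumptions chosen for illustration (they are common choices for this kind of model), not details taken from the excerpt above. A classifier layer such as a softmax would sit on top of the returned features.

```python
def conv_feature_map(sentence, filt, bias=0.0):
    """Slide one filter over a sentence and return its feature map.

    sentence: list of word embeddings, each a list of d floats.
    filt: list of h weight rows (each of length d); h is the filter width in words.
    Assumes len(sentence) >= len(filt).
    """
    h = len(filt)
    feature_map = []
    for i in range(len(sentence) - h + 1):
        s = bias
        # Dot product between the filter and the window of h consecutive words.
        for w_row, x_row in zip(filt, sentence[i:i + h]):
            s += sum(w * x for w, x in zip(w_row, x_row))
        feature_map.append(max(0.0, s))  # ReLU (an assumed activation)
    return feature_map


def sentence_features(sentence, filters):
    """Max-over-time pooling: keep one number per filter."""
    return [max(conv_feature_map(sentence, f)) for f in filters]


# Toy example: three words with 2-dimensional embeddings (made-up values).
sentence = [[1, 0], [0, 1], [1, 1]]
filters = [
    [[1, 0], [0, 1]],  # a filter spanning 2 words
    [[-1, -1]],        # a filter spanning 1 word
]
features = sentence_features(sentence, filters)
```

Because each filter contributes a single pooled value, the feature vector has a fixed length regardless of sentence length, which is what lets one small layer handle variable-length input.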

Learning transferable visual models from natural language supervision (CLIP)

Posted on 2022-02-02 at 12:35:03 UTC-0500

This post was created as an assignment in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts. This concept of wide vs. narrow supervision (rather than binary “supervised” and “unsupervised”) is an interesting and flexible way to think about the way these training schemes leverage data. The zero-shot CLIP matches the performance of 4-shot CLIP, which is a surprising result. What do the authors mean when they make this guess about zero-shot’s advantage:

Distributed representations of words and phrases and their compositionality

Posted on 2022-02-01 at 16:09:19 UTC-0500

This post was created as an assignment in Bang Liu’s IFT6289 course in winter 2022. The structure of the post follows the structure of the assignment: summarization followed by my own comments.

Paper summarization: This paper describes multiple improvements that are made to the original Skip-gram model:

- Decreasing the rate of exposure to common words improves the training speed and increases the model’s accuracy on infrequent words.
- A new training target they call “negative sampling” improves the training speed and the model’s accuracy on frequent words.
- Allowing the model to use phrase vectors improves the expressivity of the model.

Negative sampling: The original Skip-gram model computed probabilities using a hierarchical softmax, which allowed the model to compute only \(O(\log_2(|V|))\) probabilities when estimating the probability of a particular word, rather than \(O(|V|)\). Negative sampling, on the other hand, deals directly with the generated vector representations. The negative sampling loss function basically tries to increase the similarity (a dot product passed through a sigmoid, rather than a true cosine similarity) between the input representation of the input word and the output representation of the neighboring word, while decreasing the similarity between the input word and the output representations of a few randomly sampled words. They find that the required number of negative examples decreases as the dataset size increases.
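
As a rough illustration of the negative sampling loss described above, here is a minimal pure-Python sketch for a single (input word, neighboring word) pair. The function and variable names are hypothetical, and the vectors are made-up toy values. The score is the sigmoid of a dot product, so it rewards aligned vectors without normalizing their lengths. How the negative words are sampled is left out of the sketch.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(v_in, v_out_pos, v_out_negs):
    """Loss for one training pair (lower is better).

    v_in: input representation of the input word.
    v_out_pos: output representation of the true neighboring word.
    v_out_negs: output representations of the sampled negative words.
    """
    # Pull the true neighbor toward the input word.
    loss = -math.log(sigmoid(dot(v_out_pos, v_in)))
    # Push each sampled negative word away from the input word.
    for v_neg in v_out_negs:
        loss -= math.log(sigmoid(-dot(v_neg, v_in)))
    return loss


# Toy check: the loss is lower when the neighbor is aligned with the input
# word and the negative sample points away from it.
v_in = [1.0, 0.0]
aligned = negative_sampling_loss(v_in, [1.0, 0.0], [[-1.0, 0.0]])
misaligned = negative_sampling_loss(v_in, [-1.0, 0.0], [[1.0, 0.0]])
```

Each pair only touches one positive vector and a handful of negative vectors, which is where the speedup over scoring the whole vocabulary comes from.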