# Data scaling laws in NMT: the effect of noise and architecture

Posted on
deep-learning neural-scaling

This paper is all about trying a bunch of different changes to the training setup to see what affects the power law exponent over dataset size. Here are some of the answers:

• encoder-decoder size asymmetry: exponent not affected, but effective model capacity affected
• architecture (LSTM vs. Transformer): exponent not affected, but effective model capacity affected
• dataset quality (filtered vs. not): exponent and effective model capacity not effected, losses on smaller datasets affected
• dataset source (ParaCrawl vs. in-house dataset): exponent not affected
• adding independent noise: exponent not affected, but effective model capacity affected

Here are some other things to test that I thought of while I read this:

• compare scaling with respect to language pairs (the architecture experiments saw $$p=0.28$$ and $$p=0.25$$ for en -> de and zh -> en respectively. Is that difference significant?)