optimization
This was a paper I presented at Bang Liu’s research group meeting on 2022-08-05. You can view the slides I used here.
This post was created as an assignment in Irina Rish’s neural scaling laws course (IFT6167) in winter 2022. The post contains no summarization, only questions and thoughts.
Once I learned how the loss functions worked for each chunk, my first question was whether the earlier chunks were going to be able to learn the low-level features that later chunks would need. Figure 7 seems to show that they do, although their quality apparently decreases with increasingly local updates.
BERT optimizes the Masked Language Model (MLM) objective by masking word pieces uniformly at random in its training data and attempting to predict the masked values. With SpanBERT, spans of tokens are masked and the model is expected to predict the text in the spans from the representations of the words on the boundary. Span lengths follow a geometric distribution, and span start points are uniformly random.
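A quick sketch of how I understand that sampling procedure (this is my own illustration, not the authors’ code; the geometric parameter and the cap on span length are the values I remember from the paper):

```python
import numpy as np

def sample_masked_spans(seq_len, mask_budget=0.15, p_geom=0.2, max_span_len=10, rng=None):
    """Pick token indices to mask, SpanBERT-style: span lengths come from a
    (clipped) geometric distribution and span start points are uniform."""
    rng = rng or np.random.default_rng()
    budget = int(seq_len * mask_budget)   # mask roughly 15% of tokens overall
    masked = set()
    while len(masked) < budget:
        length = min(rng.geometric(p_geom), max_span_len)
        start = int(rng.integers(0, seq_len - length + 1))
        masked.update(range(start, start + length))
    return sorted(masked)

print(sample_masked_spans(seq_len=128))
```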
To predict each individual masked token, a two-layer feedforward network was provided with the boundary token representations plus the position embedding of the target token, and the output vector representation was used to predict the masked token and compute cross-entropy loss exactly as in standard MLM.
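Here is a rough PyTorch sketch of that prediction head (layer sizes, names, and the use of GELU and LayerNorm are my assumptions; the overall structure, two boundary representations plus a position embedding fed through a two-layer network, is what the description above specifies):

```python
import torch
import torch.nn as nn

class SpanBoundaryHead(nn.Module):
    """Two-layer feedforward head: takes the left/right boundary token
    representations and a position embedding for the target token, and
    outputs vocabulary logits scored with cross-entropy as in standard MLM."""

    def __init__(self, hidden_size, vocab_size, max_span_len=10):
        super().__init__()
        self.pos_emb = nn.Embedding(max_span_len, hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(3 * hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, hidden_size),
        )
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, left_boundary, right_boundary, target_pos):
        # left_boundary, right_boundary: (batch, hidden); target_pos: (batch,)
        h = torch.cat(
            [left_boundary, right_boundary, self.pos_emb(target_pos)], dim=-1
        )
        return self.decoder(self.mlp(h))  # logits for the masked token
```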
In the paper, they use Bayes’ rule to show that the contribution of the first of two tasks is contained in the posterior distribution of the model parameters given the first dataset. This is important because it means we can estimate that posterior to get a sense of which model parameters were most important for that first task.
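Spelled out in my own notation, with $\mathcal{D}_A$ and $\mathcal{D}_B$ for the two datasets and $\theta$ for the parameters, and assuming the datasets are independent given the parameters, the Bayes’-rule step is:

$$\log p(\theta \mid \mathcal{D}) = \log p(\mathcal{D}_B \mid \theta) + \log p(\theta \mid \mathcal{D}_A) - \log p(\mathcal{D}_B),$$

where $\mathcal{D} = \mathcal{D}_A \cup \mathcal{D}_B$. Everything the first dataset has to say about the parameters is folded into the posterior $p(\theta \mid \mathcal{D}_A)$, which is the quantity the paper then approximates.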
In this paper, they perform that estimation using a multivariate Gaussian distribution. The means are the parameter values after training on the first dataset, and the precisions (inverse variances) are the diagonal entries of the Fisher information matrix.
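A minimal sketch of how that Gaussian approximation becomes a penalty term (the function, the dictionary layout, and the lam weighting are my own framing, not the paper’s code):

```python
import torch

def quadratic_penalty(model, old_params, fisher_diag, lam=1.0):
    """Quadratic penalty from the diagonal-Gaussian approximation: the mean is
    the parameter value after the first task (old_params) and the precision is
    the diagonal of the Fisher information matrix (fisher_diag)."""
    loss = 0.0
    for name, p in model.named_parameters():
        loss = loss + (fisher_diag[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * loss
```

Here old_params would be a snapshot of the parameters after training on the first dataset, and fisher_diag an estimate of the diagonal Fisher, for example the average of the squared gradients of the log-likelihood over that dataset.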
The goal of hyperparameter tuning is to reach the point where the test loss curve goes horizontal on the graph of loss versus model complexity.
Underfitting can be observed with a small learning rate, a simple architecture, or a complex data distribution. You can see underfitting decrease when the loss drops sharply at the start of training and then levels off into a more horizontal line later on. You can use the LR range test to find a good learning rate range, and then use a cyclical learning rate to move up and down within that range.
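In PyTorch that workflow looks roughly like this (the model, data, and learning-rate bounds below are placeholders; torch.optim.lr_scheduler.CyclicLR provides the triangular schedule):

```python
import torch
from torch.optim.lr_scheduler import CyclicLR

model = torch.nn.Linear(10, 2)                     # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

# The bounds would come from an LR range test: train briefly while increasing
# the learning rate and note where the loss starts improving and where it
# blows up; those two points bracket the cycle.
scheduler = CyclicLR(optimizer, base_lr=1e-4, max_lr=1e-1,
                     step_size_up=2000, mode="triangular")

for step in range(10):                             # stand-in for the real training loop
    optimizer.zero_grad()
    loss = model(torch.randn(8, 10)).pow(2).mean() # dummy loss
    loss.backward()
    optimizer.step()
    scheduler.step()                               # move the LR within [base_lr, max_lr]
```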