The goal of hyperparameter tuning is to reach the point where test loss is horizontal on the graph over model complexity.

Underfitting can be observed with a small learning rate, simple architecture, or complex data distribution. You can observe underfitting decrease by seeing more drastic results at the outset, followed by a more horizontal line further into training. You can use the LR range test to find a good learning rate range, and then use a cyclical learning rate to move up and down within that range.

The *LR range test* is a run of training that starts at a small learning rate and then slowly increases until the rate is too high and the validation loss increases. This article talks a little about what performing the LR range test looks like in practice. (They don’t go into great technical detail unfortunately.) This PyTorch implementation of LRRT looks good.

*Regularization* should be balanced with the dataset and architecture. It’s silly to have a small learning rate with lots of regularization, when you could have a large learning rate and less regularization (accomplishing the same regularizing effect with faster convergence). The solution is to perform the LR range test under a variety of regularization conditions in order to find an optimal balance.

In their study of *batch size*, they took into account the total execution time for training. Modelers are interested in reaching optimal test performance in the minimum amount of wall time. A larger batch size allows for higher learning rate (#regularization) and thus faster convergence, but the benefit tapers off, probably because of the reduction in number of iterations.

*Momentum* and learning rate are interconnected; changing one changes the optimal value for the other. It turns out that using cyclical momentum that decreases as LR increases reaches similar results to a best constant momentum, but allows for larger learning rates.

For *weight decay*, the best value should remain constant through training. Validation loss early in training is sufficient for determining a good value. Smaller datasets and architectures seem to require larger values for weight decay.

**There’s a nice recipe in section 5 for putting all of this together.**

## papers that cite this one

The fast.ai paper uses a variant of the 1cycle learning policy with warm-up and annealing on both the learning rate and momentum.