$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\,\nabla E_i(\mathbf{w}_t) \tag{3.39}$$

where $E_i$ denotes the error evaluated on the single sample $i$ and $\eta$ is the learning rate.
A second-order version of stochastic gradient descent is possible, but it does not typically lead to improvements.
From a practical standpoint, the update is often performed on a small number of samples (the batch size) greater than 1, in order to reduce the noise of individual samples while avoiding the overly averaged contributions seen in full gradient descent.
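As an illustration of the idea, the following sketch performs mini-batch SGD on a simple least-squares problem; the quadratic loss, batch size, and learning rate are assumptions made only for this example.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.05, batch_size=16, epochs=100, rng=None):
    """Mini-batch SGD on the least-squares loss 0.5 * ||X w - y||^2 / N."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)                 # reshuffle the samples each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]  # a few samples: less noisy than one,
                                                  # less averaged than the full set
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad                        # plain SGD step
    return w

# toy usage: recover a known linear model from noisy data
rng = np.random.default_rng(1)
w_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(500, 3))
y = X @ w_true + 0.01 * rng.normal(size=500)
print(minibatch_sgd(X, y))
```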
Finally, to simulate an inertia to change, a term called momentum is added:

$$\Delta\mathbf{w}_{t+1} = \alpha\,\Delta\mathbf{w}_t - \eta\,\nabla E_i(\mathbf{w}_t), \qquad \mathbf{w}_{t+1} = \mathbf{w}_t + \Delta\mathbf{w}_{t+1}$$
The values of the momentum coefficient $\alpha$ are typically small (for example, 0.05).
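A sketch of the modified loop under the formulation above, reusing the least-squares setup of the previous example; the default values of `mu` (the momentum coefficient) and `lr` are illustrative assumptions:

```python
import numpy as np

def sgd_momentum(X, y, lr=0.05, mu=0.05, batch_size=16, epochs=100, rng=None):
    """Mini-batch SGD with a momentum term on the least-squares loss."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    dw = np.zeros(d)                              # previous update: the "inertia"
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            dw = mu * dw - lr * grad              # keep a fraction of the old step
            w += dw
    return w
```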
Momentum is the simplest modification of SGD that addresses some common issues in optimization for machine learning. Besides SGD with momentum, there are numerous variants designed to accelerate the convergence of gradient descent-type algorithms. A non-exhaustive list includes:
- AdaGrad assigns a separate learning rate to each parameter, which is advantageous for problems with sparse gradients.
- RMSProp also uses a per-parameter learning rate, but adjusts it according to the magnitude of recent gradients; this algorithm performs well in online settings and on non-stationary problems.
- Adam, in addition to adapting the learning rate through the second-order moment of the gradients as in RMSProp, also exploits their first-order moment (the mean), which plays the role of a momentum term.
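To make the differences concrete, here is a minimal NumPy sketch of one update step for each algorithm in its textbook formulation; the function names, the `state` dictionary, and the default hyperparameters are assumptions of this example, not part of the text.

```python
import numpy as np

def adagrad_step(w, grad, state, lr=0.01, eps=1e-8):
    """AdaGrad: the per-parameter rate shrinks with the accumulated squared gradients."""
    state["g2"] = state.get("g2", np.zeros_like(w)) + grad**2
    return w - lr * grad / (np.sqrt(state["g2"]) + eps)

def rmsprop_step(w, grad, state, lr=0.001, rho=0.9, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients (recent magnitude)."""
    state["g2"] = rho * state.get("g2", np.zeros_like(w)) + (1 - rho) * grad**2
    return w - lr * grad / (np.sqrt(state["g2"]) + eps)

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: first-order moment (mean) plus second-order moment, both bias-corrected."""
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", np.zeros_like(w)) + (1 - beta1) * grad
    state["v"] = beta2 * state.get("v", np.zeros_like(w)) + (1 - beta2) * grad**2
    m_hat = state["m"] / (1 - beta1**t)           # bias correction of the mean
    v_hat = state["v"] / (1 - beta2**t)           # bias correction of the variance
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)
```

Each function takes the current parameters, the gradient evaluated at them, and a per-optimizer state dictionary, and returns the updated parameters; the same `state` must be passed back at every step so that the moment estimates accumulate.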