$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\,\nabla E_i(\mathbf{w}_t) \tag{3.39}$$

where $E_i$ denotes the error evaluated on the single sample $i$ and $\eta$ is the learning rate.
A second-order version of stochastic gradient descent is possible, but it does not typically lead to improvements.
From a practical standpoint, the update is often performed on a small number of samples (the batch size) greater than 1, in order to reduce the noise of individual samples while avoiding the overly averaged contributions seen in full gradient descent.
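As an illustration of the idea, the following sketch performs mini-batch SGD on a simple least-squares problem; the quadratic loss, batch size, and learning rate are assumptions made only for this example.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.05, batch_size=16, epochs=100, rng=None):
    """Mini-batch SGD on the least-squares loss 0.5 * ||X w - y||^2 / N."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)                 # reshuffle the samples each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]  # a few samples: less noisy than one,
                                                  # less averaged than the full set
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad                        # plain SGD step
    return w

# toy usage: recover a known linear model from noisy data
rng = np.random.default_rng(1)
w_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(500, 3))
y = X @ w_true + 0.01 * rng.normal(size=500)
print(minibatch_sgd(X, y))
```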
Finally, to simulate an inertia to change, a term called momentum is added:

$$\Delta\mathbf{w}_{t+1} = \alpha\,\Delta\mathbf{w}_t - \eta\,\nabla E_i(\mathbf{w}_t), \qquad \mathbf{w}_{t+1} = \mathbf{w}_t + \Delta\mathbf{w}_{t+1}$$
The values of the momentum coefficient $\alpha$ are typically small (for example, 0.05).
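A sketch of the modified loop under the formulation above, reusing the least-squares setup of the previous example; the default values of `mu` (the momentum coefficient) and `lr` are illustrative assumptions:

```python
import numpy as np

def sgd_momentum(X, y, lr=0.05, mu=0.05, batch_size=16, epochs=100, rng=None):
    """Mini-batch SGD with a momentum term on the least-squares loss."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    dw = np.zeros(d)                              # previous update: the "inertia"
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            dw = mu * dw - lr * grad              # keep a fraction of the old step
            w += dw
    return w
```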
Momentum is the simplest modification of SGD that addresses some common issues in optimization for machine learning. Besides SGD with momentum, there are numerous variants designed to accelerate the convergence of gradient descent-type algorithms. A non-exhaustive list includes:
- AdaGrad assigns a separate learning rate to each parameter, which is advantageous for problems with sparse gradients.
- RMSProp also uses a per-parameter learning rate, but adjusts it according to the magnitude of recent gradients; this algorithm performs well in online settings and on non-stationary problems.
- Adam, in addition to adapting the learning rate through the second-order moment of the gradients as in RMSProp, also exploits their first-order moment (the mean), which plays the role of a momentum term.
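To make the differences concrete, here is a minimal NumPy sketch of one update step for each algorithm in its textbook formulation; the function names, the `state` dictionary, and the default hyperparameters are assumptions of this example, not part of the text.

```python
import numpy as np

def adagrad_step(w, grad, state, lr=0.01, eps=1e-8):
    """AdaGrad: the per-parameter rate shrinks with the accumulated squared gradients."""
    state["g2"] = state.get("g2", np.zeros_like(w)) + grad**2
    return w - lr * grad / (np.sqrt(state["g2"]) + eps)

def rmsprop_step(w, grad, state, lr=0.001, rho=0.9, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients (recent magnitude)."""
    state["g2"] = rho * state.get("g2", np.zeros_like(w)) + (1 - rho) * grad**2
    return w - lr * grad / (np.sqrt(state["g2"]) + eps)

def adam_step(w, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: first-order moment (mean) plus second-order moment, both bias-corrected."""
    t = state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", np.zeros_like(w)) + (1 - beta1) * grad
    state["v"] = beta2 * state.get("v", np.zeros_like(w)) + (1 - beta2) * grad**2
    m_hat = state["m"] / (1 - beta1**t)           # bias correction of the mean
    v_hat = state["v"] / (1 - beta2**t)           # bias correction of the variance
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)
```

Each function takes the current parameters, the gradient evaluated at them, and a per-optimizer state dictionary, and returns the updated parameters; the same `state` must be passed back at every step so that the moment estimates accumulate.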