Gradient Descent

The gradient descent algorithm (GD, also known as steepest descent) updates the weights $\boldsymbol\beta$ at each iteration using the gradient (specifically, the negative gradient) of $S(\boldsymbol\beta)$:

\begin{displaymath}
\boldsymbol\beta_{t+1} = \boldsymbol\beta_{t} - \gamma \nabla S(\boldsymbol\beta_t) = \boldsymbol\beta_{t} - \gamma \sum_{i=1}^{n} \nabla \ell_i(\boldsymbol\beta_t)
\end{displaymath} (3.36)

or, written in terms of the update step $\boldsymbol\delta_t$,
\begin{displaymath}
\boldsymbol\delta_t = - \gamma \sum_{i=1}^{n} \nabla \ell_i(\boldsymbol\beta_t)
\end{displaymath} (3.37)

where $\gamma$ is an appropriately chosen step-size factor (in machine learning it is referred to as the learning rate). Under suitable assumptions, if the starting point is close to the solution and $\gamma$ is sufficiently small, the rate of convergence that can be achieved is linear.
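As a concrete illustration, the following Python sketch implements the update (3.36) for a generic objective $S(\boldsymbol\beta)$; the function name \texttt{grad\_S}, the default learning rate, and the stopping tolerance are illustrative assumptions and not part of the original derivation.

\begin{verbatim}
import numpy as np

def gradient_descent(grad_S, beta0, gamma=1e-3, tol=1e-8, max_iter=10000):
    """Minimize S(beta) by iterating beta_{t+1} = beta_t - gamma * grad S(beta_t).

    grad_S : callable returning the gradient of S at a given beta
    beta0  : initial guess (NumPy array)
    gamma  : learning rate, assumed small enough for convergence
    """
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        step = -gamma * grad_S(beta)     # delta_t as in (3.37)
        beta = beta + step               # update (3.36)
        if np.linalg.norm(step) < tol:   # stop when the step becomes negligible
            break
    return beta
\end{verbatim}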

Since expression (3.36) introduces a user-defined parameter $\gamma$, this approach is empirical and problem-dependent, if not outright reliant on the experience of the human user. Following this observation and comparing the gradient descent update with equation (3.35), it becomes evident that Newton's method is, in fact, a special case of gradient descent. One can therefore achieve better optimization by replacing the scalar parameter $\gamma$ with a positive definite matrix $\boldsymbol\Gamma_t$, obtained as the inverse of the Hessian evaluated at the current point:

\begin{displaymath}
\boldsymbol\beta_{t+1} = \boldsymbol\beta_{t} - \boldsymbol\Gamma_t \nabla S(\boldsymbol\beta_t)
\end{displaymath} (3.38)

Gradient descent with this second-order choice of step is thus the Newton algorithm, which, under appropriate assumptions, achieves quadratic rather than merely linear convergence.
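A minimal sketch of the second-order update (3.38) follows, assuming a callable \texttt{hess\_S} that returns the Hessian of $S$ (an assumption for illustration); in practice the linear system is solved rather than inverting the Hessian explicitly.

\begin{verbatim}
import numpy as np

def newton_descent(grad_S, hess_S, beta0, tol=1e-10, max_iter=100):
    """Minimize S(beta) with beta_{t+1} = beta_t - Gamma_t * grad S(beta_t),
    where Gamma_t is the inverse of the Hessian of S at beta_t (eq. 3.38).

    grad_S : callable returning the gradient of S at beta
    hess_S : callable returning the (positive definite) Hessian of S at beta
    """
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        g = grad_S(beta)
        H = hess_S(beta)
        # Gamma_t * grad S(beta_t) = H^{-1} g, computed by solving H x = g
        step = np.linalg.solve(H, g)
        beta = beta - step
        if np.linalg.norm(step) < tol:
            break
    return beta
\end{verbatim}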
