raw book

Gradient Descent is an optimization algorithm that iteratively descends from a specific (sometimes random) starting point \(x_0\) in the direction of the steepest descent to locate (local) minima of a function \(f(x)\). In this regard it shares similarities with Newton's method for function minimization.

In machine learning, Gradient Descent, including variations like Stochastic Gradient Descent, is widely used for its simplicity in minimizing loss functions.

The aim of the algorithm consists of finding the minima, such that

\[\arg\min\limits_{\mathbf{x}\in\mathbb{R}^d}\; f(\mathbf{x})\]

Given the starting point \(x_0\) and a function \(f(x)\) as well as its derivative \(f'(x)\), we can see that the derivative is pointing upwards

To not go upwards, but down towards the minimum of the function, we go in the direction of the negative derivative \(-f'(x)\) and scaling this slope with an arbitrary number \(\lambda\) (in machine learning terms, it's called the learning rate) gives the step size \(\lambda\cdot f'(x)\). Subtracting (since we want to go down from the local position) the step size from the previous \(x_k\) gives the recursive update formula of Gradient Descent:

\[x_{k+1} = x_k - \lambda\cdot f'(x_k)\]

Notes about the learning rate

The learning rate \(\lambda\) is crucial for the performance of Gradient Descent. If the learning rate is too small, the algorithm makes tiny babysteps towards the minimum, which is quite slow. If the learning rate is too large, it will never converge to the minimum, since it will always overshoot.

One idea to automatically adjust \(\lambda\) is reducing (like halving) it if \(f({x_{k+1}}) > f(x_{x_{k}})\).

Multidimensional Gradient Descent

The one dimensional Gradient Descent can be extended to \(d\) dimensions. Instead of \(x\) being a scalar, we have \(\mathbf{x}\in\mathbb{R}^d\) with \(f(\mathbf{x})\) and its gradient (with each dimension pointing upwards) \(\nabla f(\mathbf{x}) = \left(\frac{\partial f(\mathbf{x})}{x_1}, \frac{\partial f(\mathbf{x})}{x_2}, ..., \frac{\partial f(\mathbf{x})}{x_k}\right)\). The Gradient Descent update rule can therefore be stated as

\[\mathbf{x}_{k+1} = \mathbf{x}_k - \lambda\cdot\nabla f(\mathbf{x}_k)\]

Stochastic Gradient Descent

Especially in machine learning, the loss function is determined by the number of data points, which makes calculating the gradient quite expensive. Stochastic Gradient Descent is an approximation, which is mathematically not optimal but good enough in most cases. Instead of \(N\) inputs, typically only one random data point is used in each step.

A compromise is a randomly selected subset, the so called mini batch. Instead of only one data point or \(N\) data points for a full batch, we choose \(n\) random data points with a rule of thumb of \(32\leq n\leq 1024 < N\). It combines the best of both worlds and is the most common approach for training machine learning models.

Adaptive Gradient Descent

Another variation is the Adaptive Gradient Descent, which makes \(\lambda\) also a vector, and \(\mathbf{\lambda}\circ\nabla f(\mathbf{x})\) being the element-wise Hadamard-Produkt. The complete Adaptive Gradient Descent is therefore

\[\mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{\lambda}\circ\nabla f(\mathbf{x}_k)\]

Moment Gradient Descent

JavaScript implementation of Gradient Descent

// Define the function to minimize: f(x) = x^2
function f(x) {
  return x * x;

// Define the derivative of function f(x)
function f_(x) {
  return 2 * x;

// Learning rate
let learningRate = 0.1;

// Precision to stop the algorithm
const precision = 1e-10;

// Max number of iterations
let iterations = 1000;

// Initial guess for the minimum
let x = 4.3; // You can choose any starting value
let y = f(x);

// Gradient Descent algorithm
do {

  let x_ = x - learningRate * f_(x);
  let y_ = f(x);

  if (y_ > y) {
    learningRate = learningRate / 2;

  if (Math.abs(x_ - x) <= precision) {

  x = x_;
  y = y_;
} while (iterations > 0);

console.log("Minimum x value:", x);
console.log("Minimum f(x) value:", f(x));