


Optimization in Machine Learning

While working with Machine Learning (ML) it is common to have a dataset $D = \{ (x_i, y_i) \}_{i=1}^N$ and a parametric function $f_w(x)$ whose specific shape depends on the task. As already mentioned, training a Machine Learning model is basically an optimization problem, where we need to find the parameters $w$ (known as weights) such that $f_w(x_i) \approx y_i$ for any $i = 1, \dots, N$. To do that, we usually consider a loss function (see here for more details), which in this case depends on the weights $w$ and the dataset $D$. We will indicate it as $\ell(w; D)$.

In most cases, $\ell(w; D)$ can be written as a sum of simpler components, each depending on a specific datapoint, i.e.

$$\ell(w; D) = \sum_{i=1}^N \ell_i(w; x_i, y_i),$$

and the training optimization problem becomes

$$\min_w \ell(w; D) = \min_w \sum_{i=1}^N \ell_i(w; x_i, y_i),$$

which can be solved by Gradient Descent, as

$$w_{k+1} = w_k - \alpha_k \sum_{i=1}^N \nabla_w \ell_i(w_k; x_i, y_i),$$

where we used that $\nabla_w \ell(w; D) = \sum_{i=1}^N \nabla_w \ell_i(w; x_i, y_i)$.

Thus, to compute each iteration of Gradient Descent we need the gradient of the objective function with respect to the weights, which can be computed by summing up the gradients of the independent functions $\ell_i(w; x_i, y_i)$.

As an example, a common loss function in Machine Learning is the Mean Squared Error (MSE), defined by

$$\ell(w; D) = \frac{1}{N} \sum_{i=1}^N \| f_w(x_i) - y_i \|_2^2,$$

where

$$\ell_i(w; x_i, y_i) = \frac{1}{N} \| f_w(x_i) - y_i \|_2^2.$$
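For instance, consider the toy scalar model $f_w(x) = wx$ with $w = 2$ and the two-point dataset $D = \{ (1, 3), (2, 3) \}$ (chosen only for illustration). Then

$$\ell(w; D) = \frac{1}{2} \left[ (f_w(1) - 3)^2 + (f_w(2) - 3)^2 \right] = \frac{1}{2} \left[ (2 - 3)^2 + (4 - 3)^2 \right] = 1.$$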

Here, computing $\nabla_w \ell(w; D)$ is not hard, since

$$\nabla_w \ell(w; D) = \sum_{i=1}^N \nabla_w \ell_i(w; x_i, y_i),$$

but

$$\nabla_w \ell_i(w; x_i, y_i) = \frac{2}{N} J_{f_w}(x_i)^T (f_w(x_i) - y_i)$$

by applying the chain rule, where $J_{f_w}(x_i)$ is the Jacobian of $f_w(x_i)$ with respect to $w$. When $J_{f_w}(x_i)$ can be explicitly computed (it depends on the shape of $f_w$), then the gradient descent iteration to solve the training optimization problem can be implemented as shown below.
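The following is a minimal sketch in Python/NumPy, assuming a linear model $f_w(x) = w^T x$ so that each $\nabla_w \ell_i$ has the closed form above; the function names, the fixed step size and the synthetic dataset are illustrative choices.

```python
import numpy as np

def f(w, X):
    """Linear model: f_w(x_i) = w^T x_i for every row x_i of X."""
    return X @ w

def loss(w, X, y):
    """MSE loss: (1/N) * sum_i (f_w(x_i) - y_i)^2."""
    return np.mean((f(w, X) - y) ** 2)

def grad_loss(w, X, y):
    """Gradient of the MSE w.r.t. w, by the chain rule:
    (2/N) * sum_i x_i * (f_w(x_i) - y_i)."""
    N = X.shape[0]
    return (2.0 / N) * X.T @ (f(w, X) - y)

def gradient_descent(X, y, alpha=0.1, n_iter=100):
    """Plain GD on the full dataset, with a step size alpha fixed by hand."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        w = w - alpha * grad_loss(w, X, y)  # w_{k+1} = w_k - alpha * grad
    return w

# Usage on a small synthetic dataset
np.random.seed(0)
X = np.random.randn(200, 5)
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ w_true + 0.1 * np.random.randn(200)

w_hat = gradient_descent(X, y)
print(loss(w_hat, X, y))
```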

Stochastic Gradient Descent (SGD)

Unfortunately, even if it is easy to compute the gradient of $\ell_i(w; x_i, y_i)$ for any $i$, when the number of samples $N$ is large (which is common in Machine Learning), the computation of the full gradient $\nabla_w \ell(w; D)$ is prohibitive, mostly because of memory limitations. For this reason, in such optimization problems, instead of using the standard GD algorithm, it is better to use the Stochastic Gradient Descent (SGD) method. This is a variant of classical GD where, instead of computing $\nabla_w \ell(w; D) = \sum_{i=1}^N \nabla_w \ell_i(w; x_i, y_i)$, the summation is reduced to a limited number of terms, called a batch. The idea is the following:

  • Given a number $N_{batch}$ (usually called batch_size), randomly extract a sub-dataset $\mathcal{M} \subset D$ with $|\mathcal{M}| = N_{batch}$ from $D$. This set will be called a batch;

  • Approximate the true gradient

    $$\nabla_w \ell(w; D) = \sum_{i=1}^N \nabla_w \ell_i(w; x_i, y_i)$$

    with

    $$\nabla_w \ell(w; \mathcal{M}) = \sum_{i \in \mathcal{M}} \nabla_w \ell_i(w; x_i, y_i);$$

  • Compute one single iteration of the GD algorithm

    $$w_{k+1} = w_k - \alpha_k \nabla_w \ell(w_k; \mathcal{M});$$
  • Repeat until you have extracted the full dataset. Notice that the random sampling at each iteration is done without replacement.

Each iteration of the algorithm above is usually called a batch iteration. When the whole dataset has been processed, we say that we have completed an epoch of the SGD method. This algorithm should be repeated for a fixed number of epochs to reach convergence.
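To make the procedure concrete, here is a minimal sketch of the SGD loop in Python/NumPy, assuming the same linear model and MSE gradient as in the example above; shuffling the indices once per epoch and slicing them into consecutive batches implements the sampling without replacement, while names such as `sgd`, `batch_size` and `n_epochs` are illustrative.

```python
import numpy as np

def grad_loss(w, X, y):
    """MSE gradient (2/N) * sum_i x_i * (f_w(x_i) - y_i) for f_w(x) = w^T x."""
    N = X.shape[0]
    return (2.0 / N) * X.T @ (X @ w - y)

def sgd(X, y, alpha=0.1, batch_size=32, n_epochs=20):
    """Stochastic Gradient Descent: one GD step per batch, for a fixed number of epochs."""
    N, d = X.shape
    w = np.zeros(d)
    for epoch in range(n_epochs):
        # Shuffle the indices once per epoch: slicing them into consecutive
        # blocks then visits each datapoint exactly once (no replacement).
        idx = np.random.permutation(N)
        for start in range(0, N, batch_size):
            batch = idx[start:start + batch_size]             # indices of the batch M
            w = w - alpha * grad_loss(w, X[batch], y[batch])  # one batch iteration
    return w

# Usage on a small synthetic dataset
np.random.seed(0)
X = np.random.randn(500, 5)
w_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ w_true + 0.1 * np.random.randn(500)

w_hat = sgd(X, y)
print(np.mean((X @ w_hat - y) ** 2))
```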

Drawbacks of SGD

Unfortunately, one of the biggest drawbacks of SGD with respect to GD is that we cannot check convergence anymore (since we obviously cannot compute the gradient of $\ell(w; D)$ to check its distance from zero, as required by the first Stopping Criteria), and we cannot use the backtracking algorithm, for the same reason. As a consequence, the algorithm will stop ONLY after reaching the fixed number of epochs, and we must set a good value for the step size $\alpha_k$ by hand. Those problems are partially solved by recent algorithms like SGD with Momentum, Adam, AdaGrad, …, whose study is beyond the scope of the course.
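For reference only, a common form of the momentum update (with velocity $v_k$ and momentum parameter $\beta \in [0, 1)$, typically around $0.9$) is

$$v_{k+1} = \beta v_k + \nabla_w \ell(w_k; \mathcal{M}), \qquad w_{k+1} = w_k - \alpha_k v_{k+1},$$

so that the noisy batch gradients are averaged over time instead of being used directly.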