Homework 3: Optimization

Optimization via Gradient Descent

In this Homework, we will consider a general optimization problem:

x^{*} = \arg\min_{x \in R^{n}} f (x) .

where $f : R^{n} \to R$ is a differentiable function for which we know how to compute $\nabla f (x)$. The problem can be solved by the Gradient Descent (GD) method: an iterative algorithm that, given an initial iterate $x_{0} \in R^{n}$ and a positive parameter $α_{k}$ called the step size, computes:

x_{k + 1} = x_{k} - α_{k} \nabla f (x_{k}) .
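As an illustration of the update rule above, here is a minimal sketch of GD with a constant step size. The function name `gd_fixed_step`, its return values, and the two stopping rules (relative gradient norm below `tolf`, step length below `tolx`) are assumptions made for this example; the input-output structure required by the course exercises may differ.

```python
import numpy as np

def gd_fixed_step(grad_f, x0, alpha, kmax=100, tolf=1e-5, tolx=1e-5):
    """Gradient descent with a constant step size (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)
    grad0_norm = np.linalg.norm(grad_f(x))
    history = [x.copy()]  # iterates x_0, x_1, ...

    for _ in range(kmax):
        # GD update: x_{k+1} = x_k - alpha * grad f(x_k)
        x_new = x - alpha * grad_f(x)
        history.append(x_new.copy())

        # Assumed stopping rules: small relative gradient or small step.
        small_grad = np.linalg.norm(grad_f(x_new)) < tolf * grad0_norm
        small_step = np.linalg.norm(x_new - x) < tolx
        x = x_new
        if small_grad or small_step:
            break

    return x, np.array(history)
```

For example, with $f (x_{1}, x_{2}) = (x_{1} - 3)^{2} + (x_{2} - 1)^{2}$ one can pass `grad_f = lambda x: 2 * (x - np.array([3.0, 1.0]))`, `x0 = np.zeros(2)` and `alpha = 0.1`; the rows of the returned history can then be used to plot the error against $k$.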

You are asked to implement the GD method in Python and to test it on some example functions. In particular:

Write a script that implements the GD algorithm with fixed step size (i.e. no backtracking), with the input-output structure discussed in the first Exercise of the Gradient Descent section (https://devangelista2.github.io/statistical-mathematical-methods/Optimization/GD.html).
Write a script that implements the GD algorithm with backtracking, with the input-output structure discussed in the second Exercise of the Gradient Descent section (https://devangelista2.github.io/statistical-mathematical-methods/Optimization/GD.html).
Test the algorithm above on the following functions:
1. $f : R^{2} \to R$ such that:
  $f (x_{1}, x_{2}) = (x_{1} - 3)^{2} + (x_{2} - 1)^{2},$
  for which the true solution is $x^{*} = (3, 1)^{T}$ .
2. $f : R^{2} \to R$ such that:
  $f (x_{1}, x_{2}) = 10 (x_{1} - 1)^{2} + (x_{2} - 2)^{2},$
  for which the true solution is $x^{*} = (1, 2)^{T}$ .
3. $f : R^{n} \to R$ such that:
  $f (x) = \frac{1}{2} \| A x - b \|_{2}^{2},$
  where $A \in R^{n \times n}$ is the Vandermonde matrix associated with the vector $v \in R^{n}$ that contains $n$ equispaced values in the interval $[0, 1]$ , and $b \in R^{n}$ is computed by first setting $x^{*} = (1, 1, \dots, 1)^{T}$ , and then $b = A x^{*}$ . Try different values of $n$ (e.g. $n = 5, 10, 15, \dots$ ).
4. $f : R^{n} \to R$ such that:
  $f (x) = \frac{1}{2} \| A x - b \|_{2}^{2} + \frac{λ}{2} \| x \|_{2}^{2},$
  where $A \in R^{n \times n}$ and $b \in R^{n}$ are the same as in the exercise above, while $λ$ is a fixed value in the interval $[0, 1]$ . Try different values of $λ$ and comment on the results.
5. $f : R \to R$ such that:
  $f (x) = x^{4} + x^{3} - 2 x^{2} - 2 x .$
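Functions 3 and 4 need the matrix $A$, the vector $b$ and a gradient. Differentiating the definitions gives $\nabla f (x) = A^{T} (A x - b) + λ x$, with $λ = 0$ for function 3. A possible way to set this up is sketched below; the helper name `make_vandermonde_problem` is hypothetical, and the column ordering of the Vandermonde matrix (increasing powers) is an assumption, since the text does not fix it.

```python
import numpy as np

def make_vandermonde_problem(n, lam=0.0):
    """Build f and grad_f for functions 3 (lam = 0) and 4 (lam > 0)."""
    v = np.linspace(0.0, 1.0, n)              # n equispaced values in [0, 1]
    A = np.vander(v, n, increasing=True)      # assumed ordering: 1, v, v^2, ...
    x_true = np.ones(n)
    b = A @ x_true

    def f(x):
        r = A @ x - b
        return 0.5 * (r @ r) + 0.5 * lam * (x @ x)

    def grad_f(x):
        # Gradient of 0.5 * ||Ax - b||^2 + 0.5 * lam * ||x||^2
        return A.T @ (A @ x - b) + lam * x

    return f, grad_f, A, b, x_true
```

Note that for $λ > 0$ the vector of ones is in general no longer the minimizer, which is one of the effects worth commenting on. Printing `np.linalg.cond(A)` for growing $n$ may also help in interpreting the behaviour of GD on these problems.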
For each of the functions above, test the GD method with and without backtracking, trying different values for the step size $α > 0$ when backtracking is not employed. Comment on the results.
Plot the value of $\| \nabla f (x_{k}) \|_{2}$ as a function of $k$ , check that it goes to zero, and compare the convergence speed (in terms of the number of iterations $k$ ) for the different values of $α > 0$ and with backtracking.
For each of the points above, use:
- x0 = $(0, 0, \dots, 0)^{T}$ (except for function 5, which is discussed in the following point),
- kmax = 100,
- tolf = tolx = 1e-5.
Also, when the true solution $x^{*}$ is given, plot the error $\| x_{k} - x^{*} \|_{2}$ as a function of $k$ .
Plot the graph of the non-convex function 5 in the interval $[- 3, 3]$ , and test the convergence of GD with different values of x0 (of your choice) and different step-sizes. When is the convergence point the global minimum?
Hard (optional): For functions 1 and 2, show the contour plot around the true minimum and visualize the path described by the iterations, i.e. representing on the contour plot the position of each iterate computed by the GD algorithm. See the plt.contour documentation.
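The text does not spell out the backtracking rule. A common choice, assumed here, is the Armijo sufficient-decrease condition: starting from an initial guess, shrink $α$ until $f (x - α \nabla f (x)) \le f (x) - c α \| \nabla f (x) \|_{2}^{2}$. The sketch below uses hypothetical names and default constants (`alpha0 = 1`, `c = 0.25`, `tau = 0.5`) that are not taken from the course material, and the same assumed stopping rules based on `tolf` and `tolx`.

```python
import numpy as np

def backtracking(f, grad_f, x, alpha0=1.0, c=0.25, tau=0.5, max_shrinks=50):
    """Shrink alpha until the (assumed) Armijo condition holds."""
    alpha = alpha0
    g = grad_f(x)
    fx = f(x)
    gg = g @ g
    for _ in range(max_shrinks):
        if f(x - alpha * g) <= fx - c * alpha * gg:
            break
        alpha *= tau
    return alpha

def gd_backtracking(f, grad_f, x0, kmax=100, tolf=1e-5, tolx=1e-5):
    """Gradient descent where each step size is chosen by backtracking."""
    x = np.asarray(x0, dtype=float)
    grad0_norm = np.linalg.norm(grad_f(x))
    grad_norms = [grad0_norm]  # ||grad f(x_k)||_2 for each k

    for _ in range(kmax):
        alpha = backtracking(f, grad_f, x)
        x_new = x - alpha * grad_f(x)
        grad_norms.append(np.linalg.norm(grad_f(x_new)))
        step = np.linalg.norm(x_new - x)
        x = x_new
        # Assumed stopping rules: small relative gradient or small step.
        if grad_norms[-1] < tolf * grad0_norm or step < tolx:
            break

    return x, np.array(grad_norms)
```

The returned array of gradient norms can be passed directly to a semilogarithmic plot to compare the convergence speed against the fixed-step runs.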

Optimization via Stochastic Gradient Descent

Consider a dataset $(X, Y)$ , where:

X = [x^{1} x^{2} \dots x^{N}] \in R^{d \times N}, Y = [y^{1} y^{2} \dots y^{N}] \in R^{N},

together with a model $f_{θ} (x)$ , with vector of parameters $θ$ . Training a ML model requires solving:

θ^{*} = \arg\min_{θ} ℓ (θ; X, Y) = \arg\min_{θ} \sum_{i = 1}^{N} ℓ_{i} (θ; x^{i}, y^{i}) .

Since the objective above is a sum of independent terms, each depending on a single datapoint, it satisfies the hypotheses for applying the Stochastic Gradient Descent (SGD) algorithm, which proceeds as follows:

Given an integer batch_size, randomly extract a sub-dataset $M$ such that $|M| =$ batch_size from the original dataset. Note that the random sampling at each iteration has to be done without replacement.
Compute the gradient of the loss function on the sampled batch $M$ as:
$\nabla ℓ (θ; M) = \frac{1}{|M|} \sum_{i \in M} \nabla ℓ (θ; x^{i}, y^{i}),$
Compute a single iteration of the GD algorithm in the direction described by $\nabla ℓ (θ; M)$ :
$θ_{k + 1} = θ_{k} - α_{k} \nabla ℓ (θ_{k}; M),$
Repeat until the full dataset has been extracted. When this happens, we say that we completed an epoch of the SGD method. Repeat this procedure for a number of epochs equal to a parameter n_epochs, given as input.
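The steps above can be sketched as follows. This is only one possible arrangement: the function name `sgd`, the constant step size, and the signature of `grad_loss` (a user-supplied function returning the batch-averaged gradient) are assumptions for illustration. Sampling without replacement is obtained by shuffling the indices once per epoch and walking through them in consecutive batches; the last batch of an epoch may be smaller when batch_size does not divide $N$.

```python
import numpy as np

def sgd(grad_loss, theta0, X, Y, alpha=1e-2, batch_size=32, n_epochs=10, seed=0):
    """Mini-batch SGD; X has shape (d, N) and Y has shape (N,)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    N = X.shape[1]

    for _ in range(n_epochs):
        # One epoch: visit every datapoint exactly once, in random order.
        perm = rng.permutation(N)
        for start in range(0, N, batch_size):
            idx = perm[start:start + batch_size]
            # grad_loss is assumed to return the mean gradient over the batch.
            theta = theta - alpha * grad_loss(theta, X[:, idx], Y[idx])

    return theta
```

As a usage example, for a linear model $f_{θ} (x) = θ^{T} x$ with squared loss $\frac{1}{2} (θ^{T} x^{i} - y^{i})^{2}$, the batch gradient can be written as `lambda theta, Xb, Yb: Xb @ (Xb.T @ theta - Yb) / Xb.shape[1]`.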

Consider the dataset poly_regression_large.csv, provided on Virtuale, and let $f_{θ} (x)$ be a polynomial regression model, as discussed in https://devangelista2.github.io/statistical-mathematical-methods/regression_classification/regression.html.

Split the dataset into training and test sets as in Homework 2, with a proportion of 80% training and 20% test.
Fix a degree $K$ for the polynomial.
Train the polynomial regression model on the training set via the Stochastic Gradient Descent algorithm.
Train the polynomial regression model on the training set via the Gradient Descent algorithm.
Train the polynomial regression model on the poly_regression_small.csv dataset. Use the full dataset for this test, without splitting it into training and test set.
Compare the performance of the three regression models computed above. In particular, if $(X_{test}, Y_{test})$ is the test set from the poly_regression_large.csv dataset, for each model compute:
$Err = \frac{1}{N_{test}} \sum_{i = 1}^{N_{test}} (f_{θ} (x^{i}) - y^{i})^{2},$
where $N_{test}$ is the number of elements in the test set and $(x^{i}, y^{i})$ are the input and output elements in the test set. Comment on the performance of the three models.
Repeat the experiment by varying the degree $K$ of the polynomial. Comment on the results.
Set $K = 5$ (so that the polynomial regression model is a polynomial of degree 4). Compare the parameters learned by the three models with the true parameter $θ^{*} = [0, 0, 4, 0, - 3]$ .
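To make the evaluation step concrete, the sketch below shows one way to build the polynomial features, split the data and compute $Err$. It assumes scalar inputs stored in 1-D arrays and the convention that $θ_{j}$ multiplies $x^{j}$ for $j = 0, \dots, K - 1$, which matches the remark that $K = 5$ gives a polynomial of degree 4. The helper names are hypothetical, and the layout of the CSV files is not assumed here, so loading the data is left out.

```python
import numpy as np

def poly_features(x, K):
    """Return the N x K matrix whose rows are (1, x, x^2, ..., x^(K-1))."""
    return np.vander(np.asarray(x, dtype=float), K, increasing=True)

def predict(theta, x):
    """Evaluate the polynomial model with coefficients theta at the points x."""
    theta = np.asarray(theta, dtype=float)
    return poly_features(x, len(theta)) @ theta

def average_squared_error(theta, x_eval, y_eval):
    """Err = (1 / N_test) * sum_i (f_theta(x^i) - y^i)^2."""
    return np.mean((predict(theta, x_eval) - np.asarray(y_eval)) ** 2)

def split_train_test(x, y, train_fraction=0.8, seed=0):
    """Random split into training and test sets (assumed 80/20 by default)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))
    n_train = int(train_fraction * len(x))
    tr, te = perm[:n_train], perm[n_train:]
    return x[tr], y[tr], x[te], y[te]
```

The same `average_squared_error` call can be reused for all three trained models, so that they are compared on exactly the same test set.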
