
Probabilities in Machine Learning

In the ML lecture we said that a major assumption in Machine Learning is that there exists a (possibly stochastic) target function $f$ such that $y = f(x)$ for any $x$, and such that the dataset

$$
\mathcal{D} = \{ (x^i, y^i) \}_{i=1}^N
$$

is generated by considering independent identically distributed (i.i.d.) samples $x^i \sim p(x)$, where $p(x)$ is the unknown distribution of the inputs, and considering $y^i = f(x^i)$ for any $i = 1, \dots, N$. When $f$ is a stochastic function, we can consider the sampling process of $y^i$ as $y^i \sim p(y | x = x^i)$ for any $i = 1, \dots, N$. In this setup, we can consider the decomposition

$$
p(x, y) = p(y | x) p(x),
$$

where $p(x, y)$ is the joint distribution, $p(x)$ is called prior distribution over $x$, while $p(y | x)$ is the likelihood or posterior distribution of $y$ given $x$. With this framework, learning a Machine Learning model $f_\theta(x) \approx f(x)$ for any $x$, with parameters $\theta$, can be reformulated as learning a parameterized distribution $p(y | x, \theta)$ which maximizes the probability of observing $y = y^i$, given $x = x^i$.
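To make the setup concrete, here is a minimal NumPy sketch of this data-generation process (the target function $f$, the input distribution $p(x)$ and the noise level are illustrative assumptions, not taken from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

N = 50                                   # number of datapoints
f = lambda x: np.sin(2 * np.pi * x)      # illustrative "true" target function
sigma = 0.1                              # standard deviation of the observation noise

x = rng.uniform(0, 1, size=N)            # x^i ~ p(x), here p(x) uniform on [0, 1]
y = f(x) + sigma * rng.normal(size=N)    # y^i ~ p(y | x = x^i) = N(f(x^i), sigma^2)
```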

Maximum Likelihood Estimation (MLE)

Intuitively, we would like to find parameters $\theta$ such that the probability of observing the outputs $Y = [y^1, \dots, y^N]^T$ given the inputs $X = [x^1, \dots, x^N]^T$ and the parameters $\theta$ is as high as possible. Consequently, we have to solve the optimization problem

$$
\theta_{MLE} = \arg\max_\theta \; p(Y | X, \theta), \label{eq:mle_formulation1}
$$

which is usually called Maximum Likelihood Estimation (MLE), because the parameters are chosen such that they maximize the likelihood $p(Y | X, \theta)$. Since the datapoints $(x^i, y^i)$ are independent under $p(x, y)$,

$$
p(Y | X, \theta) = \prod_{i=1}^N p(y^i | X, \theta),
$$

and since $y^i$ is independent of $x^j$ for any $j \neq i$, then

$$
p(y^i | X, \theta) = p(y^i | x^i, \theta).
$$

Consequently, \eqref{eq:mle_formulation1} becomes:

$$
\theta_{MLE} = \arg\max_\theta \; \prod_{i=1}^N p(y^i | x^i, \theta). \label{eq:mle_formulation2}
$$

Since the logarithm function is monotonic, applying it to the optimization problem \eqref{eq:mle_formulation2} does not alter its solution. Moreover, since for any function $g(\theta)$, $\arg\max_\theta g(\theta) = \arg\min_\theta - g(\theta)$, \eqref{eq:mle_formulation2} can be restated as

$$
\theta_{MLE} = \arg\min_\theta \; - \sum_{i=1}^N \log p(y^i | x^i, \theta), \label{eq:mle_formulation3}
$$

which is the classical formulation of an MLE problem. Note that in \eqref{eq:mle_formulation3}, the objective function has been decomposed into a sum over the datapoints $(x^i, y^i)$, for any $i = 1, \dots, N$, as a consequence of the data being i.i.d.. This formulation is similar to what we required in the previous lecture, implying that we can use SGD to (approximately) solve \eqref{eq:mle_formulation3}.
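As a side remark, the logarithm is not only a mathematical convenience. A minimal NumPy sketch (with a toy Gaussian likelihood and a constant model $f_\theta(x) = \theta$, both illustrative assumptions) shows that the product in \eqref{eq:mle_formulation2} underflows to zero in floating point, while the sum of log-likelihoods in \eqref{eq:mle_formulation3} remains perfectly usable:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1
y = np.sin(2 * np.pi * rng.uniform(0, 1, 500)) + sigma * rng.normal(size=500)

theta = 0.0                                # constant toy model f_theta(x) = theta
lik = np.exp(-(y - theta) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

print(np.prod(lik))                        # product of likelihoods: underflows to 0.0
print(np.sum(np.log(lik)))                 # sum of log-likelihoods: finite, usable by (S)GD
```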

Gaussian Assumption

To effectively solve \eqref{eq:mle_formulation3}, we must explicitly define $p(y | x, \theta)$. A common assumption, which is reasonable in most scenarios, is to consider

$$
y | x, \theta \sim \mathcal{N}(f_\theta(x), \sigma^2 I),
$$

where $\mathcal{N}(\mu, \sigma^2 I)$ is a Gaussian distribution with mean $\mu$ and variance $\sigma^2 I$, with $f_\theta(x)$ a parametric deterministic function of $x$, while $\sigma^2$ is the variance of $y$, which depends on the information we have on the relationship between $x$ and $y$ (it will be clearer in the following example).

An interesting property of the Gaussian distribution is that if $z \sim \mathcal{N}(\mu, \sigma^2 I)$, then $\frac{z - \mu}{\sigma} \sim \mathcal{N}(0, I)$, where $\mathcal{N}(0, I)$ is a Standard Gaussian distribution.

To simplify the derivation below, assume that $y^i \in \mathbb{R}$, so that $f_\theta(x^i) \in \mathbb{R}$ and $\sigma^2 > 0$ is a scalar, for any $i = 1, \dots, N$. It is known that if $z \sim \mathcal{N}(\mu, \sigma^2)$, then

$$
p(z) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{(z - \mu)^2}{2 \sigma^2}},
$$

thus

$$
- \log p(y^i | x^i, \theta) = \frac{1}{2 \sigma^2} (f_\theta(x^i) - y^i)^2 + \text{const}.
$$

Consequently, MLE with Gaussian likelihood becomes

$$
\theta_{MLE} = \arg\min_\theta \; \frac{1}{2 \sigma^2} \sum_{i=1}^N (f_\theta(x^i) - y^i)^2, \label{eq:MLE_gaussian}
$$

which can be reformulated as a Least Squares problem

$$
\theta_{MLE} = \arg\min_\theta \; \frac{1}{2} \| F_\theta(X) - Y \|_2^2,
$$

where $F_\theta(X) = [f_\theta(x^1), \dots, f_\theta(x^N)]^T$, while $Y = [y^1, \dots, y^N]^T$ (the multiplicative constant $\frac{1}{\sigma^2}$ does not change the minimizer and has been dropped).

Polynomial Regression MLE

Now, consider a Regression model

$$
f_\theta(x) = \theta_0 \phi_0(x) + \theta_1 \phi_1(x) + \dots + \theta_K \phi_K(x) = \phi(x)^T \theta,
$$

and assume that

$$
y^i | x^i, \theta \sim \mathcal{N}(f_\theta(x^i), \sigma^2).
$$

Then, by \eqref{eq:MLE_gaussian},

$$
\theta_{MLE} = \arg\min_\theta \; \frac{1}{2} \| \Phi(X) \theta - Y \|_2^2, \label{eq:MLE_regression}
$$

where

$$
\Phi(X) = \begin{bmatrix} \phi_0(x^1) & \phi_1(x^1) & \dots & \phi_K(x^1) \\ \vdots & \vdots & & \vdots \\ \phi_0(x^N) & \phi_1(x^N) & \dots & \phi_K(x^N) \end{bmatrix}
$$

is the Vandermonde matrix associated with the vector $X = [x^1, \dots, x^N]^T$ and with the feature vectors $\phi(x^i) = (\phi_0(x^i), \phi_1(x^i), \dots, \phi_K(x^i))$. Clearly, when $\phi_j(x) = x^j$, the regression model is a Polynomial Regression model and the associated Vandermonde matrix is the classical Vandermonde Matrix

$$
\Phi(X) = \begin{bmatrix} 1 & x^1 & (x^1)^2 & \dots & (x^1)^K \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x^N & (x^N)^2 & \dots & (x^N)^K \end{bmatrix}.
$$
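As a quick illustration, the classical Vandermonde matrix can be assembled in NumPy as follows (a minimal sketch with an illustrative input vector; `np.vander` with `increasing=True` builds the same matrix):

```python
import numpy as np

def vandermonde(x, K):
    """Return the N x (K+1) matrix with rows [1, x^i, (x^i)^2, ..., (x^i)^K]."""
    x = np.asarray(x, dtype=float)
    return np.stack([x**j for j in range(K + 1)], axis=1)

x = np.array([0.0, 0.5, 1.0, 1.5])                      # illustrative inputs
Phi = vandermonde(x, K=3)
assert np.allclose(Phi, np.vander(x, 4, increasing=True))
```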

Note that \eqref{eq:MLE_regression} defines a training procedure for a regression model. Indeed, it can be optimized by Gradient Descent (or its Stochastic variant), by solving

$$
\min_\theta \; \mathcal{L}(\theta) = \frac{1}{2} \| \Phi(X) \theta - Y \|_2^2,
$$

where

$$
\nabla_\theta \mathcal{L}(\theta) = \Phi(X)^T ( \Phi(X) \theta - Y ).
$$
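A minimal Gradient Descent sketch for this loss (the synthetic data, the degree $K$ and the step size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative noisy cubic data
x = rng.uniform(-1, 1, 100)
Y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.normal(size=100)

K = 3
Phi = np.vander(x, K + 1, increasing=True)              # Vandermonde matrix Phi(X)

def grad(theta):
    return Phi.T @ (Phi @ theta - Y)                    # gradient of 0.5 * ||Phi theta - Y||^2

theta = np.zeros(K + 1)
alpha = 1.0 / np.linalg.norm(Phi.T @ Phi, 2)            # step size <= 1/L (illustrative choice)
for _ in range(5000):
    theta = theta - alpha * grad(theta)

print(theta, 0.5 * np.sum((Phi @ theta - Y) ** 2))      # learned parameters and final loss
```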

Direct solution by Normal Equations

Note that, since the learning problem is a Least Squares problem of the form

$$
\min_\theta \; \frac{1}{2} \| \Phi(X) \theta - Y \|_2^2,
$$

then it can be solved directly by the Normal Equations method, i.e. by solving the linear system

$$
\Phi(X)^T \Phi(X) \theta = \Phi(X)^T Y.
$$

This solution can be compared with the convergence point of Gradient Descent, to check the differences.
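A minimal sketch of the Normal Equations solution, on the same illustrative data (and seed) as the Gradient Descent sketch above, so the two results can be compared:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
Y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.normal(size=100)
Phi = np.vander(x, 4, increasing=True)

# Normal Equations: Phi^T Phi theta = Phi^T Y
theta_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)

# Numerically more robust alternative: a least-squares solver
theta_lstsq, *_ = np.linalg.lstsq(Phi, Y, rcond=None)

print(theta_ne)          # should essentially coincide with the Gradient Descent limit point
print(theta_lstsq)
```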

MLE + Flexibility = Overfit

In polynomial regression, the most important parameter the user has to set is the degree of the polynomial, $K$. Indeed, when $K$ is low, the resulting model will be pretty rigid (not flexible), with the implication that it can potentially be unable to capture the complexity of the data. On the opposite side, if $K$ is too large, the resulting model is too flexible, and we end up learning the noise. The former situation, which is called underfitting, can be easily diagnosed by looking at a plot of the resulting model with respect to the data (or, equivalently, by checking the accuracy of the model). Conversely, when the model is too flexible, we are in a harder scenario known as overfitting. In overfitting, the model is not capturing the underlying structure of the data, but it is memorizing the training set, usually resulting in optimal training error and bad test predictions.

Ideally, when the data is generated by a noisy polynomial experiment, we would like to set $K$ to the true degree of such polynomial. Unfortunately, this is not always possible and, indeed, spotting overfitting is one of the hardest issues to solve while working with Machine Learning.

Solving overfitting using the error plot

A common way to solve overfitting is to plot the error of the learnt model with respect to its complexity (i.e. the degree of the polynomial). In particular, for $K = 1, 2, \dots, K_{max}$, one can train a polynomial regressor of degree $K$ over the training set and compute the training error as the average absolute error of the prediction on the training set, i.e.

$$
\mathcal{E}_{train}(K) = \frac{1}{N_{train}} \sum_{i=1}^{N_{train}} | f_\theta(x^i) - y^i |,
$$

and, for the same set of parameters, the test error

$$
\mathcal{E}_{test}(K) = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} | f_\theta(x^i_{test}) - y^i_{test} |.
$$

If we plot the training and test error with respect to the different values of $K$, we will typically observe that the training error keeps decreasing as $K$ grows, while the test error decreases up to a certain degree and then starts increasing again as the model begins to overfit. The value of $K$ for which the test error is lowest is the correct parameter, suffering neither underfitting nor overfitting.
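A minimal sketch of this procedure (the synthetic data, the train/test split and the degree range are illustrative assumptions): for each degree $K$ the model is fitted on the training split and the average absolute error is recorded on both splits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative noisy cubic data, split into training and test sets
x = rng.uniform(-1, 1, 200)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.normal(size=200)
x_tr, y_tr, x_te, y_te = x[:150], y[:150], x[150:], y[150:]

for K in range(1, 11):
    Phi_tr = np.vander(x_tr, K + 1, increasing=True)
    Phi_te = np.vander(x_te, K + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)   # least-squares (MLE) fit
    err_tr = np.mean(np.abs(Phi_tr @ theta - y_tr))         # training error
    err_te = np.mean(np.abs(Phi_te @ theta - y_te))         # test error
    print(K, err_tr, err_te)
```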

A better solution: Maximum A Posteriori (MAP)

A completely different approach to overfitting is to change the perspective and stop using MLE. The idea is to reverse the problem: instead of searching for parameters $\theta$ maximizing the probability of observing the outcomes $Y$ given the data $X$ and $\theta$, i.e. maximizing $p(Y | X, \theta)$ as in MLE, we try to maximize the probability of the parameters $\theta$, given the observed data $X$ and $Y$. Mathematically, we are asked to solve the optimization problem

$$
\theta_{MAP} = \arg\max_\theta \; p(\theta | X, Y). \label{eq:MAP_formulation1}
$$

Since $p(\theta | X, Y)$ is called posterior distribution, this method is usually referred to as **Maximum A Posteriori (MAP)**.

Bayes Theorem

A problem of MAP is that it is non-trivial to find a formulation for the posterior $p(\theta | X, Y)$. Indeed, while with MLE the Gaussian assumption made sense, as a consequence of the hypothesis that the observations are obtained by corrupting a deterministic function of $x$ with Gaussian noise, this does not hold true for MAP, since in general the distribution of $\theta$ given the data is not Gaussian.

Luckily, we can express the posterior distribution $p(\theta | X, Y)$ in terms of the likelihood $p(Y | X, \theta)$ (which we know to be Gaussian) and the prior $p(\theta)$, as a consequence of Bayes Theorem. Indeed, it holds

$$
p(\theta | X, Y) = \frac{p(Y | X, \theta) p(\theta)}{p(Y | X)}.
$$
Gaussian assumption on MAP

From what we observed above, the posterior distribution $p(\theta | X, Y)$ can be rewritten as a function of the likelihood $p(Y | X, \theta)$ and the prior $p(\theta)$. Thus, \eqref{eq:MAP_formulation1} can be rewritten as

$$
\theta_{MAP} = \arg\max_\theta \; \frac{p(Y | X, \theta) p(\theta)}{p(Y | X)}.
$$

With the same trick we used in MLE, we can change it to a minimum point estimation by changing the sign of the function and by taking the logarithm. We obtain

$$
\theta_{MAP} = \arg\min_\theta \; - \log p(Y | X, \theta) - \log p(\theta), \label{eq:MAP_formulation3}
$$

where we removed $p(Y | X)$ since it is constant in $\theta$.

Since the datapoints $(x^i, y^i)$ are i.i.d. by hypothesis, by following the same procedure of MLE we can split \eqref{eq:MAP_formulation3} into a sum over the datapoints, as

$$
\theta_{MAP} = \arg\min_\theta \; - \sum_{i=1}^N \log p(y^i | x^i, \theta) - \log p(\theta).
$$

Now, if we assume that $y^i | x^i, \theta \sim \mathcal{N}(f_\theta(x^i), \sigma^2)$, the same computation we did in MLE implies

$$
\theta_{MAP} = \arg\min_\theta \; \frac{1}{2 \sigma^2} \sum_{i=1}^N (f_\theta(x^i) - y^i)^2 - \log p(\theta).
$$

To complete the derivation, we have to rewrite $- \log p(\theta)$ in a meaningful way, to be able to perform the optimization. To do that, it is common to assume that $\theta \sim \mathcal{N}(0, \sigma_\theta^2 I)$, a Gaussian distribution with zero mean and variance $\sigma_\theta^2 I$. Under this assumption,

$$
- \log p(\theta) = \frac{1}{2 \sigma_\theta^2} \| \theta \|_2^2 + \text{const},
$$

and consequently

$$
\theta_{MAP} = \arg\min_\theta \; \frac{1}{2} \sum_{i=1}^N (f_\theta(x^i) - y^i)^2 + \frac{\lambda}{2} \| \theta \|_2^2,
$$

where $\lambda = \frac{\sigma^2}{\sigma_\theta^2} > 0$ (obtained after multiplying the objective by the constant $\sigma^2$) is a positive parameter, usually called regularization parameter. This equation is the final MAP loss function under the Gaussian assumption for both $p(Y | X, \theta)$ and $p(\theta)$. Clearly, it is another Least Squares problem which can be solved by Gradient Descent or Stochastic Gradient Descent.

When $f_\theta(x)$ is a polynomial regression model, $f_\theta(x) = \phi(x)^T \theta$ with $\phi_j(x) = x^j$, then

$$
\theta_{MAP} = \arg\min_\theta \; \frac{1}{2} \| \Phi(X) \theta - Y \|_2^2 + \frac{\lambda}{2} \| \theta \|_2^2
$$

can be also solved by Normal Equations, as

$$
( \Phi(X)^T \Phi(X) + \lambda I ) \theta = \Phi(X)^T Y.
$$
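A minimal sketch of the MAP (Ridge Regression) solution via the regularized Normal Equations (the data, the degree and the value of $\lambda$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
Y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.normal(size=100)

K, lam = 9, 1e-2                                         # deliberately high degree + regularization
Phi = np.vander(x, K + 1, increasing=True)

# Regularized Normal Equations: (Phi^T Phi + lam I) theta = Phi^T Y
theta_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(K + 1), Phi.T @ Y)
theta_mle = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)      # unregularized MLE, for comparison

# The MAP solution typically has a much smaller norm (less prone to overfitting)
print(np.linalg.norm(theta_map), np.linalg.norm(theta_mle))
```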

Ridge Regression and LASSO

When the Gaussian assumption is used for both the likelihood $p(Y | X, \theta)$ and the prior $p(\theta)$, the resulting MAP is usually called Ridge Regression in the literature. On the contrary, if $p(Y | X, \theta)$ is Gaussian and $p(\theta)$ is a Laplacian distribution with mean $0$ and scale $\sigma_\theta$, then

$$
- \log p(\theta) = \frac{1}{\sigma_\theta} \| \theta \|_1 + \text{const},
$$

and consequently (prove it by exercise)

$$
\theta_{MAP} = \arg\min_\theta \; \frac{1}{2} \| \Phi(X) \theta - Y \|_2^2 + \lambda \| \theta \|_1.
$$

The resulting model is called LASSO, and it is the basis for many classical, state-of-the-art regression models.
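For completeness, a minimal LASSO sketch on polynomial features using scikit-learn's `Lasso` (an external library, not part of the lecture; its `alpha` parameter plays the role of the regularization parameter $\lambda$, and the data are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 1.0 - 2.0 * x + 0.5 * x**3 + 0.1 * rng.normal(size=100)

Phi = np.vander(x, 10, increasing=True)                  # degree-9 polynomial features

# L1-regularized least squares (LASSO)
lasso = Lasso(alpha=1e-2, fit_intercept=False, max_iter=10000)
lasso.fit(Phi, y)
print(lasso.coef_)       # the L1 penalty tends to drive several coefficients exactly to zero
```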