Binary Classification

Assume we are working in a binary classification setup, i.e. where the number of classes is $K = 2$. In this case, we can always represent the two classes as $0$ and $1$.

Consequently, we can develop a classification algorithm by defining a model $f_w : \mathbb{R}^d \to [0, 1]$ and interpreting the outcome of $f_w(x)$ as the probability that $y = 1$. Thus, if $f_w(x) = 0$, then our model predicts that there are zero chances of $x$ being classified as $1$, meaning that $y$ should be a $0$. If $f_w(x) = 1$, then $x$ will be classified as $1$, while if $f_w(x) = 0.5$ then the model is not able to correctly identify the class of $x$, since the chances of $y$ being $1$ are 50%.

An interesting feature of $f_w$ is that it turns a discontinuous task (i.e. mapping a real vector $x \in \mathbb{R}^d$ to a discrete value $y \in \{0, 1\}$) into a continuous problem. The discreteness of the outcome is then recovered by mapping $f_w(x)$ to $1$ if $f_w(x) > 0.5$, and to $0$ if $f_w(x) < 0.5$. For instance, if $f_w(x) = 0.83$, then $x$ is assigned to class $1$. In formula, a datapoint $x$ will be classified as:

$$y = \begin{cases} 1 & \text{if } f_w(x) \geq 0.5, \\ 0 & \text{if } f_w(x) < 0.5. \end{cases}$$

What is logistic regression?

Logistic Regression is a widely known, beginner-level, binary classification algorithm based on the probabilistic approach framework. The idea is simple: consider a dataset

$$\mathcal{D} = \{ (x^{(i)}, y^{(i)}) \}_{i=1}^N, \qquad x^{(i)} \in \mathbb{R}^d, \quad y^{(i)} \in \{0, 1\},$$

and consider a linear function $w^T x^{(i)} + b$, with unknown parameters $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$, over each of the datapoints $x^{(i)}$. This operation can be rewritten as $\hat{w}^T \hat{x}^{(i)}$, with $\hat{w} = (b, w_1, \dots, w_d)^T \in \mathbb{R}^{d+1}$. Consequently, define the extended dataset $\hat{\mathcal{D}}$, computed by adding a $1$ to each datapoint, i.e.

$$\hat{\mathcal{D}} = \{ (\hat{x}^{(i)}, y^{(i)}) \}_{i=1}^N,$$

where

$$\hat{x}^{(i)} = \begin{pmatrix} 1 \\ x^{(i)} \end{pmatrix} \in \mathbb{R}^{d+1}.$$

Then, for any $\hat{x} \in \mathbb{R}^{d+1}$, consider the model

\begin{equation}\label{eq:model_definition}
f_{\hat{w}}(\hat{x}) = \sigma(\hat{w}^T \hat{x}),
\end{equation}

where $\hat{w} \in \mathbb{R}^{d+1}$ is the weight vector, and $\sigma : \mathbb{R} \to (0, 1)$ is the sigmoid function, defined as

$$\sigma(t) = \frac{1}{1 + e^{-t}},$$

whose purpose is to squeeze the real axis into the interval $(0, 1)$ (to get the probabilistic interpretation). Since the output of $f_{\hat{w}}$ is continuous, it is natural to select, as loss function, the Mean Squared Error, as defined previously. Indeed, the loss function becomes

\begin{equation}\label{eq:loss_function}
\ell(\hat{w}) = \frac{1}{N} \sum_{i=1}^N \bigl( f_{\hat{w}}(\hat{x}^{(i)}) - y^{(i)} \bigr)^2,
\end{equation}

and the training procedure is

\begin{equation}\label{eq:training_procedure}
\hat{w}^* = \arg\min_{\hat{w} \in \mathbb{R}^{d+1}} \ell(\hat{w}),
\end{equation}

which can be done by Gradient Descent (or, alternatively, Stochastic Gradient Descent), whose iteration is

$$\hat{w}_{k+1} = \hat{w}_k - \alpha_k \nabla_{\hat{w}} \ell(\hat{w}_k),$$

for which we are required to compute $\nabla_{\hat{w}} \ell(\hat{w})$.

Since

$$\ell(\hat{w}) = \frac{1}{N} \sum_{i=1}^N \bigl( f_{\hat{w}}(\hat{x}^{(i)}) - y^{(i)} \bigr)^2,$$

then

$$\nabla_{\hat{w}} \ell(\hat{w}) = \frac{2}{N} \sum_{i=1}^N \bigl( f_{\hat{w}}(\hat{x}^{(i)}) - y^{(i)} \bigr) \nabla_{\hat{w}} f_{\hat{w}}(\hat{x}^{(i)}).$$

Since $f_{\hat{w}}(\hat{x}^{(i)}) = \sigma(\hat{w}^T \hat{x}^{(i)})$ by \eqref{eq:model_definition}, the chain rule implies that

$$\nabla_{\hat{w}} f_{\hat{w}}(\hat{x}^{(i)}) = \sigma'(\hat{w}^T \hat{x}^{(i)}) \, \hat{x}^{(i)}.$$

An interesting feature of $\sigma$ is that $\sigma'(t) = \sigma(t)(1 - \sigma(t))$, implying that

$$\sigma'(\hat{w}^T \hat{x}^{(i)}) = \sigma(\hat{w}^T \hat{x}^{(i)}) \bigl( 1 - \sigma(\hat{w}^T \hat{x}^{(i)}) \bigr) = f_{\hat{w}}(\hat{x}^{(i)}) \bigl( 1 - f_{\hat{w}}(\hat{x}^{(i)}) \bigr)$$

and

$$\nabla_{\hat{w}} f_{\hat{w}}(\hat{x}^{(i)}) = f_{\hat{w}}(\hat{x}^{(i)}) \bigl( 1 - f_{\hat{w}}(\hat{x}^{(i)}) \bigr) \hat{x}^{(i)},$$

which finally leads to

\begin{equation}\label{eq:final_GD_iteration}
\hat{w}_{k+1} = \hat{w}_k - \frac{2 \alpha_k}{N} \sum_{i=1}^N \bigl( f_{\hat{w}_k}(\hat{x}^{(i)}) - y^{(i)} \bigr) f_{\hat{w}_k}(\hat{x}^{(i)}) \bigl( 1 - f_{\hat{w}_k}(\hat{x}^{(i)}) \bigr) \hat{x}^{(i)},
\end{equation}

which converges to the solution of \eqref{eq:training_procedure} for $k \to \infty$.
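
For completeness, the identity $\sigma'(t) = \sigma(t)(1 - \sigma(t))$ used above can be verified by a direct computation:

$$\sigma'(t) = \frac{d}{dt} \frac{1}{1 + e^{-t}} = \frac{e^{-t}}{(1 + e^{-t})^2} = \frac{1}{1 + e^{-t}} \cdot \frac{e^{-t}}{1 + e^{-t}} = \sigma(t) \bigl( 1 - \sigma(t) \bigr).$$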

Python Implementation

Before we can start with the implementation notes, observe that a few preliminary operations simplify the coding and improve its efficiency. Indeed, given the dataset $\hat{\mathcal{D}} = \{ (\hat{x}^{(i)}, y^{(i)}) \}_{i=1}^N$, define

$$\hat{X} = \bigl[ \hat{x}^{(1)} \mid \dots \mid \hat{x}^{(N)} \bigr] \in \mathbb{R}^{(d+1) \times N}$$

and

$$Y = \bigl( y^{(1)}, \dots, y^{(N)} \bigr)^T \in \mathbb{R}^{N \times 1}.$$

Consequently, we can re-write \eqref{eq:loss_function} as

$$\ell(\hat{w}) = \frac{1}{N} \bigl\| \sigma(\hat{X}^T \hat{w}) - Y \bigr\|_2^2,$$

where $\sigma$ is applied element-wise,

and, by following the same procedure we did before, we get to the compact form of \eqref{eq:final_GD_iteration}:

\begin{equation}\label{eq:final_GD_iteration_compact}
\hat{w}_{k+1} = \hat{w}_k - \frac{2 \alpha_k}{N} \hat{X} \Bigl[ \bigl( \sigma(\hat{X}^T \hat{w}_k) - Y \bigr) \odot \sigma(\hat{X}^T \hat{w}_k) \odot \bigl( 1 - \sigma(\hat{X}^T \hat{w}_k) \bigr) \Bigr],
\end{equation}

where $\odot$ is the element-wise multiplication. To check that the shapes in Equation \eqref{eq:final_GD_iteration_compact} are correct, let’s do a sanity check:

  • $\hat{X}$ has shape $(d+1, N)$, where $d = 2$ and $N = 500$ in this case;
  • $\hat{w}$ has shape $(d+1, 1)$ by definition, then $\hat{X}^T \hat{w}$ has shape $(N, 1)$;
  • $\sigma$ does not affect the shape of the input, then $\sigma(\hat{X}^T \hat{w})$ has shape $(N, 1)$;
  • Both $\sigma(\hat{X}^T \hat{w})$ and $Y$ have shape $(N, 1)$, then $\sigma(\hat{X}^T \hat{w}) - Y$ has shape $(N, 1)$;
  • Consequently, $\bigl( \sigma(\hat{X}^T \hat{w}) - Y \bigr) \odot \sigma(\hat{X}^T \hat{w}) \odot \bigl( 1 - \sigma(\hat{X}^T \hat{w}) \bigr)$ has shape $(N, 1)$ because of the element-wise multiplication;
  • Finally, multiplying by $\hat{X}$ gives shape $(d+1, 1)$, which is the same shape of $\hat{w}$, then the computation is correct.

Equation \eqref{eq:final_GD_iteration_compact} is what we are going to implement.
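
The same sanity check can be reproduced numerically. Here is a small sketch with dummy arrays, where the sizes d = 2 and N = 5 are arbitrary and chosen only for illustration:

# Shape sanity check with dummy arrays
import numpy as np
 
d, N = 2, 5
Xhat = np.random.randn(d + 1, N)          # shape (d+1, N)
Y = np.random.randint(0, 2, size=(N, 1))  # shape (N, 1)
w = np.zeros((d + 1, 1))                  # shape (d+1, 1)
 
s = 1 / (1 + np.exp(-Xhat.T @ w))         # sigma(Xhat^T w), shape (N, 1)
update = Xhat @ ((s - Y) * s * (1 - s))   # shape (d+1, 1), same as w
print(s.shape, update.shape)              # (5, 1) (3, 1)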

Data Loading and Pre-processing

We are going to test logistic regression on a simulated dataset from the library sklearn. Loading it into memory is very easy:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
 
# Load data
X, Y = make_classification(n_samples=500, n_features=2, n_redundant=0, n_informative=1, 
                            n_clusters_per_class=1)
X = X.T # To make it d x N
 
# Check the shape
print(f"Shape of X: {X.shape}")
print(f"Shape of Y: {Y.shape}")
 
# Memorize the shape
d, N = X.shape
 
# Add dimension on Y
Y = Y.reshape((N, 1))

To conclude the data loading step, we need to build the dataset $\hat{X}$ from above. This can be simply done by

# Create Xhat
Xhat = np.concatenate((np.ones((1, N)), X), axis=0) # Shape: (d+1, N)

Build the model

Remember that

$$f_{\hat{w}}(\hat{x}) = \sigma(\hat{w}^T \hat{x}),$$

thus, we need to define the sigmoid function $\sigma(t) = \frac{1}{1 + e^{-t}}$.

# Define sigmoid
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

The result of the application of $f_{\hat{w}}$ on $\hat{X}$ is not hard to compute.

# Compute the value of f
def f(w, xhat):
    return sigmoid(xhat.T @ w)

Training

To perform the training, it is sufficient to implement the loss function and its gradient, which can be simply done by following the formulas above:

# Value of the loss
def ell(w, X, Y):
    return np.mean((f(w, X) - Y) ** 2)
 
# Value of the gradient
def grad_ell(w, X, Y):
    pred = f(w, X)
    return (2 / Y.shape[0]) * X @ ((pred - Y) * pred * (1 - pred))
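
Gradient Descent itself is a simple loop over \eqref{eq:final_GD_iteration_compact}. Here is a minimal sketch, where the step size alpha, the number of iterations n_iter, and the zero initialization of w are illustrative choices:

# Gradient Descent loop (illustrative hyperparameters)
alpha = 0.1    # step size
n_iter = 5000  # number of iterations
 
w = np.zeros((d + 1, 1))  # initial weights
for k in range(n_iter):
    w = w - alpha * grad_ell(w, Xhat, Y)
 
print(f"Final loss: {ell(w, Xhat, Y):.4f}")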

After training, the prediction over new data can be simply done by

def predict(w, X, threshold=0.5):
    return (f(w, X) >= threshold).astype(int)
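
As a quick check of the whole pipeline, the accuracy on the training data itself can be computed as follows (a sketch, assuming the trained weights w from the Gradient Descent loop above):

# Fraction of correctly classified training points
accuracy = np.mean(predict(w, Xhat) == Y)
print(f"Training accuracy: {accuracy:.2%}")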

Results

The result is a $(d+1)$-ple of weights $\hat{w}^* = (w_0, w_1, w_2)^T$ such that $f_{\hat{w}^*}(\hat{x}^{(i)}) \approx y^{(i)}$. When $d = 2$, the term inside of the sigmoid is the equation of a straight line $w_0 + w_1 x_1 + w_2 x_2 = 0$, which can be re-written in explicit form as

$$x_2 = -\frac{w_0 + w_1 x_1}{w_2},$$

which is the decision boundary for the classification problem. Below is a plot of the resulting classifier.
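
A minimal matplotlib sketch for producing such a plot (assuming the trained weights are stored in w, as in the Gradient Descent loop above, and that w[2] is nonzero):

# Scatter the data colored by class and overlay the decision boundary
x1 = np.linspace(X[0].min(), X[0].max(), 100)
x2 = -(w[0] + w[1] * x1) / w[2]   # explicit form of the boundary
 
plt.scatter(X[0], X[1], c=Y.flatten())
plt.plot(x1, x2, 'r')
plt.xlabel("x1")
plt.ylabel("x2")
plt.title("Logistic regression decision boundary")
plt.show()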

Extension to the case $K > 2$

If the number of classes is bigger than 2, logistic regression can still be applied, with slight modifications. In particular, let $K > 2$ be the number of classes. Then, we define $f_W(\hat{x})$ to be a $K \times 1$ matrix,

$$f_W(\hat{x}) = \begin{pmatrix} p_1 \\ \vdots \\ p_K \end{pmatrix},$$

where $(p_1, \dots, p_K)$ is a vector of probabilities (i.e. $p_j \in [0, 1]$ for any $j = 1, \dots, K$ and any $\hat{x}$) summing to one, i.e.

\begin{equation}\label{eq:summing_to_one}
\sum_{j=1}^K p_j = 1.
\end{equation}

Intuitively, $p_j$ represents the probability that $\hat{x}$ is a member of the class $j$. To enforce the condition \eqref{eq:summing_to_one}, it is mandatory to set

$$f_W(\hat{x}) = \mathrm{softmax}(W \hat{x}),$$

where $\mathrm{softmax} : \mathbb{R}^K \to (0, 1)^K$ is defined as

$$\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^K e^{z_k}}, \qquad j = 1, \dots, K,$$

such that

$$\sum_{j=1}^K \mathrm{softmax}(z)_j = 1.$$

Note that, in the multi-class scenario, the weights are matrices, so that $W$ is a $K \times (d+1)$ matrix.
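
A minimal numpy sketch of this construction (the names softmax and f_multiclass, and the random weight matrix W of shape $K \times (d+1)$, are illustrative only):

# Softmax of a score vector z of length K
def softmax(z):
    e = np.exp(z - np.max(z))  # subtracting the max improves numerical stability
    return e / np.sum(e)
 
# Multi-class model: W has shape (K, d+1), xhat has shape (d+1,)
def f_multiclass(W, xhat):
    return softmax(W @ xhat)
 
# The output is a vector of K probabilities summing to one
K = 3
W = np.random.randn(K, d + 1)
p = f_multiclass(W, Xhat[:, 0])
print(p, p.sum())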