Binary Classification
Assume we are working in a binary classification setup, i.e. where the number of classes is $2$. In this case, we can always represent the two classes as $0$ and $1$.
Consequently, we can develop a classification algorithm by defining a model $f_\theta$ such that $f_\theta(x) \in [0, 1]$ and interpreting the outcome of $f_\theta(x)$ as the probability that $x$ belongs to class $1$. Thus, if $f_\theta(x) = 0$, then our model predicts that there are zero chances of $x$ being classified as $1$, meaning that $x$ should be a $0$. If $f_\theta(x) = 1$, then $x$ will be classified as $1$, while if $f_\theta(x) = 0.5$ then the model is not able to identify the class of $x$, since the chances of $x$ being a $1$ are 50%.
An interesting feature of $f_\theta$ is that it turns a discontinuous task (i.e. mapping a real vector $x \in \mathbb{R}^d$ to a discrete value $y \in \{0, 1\}$) into a continuous problem. The discreteness of the outcome is then recovered by mapping $x$ to $1$ if $f_\theta(x) \geq 0.5$, and to $0$ if $f_\theta(x) < 0.5$. In formulas, a datapoint $x$ will be classified as

$$\hat{y} = \begin{cases} 1 & \text{if } f_\theta(x) \geq 0.5, \\ 0 & \text{if } f_\theta(x) < 0.5. \end{cases}$$
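For instance, a small NumPy sketch of this thresholding rule (the probability values below are made up for illustration):

```python
import numpy as np

# Hypothetical probabilities produced by a model f_theta on three datapoints
probs = np.array([0.08, 0.51, 0.97])

# Recover the discrete classes by thresholding at 0.5
labels = (probs >= 0.5).astype(int)
print(labels)  # [0 1 1]
```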
What is logistic regression?
Logistic Regression is a widely known, beginner-level, binary classification algorithm based on a probabilistic approach. The idea is simple: consider a dataset

$$\mathcal{D} = \{ (x^{(i)}, y^{(i)}) \}_{i=1}^N, \qquad x^{(i)} \in \mathbb{R}^d, \quad y^{(i)} \in \{ 0, 1 \},$$

and consider a linear function $w_1 x^{(i)}_1 + \dots + w_d x^{(i)}_d + b$ with unknown parameters $w_1, \dots, w_d, b \in \mathbb{R}$ over each of the datapoints $x^{(i)}$. This operation can be rewritten as the scalar product $w^T \hat{x}^{(i)}$ with $w = (w_1, \dots, w_d, b)^T$. Consequently, define the extended dataset $\hat{\mathcal{D}}$, computed by adding a $1$ to each datapoint, i.e.

$$\hat{\mathcal{D}} = \{ (\hat{x}^{(i)}, y^{(i)}) \}_{i=1}^N,$$

where

$$\hat{x}^{(i)} = (x^{(i)}_1, \dots, x^{(i)}_d, 1)^T \in \mathbb{R}^{d+1}.$$
Then, for any $\hat{x} \in \mathbb{R}^{d+1}$, consider the model

$$f_w(\hat{x}) = \sigma(w^T \hat{x}), \label{eq:model_definition}$$
where $w \in \mathbb{R}^{d+1}$ is the weight vector, and $\sigma: \mathbb{R} \to (0, 1)$ is the sigmoid function, defined as

$$\sigma(z) = \frac{1}{1 + e^{-z}},$$
whose purpose is to squeeze the real axis into the interval $(0, 1)$ (to get the probabilistic interpretation). Since the output of $f_w$ is continuous, it is natural to select, as loss function, the Mean Squared Error, as defined previously. Indeed, the loss function becomes

$$\mathcal{L}(w) = \frac{1}{N} \sum_{i=1}^N \big( f_w(\hat{x}^{(i)}) - y^{(i)} \big)^2, \label{eq:loss_function}$$
and the training procedure is

$$w^* = \arg\min_{w \in \mathbb{R}^{d+1}} \mathcal{L}(w), \label{eq:training_procedure}$$
which can be done by Gradient Descent (or, alternatively, Stochastic Gradient Descent), whose iteration is

$$w^{(k+1)} = w^{(k)} - \alpha \nabla_w \mathcal{L}(w^{(k)}),$$
for which we are required to compute $\nabla_w \mathcal{L}(w)$.
Since

$$\mathcal{L}(w) = \frac{1}{N} \sum_{i=1}^N \big( f_w(\hat{x}^{(i)}) - y^{(i)} \big)^2,$$

then

$$\nabla_w \mathcal{L}(w) = \frac{2}{N} \sum_{i=1}^N \big( f_w(\hat{x}^{(i)}) - y^{(i)} \big) \nabla_w f_w(\hat{x}^{(i)}).$$

Since $f_w(\hat{x}^{(i)}) = \sigma(w^T \hat{x}^{(i)})$ by \eqref{eq:model_definition}, the chain rule implies that

$$\nabla_w f_w(\hat{x}^{(i)}) = \sigma'(w^T \hat{x}^{(i)}) \, \hat{x}^{(i)}.$$

An interesting feature of $\sigma$ is that $\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big)$, implying that

$$\nabla_w f_w(\hat{x}^{(i)}) = \sigma(w^T \hat{x}^{(i)}) \big( 1 - \sigma(w^T \hat{x}^{(i)}) \big) \hat{x}^{(i)},$$

and

$$\nabla_w \mathcal{L}(w) = \frac{2}{N} \sum_{i=1}^N \big( \sigma(w^T \hat{x}^{(i)}) - y^{(i)} \big) \, \sigma(w^T \hat{x}^{(i)}) \big( 1 - \sigma(w^T \hat{x}^{(i)}) \big) \hat{x}^{(i)},$$

which finally leads to

$$w^{(k+1)} = w^{(k)} - \frac{2 \alpha}{N} \sum_{i=1}^N \big( \sigma( (w^{(k)})^T \hat{x}^{(i)} ) - y^{(i)} \big) \, \sigma( (w^{(k)})^T \hat{x}^{(i)} ) \big( 1 - \sigma( (w^{(k)})^T \hat{x}^{(i)} ) \big) \hat{x}^{(i)}, \label{eq:final_GD_iteration}$$

which converges to the solution of \eqref{eq:training_procedure} for $k \to \infty$.
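For completeness, the identity $\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big)$ used above follows directly from the definition of $\sigma$:

$$\sigma'(z) = \frac{d}{dz} \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z) \big( 1 - \sigma(z) \big),$$

since $\frac{e^{-z}}{1 + e^{-z}} = 1 - \frac{1}{1 + e^{-z}} = 1 - \sigma(z)$.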
Python Implementation
Before we can start with the implementation, observe that a few operations simplify the coding and make it more efficient. Indeed, given the dataset $\hat{\mathcal{D}} = \{ (\hat{x}^{(i)}, y^{(i)}) \}_{i=1}^N$, define

$$\hat{X} = \begin{bmatrix} (\hat{x}^{(1)})^T \\ \vdots \\ (\hat{x}^{(N)})^T \end{bmatrix} \in \mathbb{R}^{N \times (d+1)}$$

and

$$Y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix} \in \mathbb{R}^{N \times 1}.$$
Consequently, we can re-write \eqref{eq:loss_function} as

$$\mathcal{L}(w) = \frac{1}{N} \big\| \sigma(\hat{X} w) - Y \big\|_2^2,$$

where $\sigma$ is applied element-wise,
and, by following the same procedure we did before, we get to the compact form of \eqref{eq:final_GD_iteration}:

$$w^{(k+1)} = w^{(k)} - \frac{2 \alpha}{N} \hat{X}^T \Big[ \big( \sigma(\hat{X} w^{(k)}) - Y \big) \odot \sigma(\hat{X} w^{(k)}) \odot \big( 1 - \sigma(\hat{X} w^{(k)}) \big) \Big], \label{eq:final_GD_iteration_compact}$$

where $\odot$ is the element-wise multiplication. To check that the shapes in Equation \eqref{eq:final_GD_iteration_compact} are correct, let’s do a sanity check:
- $\hat{X}$ has shape $(N, d')$, where $d' = d + 1$ in this case;
- $w^{(k)}$ has shape $(d', 1)$ by definition, then $\hat{X} w^{(k)}$ has shape $(N, 1)$;
- $\sigma$ does not affect the shape of the input, then $\sigma(\hat{X} w^{(k)})$ has shape $(N, 1)$;
- Both $Y$ and $\sigma(\hat{X} w^{(k)})$ have shape $(N, 1)$, then $\sigma(\hat{X} w^{(k)}) - Y$ has shape $(N, 1)$.
- Consequently, $\big( \sigma(\hat{X} w^{(k)}) - Y \big) \odot \sigma(\hat{X} w^{(k)}) \odot \big( 1 - \sigma(\hat{X} w^{(k)}) \big)$ has shape $(N, 1)$ because of the element-wise multiplication;
- Finally, $\hat{X}^T \Big[ \big( \sigma(\hat{X} w^{(k)}) - Y \big) \odot \sigma(\hat{X} w^{(k)}) \odot \big( 1 - \sigma(\hat{X} w^{(k)}) \big) \Big]$ has shape $(d', 1)$, which is the same shape as $w^{(k)}$, then the computation is correct.
Equation \eqref{eq:final_GD_iteration_compact} is what we are going to implement.
Data Loading and Pre-processing
We are going to test logistic regression on a simulated dataset from the library sklearn. Loading it into memory is very easy:
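A minimal sketch using sklearn.datasets.make_classification (the specific generator and its parameters here are assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification

# Generate a simple 2D binary classification dataset
X, y = make_classification(
    n_samples=200,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    random_state=42,
)
print(X.shape, y.shape)  # (200, 2) (200,)
```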
To conclude the data loading step, we need to build the extended dataset $\hat{X}$ from above. This can simply be done by
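For instance, continuing from the previous snippet (variable names are assumptions):

```python
# Append a column of ones to each datapoint to obtain the extended dataset X_hat
N, d = X.shape
X_hat = np.concatenate([X, np.ones((N, 1))], axis=1)  # shape (N, d + 1)
Y = y.reshape(N, 1)                                    # shape (N, 1)
```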
Build the model
Remember that

$$f_w(\hat{x}) = \sigma(w^T \hat{x}),$$

thus, we need to define the sigmoid function $\sigma$.
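A possible NumPy sketch (the function name is an assumption):

```python
def sigmoid(z):
    # Squeezes every real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))
```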
The result of applying $f_w$ to $\hat{X}$ is then not hard to compute:
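For instance (a sketch, reusing the sigmoid defined above):

```python
def f(w, X_hat):
    # Model output: probability of class 1 for every datapoint, shape (N, 1)
    return sigmoid(X_hat @ w)
```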
Training
To perform the training, it is sufficient to implement the loss function and its gradient, which can be done simply by following the formulas above:
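A minimal sketch of the loss, its gradient in the compact form \eqref{eq:final_GD_iteration_compact}, and a plain gradient descent loop; the learning rate and number of iterations below are arbitrary choices:

```python
def loss(w, X_hat, Y):
    # Mean Squared Error between predictions and labels
    return np.mean((f(w, X_hat) - Y) ** 2)

def grad_loss(w, X_hat, Y):
    # Gradient of the MSE loss in the compact, vectorized form
    s = f(w, X_hat)
    return (2 / X_hat.shape[0]) * X_hat.T @ ((s - Y) * s * (1 - s))

def train(X_hat, Y, alpha=0.1, n_iter=5000):
    # Plain gradient descent starting from zero weights
    w = np.zeros((X_hat.shape[1], 1))
    for _ in range(n_iter):
        w = w - alpha * grad_loss(w, X_hat, Y)
    return w

w = train(X_hat, Y)
```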
After training, prediction on new data can simply be done by
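For example, a sketch thresholding the predicted probabilities at $0.5$ (the helper name predict is an assumption):

```python
def predict(w, X_new):
    # Extend the new datapoints with a column of ones, then threshold at 0.5
    X_new_hat = np.concatenate([X_new, np.ones((X_new.shape[0], 1))], axis=1)
    return (f(w, X_new_hat) >= 0.5).astype(int)

y_pred = predict(w, X)
print("Training accuracy:", np.mean(y_pred.flatten() == y))
```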
Results
The result is a $(d+1)$-tuple of weights $w^* = (w_1, \dots, w_d, b)$ such that $f_{w^*}(\hat{x}^{(i)}) \approx y^{(i)}$. When $d = 2$, setting the term inside the sigmoid to zero, $w_1 x_1 + w_2 x_2 + b = 0$, gives the equation of a straight line, which can be re-written in explicit form as

$$x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2},$$

which is the decision boundary for the classification problem. Here is a plot of the resulting classifier.
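A plot like this can be produced with matplotlib; the following is a sketch with arbitrary styling choices:

```python
import matplotlib.pyplot as plt

# Scatter plot of the two classes together with the learned decision boundary
w1, w2, b = w.flatten()
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2 = -(w1 / w2) * x1 - b / w2

plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolors="k")
plt.plot(x1, x2, "k--", label="decision boundary")
plt.legend()
plt.show()
```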
Extension to the case $K > 2$
If the number of classes is bigger than 2, logistic regression can still be applied, with slight modifications. In particular, let $K$ be the number of classes, $K > 2$. Then, we define the model output over the dataset to be an $N \times K$ matrix,

$$f_W(\hat{X}) = \begin{bmatrix} f_W(\hat{x}^{(1)})^T \\ \vdots \\ f_W(\hat{x}^{(N)})^T \end{bmatrix},$$

where each $f_W(\hat{x}^{(i)}) \in [0, 1]^K$ is a vector of probabilities (i.e. $0 \leq f_W(\hat{x}^{(i)})_j \leq 1$ for any $i$ and $j$) summing to one, i.e.

$$\sum_{j=1}^{K} f_W(\hat{x}^{(i)})_j = 1 \qquad \text{for every } i = 1, \dots, N. \label{eq:summing_to_one}$$

Intuitively, $f_W(\hat{x}^{(i)})_j$ represents the probability that $x^{(i)}$ is a member of the class $j$. To enforce the condition \eqref{eq:summing_to_one}, it is mandatory to set

$$f_W(\hat{x}) = \mathrm{softmax}(W^T \hat{x}),$$

where $\mathrm{softmax}: \mathbb{R}^K \to (0, 1)^K$ is defined as

$$\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \dots, K,$$

such that

$$\sum_{j=1}^{K} \mathrm{softmax}(z)_j = 1 \qquad \text{for any } z \in \mathbb{R}^K.$$
Note that, in the multi-class scenario, the weights are matrices, so that $W$ is a $(d+1) \times K$ matrix.
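As a sketch of how the multi-class model could be implemented in NumPy (function names are assumptions):

```python
def softmax(Z):
    # Row-wise softmax: each row of scores becomes a probability vector summing to one
    E = np.exp(Z - Z.max(axis=1, keepdims=True))  # subtracting the max improves numerical stability
    return E / E.sum(axis=1, keepdims=True)

def predict_multiclass(W, X_hat):
    # W has shape (d + 1, K): one column of weights per class
    P = softmax(X_hat @ W)     # shape (N, K)
    return P.argmax(axis=1)    # predicted class = index of the largest probability
```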