
Definition: Machine Learning (ML) is the set of all the techniques and algorithms able to extract knowledge from data, and to use that knowledge to make accurate predictions.

From this definition, it is clear that a good Machine Learning algorithm is always developed by following a few steps:

  • Understanding: Understand the task (e.g. what do we need? what information are we able to collect to answer the question we are asking?);
  • Collection: Collect a big set of data, containing enough information to achieve the task above;
  • Design: Design the Machine Learning algorithm, based on the knowledge we have of the studied problem;
  • Training: Train the algorithm on the collected data, trying to minimize the prediction error on the given dataset;
  • Tuning: Possibly tune some parameters of the model (a ML algorithm is usually referred to as a model) to improve the predictions;
  • Testing: Test the algorithm on new data, verifying its ability to make predictions.

We’re going to investigate each of those steps more deeply in the following.

Understanding

Assume we want to solve a given problem. Mathematically, the problem we aim to solve can be modelled as an (unknown) function $f$, taking as input a vector $x$ containing the information we are able to collect and mapping it (possibly stochastically) to the target $y$. When this is the case, $x$ is usually called the input vector or, alternatively, the feature vector, while $y$ is the target (equivalently, label or output).

Solving the problem means being able to approximate $f$ as well as possible with a model (that we will always indicate as $f_\theta$, $\theta$ being the set of parameters defining it), such that

$$ f_\theta(x) \approx y. $$

Question 1: Is it learnable?

A problem can be solved by a ML algorithm if and only if there exists a relationship between $x$ and $y$. For example, we cannot expect to predict the future weather at a particular location by using information about the stock price of a particular company. In that situation, the input and the output are clearly independent, and there is no chance of learning anything about one using the other.

Consequently, the first step in designing a ML algorithm is to understand whether there exists a correlation between the input and the output of the given problem. When this is the case, we say that the problem is learnable.

Machine Learning is about understanding correlations (patterns).
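As a rough sketch of this idea, we can measure the linear correlation between a candidate input and the target with numpy (the data below are invented for illustration; correlation only captures linear patterns, so it is a proxy for learnability, not a proof):

```python
import numpy as np

# Toy learnability check: compare the correlation between an input and
# a target that depends on it, versus a target that is independent of it.
rng = np.random.default_rng(0)

x = rng.normal(size=1000)                               # candidate input feature
y_related = 2.0 * x + rng.normal(scale=0.1, size=1000)  # depends on x
y_unrelated = rng.normal(size=1000)                     # independent of x

corr_related = np.corrcoef(x, y_related)[0, 1]
corr_unrelated = np.corrcoef(x, y_unrelated)[0, 1]

print(corr_related)    # close to 1: there is something to learn from x
print(corr_unrelated)  # close to 0: nothing to learn from x alone
```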

Question 2: Is it possible to collect the data?

Assume that the problem is learnable. We then need to understand whether we can physically collect enough data to discover the relationship between $x$ and its corresponding $y$.

For example, if we want to use ML to make cancer diagnoses on patients, the best input would arguably be the clinical results of every possible medical exam on the patient. However, even if this would work well in practice, it is not feasible (and especially not ethical) to subject a patient to thousands of exams for a single diagnosis.

Moreover, to train a good ML model we will need thousands (sometimes millions) of datapoints, and it is not always possible to scale our problem so that enough data can be collected to solve it.

Collecting data requires efficiency and scalability of the problem.

Collection

Collecting data is usually the hardest part in the design of a Machine Learning product. In fact, even if our problem is solvable and it is theoretically possible to collect enough data about it, doing so is not always easy in practice.

In particular, some data require time to be collected (for example, in biological or medical applications), and collecting good-quality data is hard. Indeed, we ideally want to work with a clean dataset, where all the information is present, there are no missing values (usually referred to as NaN), and the data do not contain noise. Most of the time this is hopeless, and we will need algorithms to standardize and clean up the data. The set of all those techniques is called data cleaning, and its study is beyond the scope of this course.

Kaggle

Luckily, for most of the tasks you can think of, you can find datasets on the internet. For example, websites like Kaggle and Google Datasets can be helpful for that.

Data loading with pandas

At the end of the introductory post we introduced the Python library pandas, useful to work with data.

In particular, most data can be found in the .csv format, and pandas contains functions to read .csv files and work with them. Please refer to the introductory post for more information about it.
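As a minimal sketch (the column names and values below are invented; with a real file you would pass its path to pd.read_csv instead of the in-memory string):

```python
import io
import pandas as pd

# Load a small .csv dataset; here an in-memory string stands in for a file.
csv_data = io.StringIO(
    "temperature,humidity,climate\n"
    "21.5,0.30,sunny\n"
    "18.0,0.80,raining\n"
    "19.2,0.65,cloudy\n"
)

df = pd.read_csv(csv_data)
print(df.shape)             # (3, 3): three datapoints, three columns
print(df.columns.tolist())  # ['temperature', 'humidity', 'climate']
```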

Datasets and numpy arrays

.csv datasets are great: working with them, we always have all the information correctly labeled and in place. Unfortunately, from a mathematical point of view, this is a really sub-optimal way of working with data. In particular, working with strings is usually a pain, so it is mandatory to set up an algorithm converting strings into numbers (an encoding algorithm); moreover, column and row names are unnecessary when designing learning algorithms.

Consequently, we will always convert datasets into matrices (into the form of numpy arrays), before starting working with them. This is performed by two successive steps:

  1. Encoding strings into numbers.
  2. Converting the resulting dataset into a numpy array.

The idea behind encoding algorithms is that, in a dataset, the set of possible values a string can take is limited (e.g. in a dataset containing weather information, the climate can be one of {raining, sunny, cloudy, snowy}, thus we have only 4 possible values for the string). Consequently, the idea is to consider each one of the possible values as a class.

Assume our dataset has $K$ classes for a specific feature, and let $C = \{c_1, \dots, c_K\}$ be the set of all the classes. Then, there are two mainly used encoding algorithms:

  • Integer encoding: Each class $c_k$, $k = 1, \dots, K$, is simply mapped to its index $k$ (Warning: this method creates a usually unintended ordering on the classes, i.e. $c_i < c_j$ if $i < j$). In Python, this is implemented by sklearn.preprocessing.LabelEncoder() from sklearn, a famous library for ML operations.
  • One-hot encoding: Each class $c_k$ is mapped to the $K$-dimensional canonical vector $e_k$, where $e_k$ is a vector of all zeros except for the $k$-th element, which is a 1 (Advantage: this way we can define the concept of being partially in a class). In Python, this is implemented by sklearn.preprocessing.OneHotEncoder().

After the encoding step, the dataset is simply converted to a numpy array with the np.array() function.
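A minimal sketch of both encoders on a toy climate feature (the data are invented; note that sklearn orders the class indices alphabetically by class name):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

climate = ["sunny", "raining", "cloudy", "sunny", "snowy"]  # K = 4 classes

# Integer encoding: each class is mapped to its index (alphabetical order,
# so cloudy -> 0, raining -> 1, snowy -> 2, sunny -> 3).
climate_int = LabelEncoder().fit_transform(climate)
print(climate_int)       # [3 1 0 3 2]

# One-hot encoding: each class becomes a K-dimensional canonical vector.
# OneHotEncoder expects a 2D array of shape (n_samples, n_features);
# .toarray() converts the sparse result to a dense numpy array.
climate_oh = OneHotEncoder().fit_transform(
    np.array(climate).reshape(-1, 1)
).toarray()
print(climate_oh.shape)  # (5, 4): 5 datapoints, 4 classes
print(climate_oh[0])     # [0. 0. 0. 1.]: "sunny" is class index 3
```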

The result of this procedure is a matrix

$$ X = \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(N)} \end{bmatrix} \in \mathbb{R}^{d \times N}, $$

where each column $x^{(i)} \in \mathbb{R}^d$ represents a datapoint with $d$ features and $N$ is the number of datapoints. The corresponding labels for each datapoint are collected into a vector

$$ Y = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(N)} \end{bmatrix}. $$

Design

Designing a ML model is hard and beyond the scope of this course. For us, it is sufficient to understand the two main categories into which algorithms are divided: supervised and unsupervised learning.

Supervised Learning

In Supervised Learning (SL), we are given a dataset composed of a set of inputs $X$ and the corresponding labels $Y$. The idea of SL techniques is to use the information contained in $X$ and $Y$ to learn structures in the data such that, after the training, the model $f_\theta$ can estimate a new value of $y$ given a new $x$.
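A minimal supervised-learning sketch with sklearn (the data and the choice of a nearest-neighbor classifier are just for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# The model sees inputs X together with their labels Y during training,
# then predicts the label of a new, unseen input.
X = np.array([[0.0], [0.2], [0.9], [1.1]])  # 4 datapoints, 1 feature each
Y = np.array([0, 0, 1, 1])                  # known labels

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X, Y)                             # training phase

print(model.predict(np.array([[1.0]])))     # [1]: closest to the class-1 points
```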

Unsupervised Learning

In Unsupervised Learning (UL), we are given a dataset composed of only the inputs $X$, without any corresponding labels. The task of UL techniques is to learn the patterns present in the data, with the intent of classifying a new datum by retrieving those patterns.
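A minimal unsupervised-learning sketch with sklearn's KMeans (the toy data are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Only the inputs are given, with no labels: the algorithm groups the
# datapoints by the patterns (here, spatial proximity) it finds.
X = np.array([[0.0], [0.1], [5.0], [5.2]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # the two left points share one label, the two right points the other
```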

Training

Training is the easiest part in the design of ML algorithms. Here, we just use the information contained in the data to let our model learn the patterns required to make accurate predictions. Since we are going to run an experiment soon, it will become clearer how everything works.

Tuning

Every ML algorithm has a limited number of parameters the user has to set. Generally, those parameters change the flexibility of the model, making it more or less flexible depending on the task.

Tuning those parameters is important to improve the accuracy of the algorithm. This is mainly a trial-and-error procedure: the user tries changing the parameters (ideally, with some knowledge of what they do), retrains the model, checks the performance, and changes the parameters again, until the model achieves good results.

The concept of flexibility is strongly related to the concepts of overfitting and underfitting.
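The trial-and-error loop can be sketched as follows (the data are invented; the tuned parameter here is the number of neighbors of a KNN classifier, which controls its flexibility):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Invented, learnable toy data: the label depends on the sum of the features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out part of the data to evaluate each parameter choice.
X_tr, X_val, Y_tr, Y_val = train_test_split(X, Y, test_size=0.3, random_state=0)

best_k, best_acc = None, -1.0
for k in [1, 3, 5, 11]:                       # candidate parameter values
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, Y_tr)
    acc = model.score(X_val, Y_val)           # fraction of correct predictions
    if acc > best_acc:
        best_k, best_acc = k, acc

print(best_k, best_acc)
```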

Testing

Testing the prediction ability of a ML model on the same dataset on which it has been trained is unfair. Indeed, on those data the model has already observed the real outcome, and a model performing well on the training set may simply have memorized every datum in the set without learning any structure. For that reason, it is important to keep a portion of the dataset unused during the Training and Tuning phases, to be used to test the model. In particular, when we have $N$ available data, it is common to select a number $N_{train} < N$, randomly extract $N_{train}$ samples from the dataset, and use only those data for the training and tuning. The remaining $N_{test} = N - N_{train}$ data can be used to test the model.
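The random split can be sketched as follows (toy data; columns of $X$ are datapoints, and in practice sklearn's train_test_split offers the same functionality):

```python
import numpy as np

# Randomly split N datapoints into N_train for training/tuning and the
# rest for testing. Columns of X are datapoints (toy data invented here).
rng = np.random.default_rng(42)

N, d = 100, 3
X = rng.normal(size=(d, N))
Y = rng.integers(0, 2, size=N)

N_train = 80
perm = rng.permutation(N)                   # random order of the indices
train_idx, test_idx = perm[:N_train], perm[N_train:]

X_train, Y_train = X[:, train_idx], Y[train_idx]
X_test, Y_test = X[:, test_idx], Y[test_idx]

print(X_train.shape, X_test.shape)          # (3, 80) (3, 20)
```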

Testing usually happens by choosing an accuracy function $a$ and evaluating the mean value of $a$ over the test set, where $a$ is computed between the prediction $f_\theta(x)$ of the trained model and the true label $y$ for the same datum $x$.

For example, in the clustering example we are going to investigate, $f$ could be the function associating each point to the corresponding cluster, while $f_\theta$ maps the input data to an estimate of its potential cluster. When this is the case, we can define the accuracy of the model as the fraction of datapoints mapped to the correct cluster. In particular, if $y = f(x)$ is the cluster associated with $x$, then for any $x$,

$$ a(f_\theta(x), y) = \begin{cases} 1 & \text{if } f_\theta(x) = y, \\ 0 & \text{otherwise}. \end{cases} $$

If $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N_{test}}$ is the test set (as defined in the section above), then the accuracy of the model will be

$$ \mathrm{Acc}(f_\theta) = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} a(f_\theta(x^{(i)}), y^{(i)}), $$

while its complement $1 - \mathrm{Acc}(f_\theta)$ is usually referred to as the misclassification rate. We are going to implement that in Python in the following.
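A minimal sketch of this computation on invented predictions and labels:

```python
import numpy as np

# Accuracy = fraction of correct predictions on the test set;
# the misclassification rate is its complement.
y_true = np.array([0, 1, 1, 0, 2, 2])  # true clusters (toy data)
y_pred = np.array([0, 1, 0, 0, 2, 1])  # model predictions

accuracy = (y_pred == y_true).mean()
misclassification_rate = 1.0 - accuracy

print(accuracy)                # 0.666...: 4 of 6 datapoints correct
print(misclassification_rate)  # 0.333...: 2 of 6 datapoints wrong
```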