Visualizing Dyads

Consider an image from skimage.data. For simplicity, say that $A$ is the matrix representing that image. You are asked to visualize the dyads of the SVD decomposition of $A$ and the result of compressing the image via SVD. In particular:

  • Load the image into memory and compute its SVD $A = U \Sigma V^T$;
  • Visualize some of the dyads $u_i v_i^T$ of this decomposition. What do you notice?
  • Plot the singular values of $A$. Do you notice anything?
  • Visualize the $k$-rank approximation $A_k$ of $A$ for different values of $k$. What do you observe?
  • Compute and plot the approximation error $\| A - A_k \|$ for increasing values of $k$, where $A_k$ is the $k$-rank approximation of $A$.
  • Plot the compression factor $c_k$ (the ratio between the memory required by the rank-$k$ factors and by the full image) for increasing $k$. What is the approximation error when the compressed image requires the same amount of information as the uncompressed image (i.e., $c_k = 1$)?
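A minimal sketch of these steps is given below. The choice of the camera image from skimage.data, the Frobenius norm for the approximation error, and the formula $c_k = k(m+n+1)/(mn)$ for the compression factor are assumptions made for illustration, not requirements of the exercise.

```python
# Sketch only: image, norm, and compression-factor formula are assumptions.
import numpy as np
import matplotlib.pyplot as plt
from skimage import data

A = data.camera().astype(float)            # grayscale image as an m x n matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
m, n = A.shape

# Visualize a few dyads u_i v_i^T (scaled by sigma_i)
for i in [0, 1, 5, 20]:
    dyad = s[i] * np.outer(U[:, i], Vt[i, :])
    plt.imshow(dyad, cmap='gray'); plt.title(f'dyad {i}'); plt.show()

# Singular values (log scale makes their decay visible)
plt.semilogy(s); plt.title('singular values'); plt.show()

# k-rank approximations, approximation error, and compression factor
ks = [1, 5, 10, 20, 50, 100]
errors, compression = [], []
for k in ks:
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    errors.append(np.linalg.norm(A - A_k, 'fro'))
    compression.append(k * (m + n + 1) / (m * n))
    plt.imshow(A_k, cmap='gray'); plt.title(f'k = {k}'); plt.show()

plt.plot(ks, errors, 'o-'); plt.title('approximation error'); plt.show()
plt.plot(ks, compression, 'o-'); plt.title('compression factor'); plt.show()
```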

Classification of MNIST Digits with SVD Decomposition

The task for this exercise is to learn to classify MNIST digits by using the SVD decomposition. Remember that, given a matrix $X$ and its SVD decomposition $X = U \Sigma V^T$, we can prove that an orthonormal basis for the space of the columns of $X$ is given by the first $r$ columns of the matrix $U$, where $r$ is equal to the number of non-zero singular values of $X$.

We will make use of the space spanned by the columns of $U$ and of the following theorem:

THEOREM 1

Let $W$ be a subspace of $\mathbb{R}^n$ with $\dim(W) = k \leq n$, and let $\{w_1, \dots, w_k\}$ be an orthonormal basis of $W$. Given a generic $x \in \mathbb{R}^n$, the projection of $x$ onto $W$ has the following form:

$$\pi_W(x) = \sum_{i=1}^{k} (w_i^T x)\, w_i.$$

COROLLARY 1.1.

If $X$ is a given matrix with SVD decomposition $X = U \Sigma V^T$ and rank $r$, since $r$ is the dimension of the space defined by the columns of $X$ and the first $r$ columns of $U$, $u_1, \dots, u_r$, are an orthonormal basis for that space, the projection of an $n$-dimensional vector $x$ onto this space can be easily computed as:

$$\pi_X(x) = \sum_{i=1}^{r} (u_i^T x)\, u_i = U_r U_r^T x,$$

where $U_r$ is the matrix formed by the first $r$ columns of $U$.
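In code, the projection of the corollary is just two matrix–vector products. A small sketch (the names Ur and x are illustrative):

```python
import numpy as np

def project(Ur, x):
    # Ur: matrix whose columns are the first r columns of U (an orthonormal basis)
    # x:  a generic vector of matching dimension
    # Projection onto the column space of X: pi(x) = Ur (Ur^T x)
    return Ur @ (Ur.T @ x)
```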

Thus, consider a binary classification problem, where we want to classify whether a given digit, flattened to a vector in $\mathbb{R}^d$, represents the number 3 or the number 4. We will refer to the class of the number 3 as $C_1$, and to the class of the number 4 as $C_2$. Suppose that $N_1$ is the number of elements in $C_1$, while $N_2$ is the number of elements in $C_2$. If $X_1 \in \mathbb{R}^{d \times N_1}$ is the matrix such that its columns are a flattened version of each digit in $C_1$, and $X_2 \in \mathbb{R}^{d \times N_2}$ is the matrix such that its columns are a flattened version of each digit in $C_2$, consider

$$X_1 = U_1 \Sigma_1 V_1^T, \qquad X_2 = U_2 \Sigma_2 V_2^T,$$

the SVD decompositions of the two matrices.

If $x$ is a new, unknown digit, we can classify it by first flattening it to a vector in $\mathbb{R}^d$, then projecting it onto the column spaces of $X_1$ and $X_2$ and calling those projections

$$\pi_1(x) = U_1 U_1^T x, \qquad \pi_2(x) = U_2 U_2^T x.$$

Thus, $x$ will be classified as $C_1$ if $\| x - \pi_1(x) \|_2 < \| x - \pi_2(x) \|_2$ and, vice versa, it will be classified as $C_2$ if $\| x - \pi_2(x) \|_2 < \| x - \pi_1(x) \|_2$. We want to implement this idea in Python.

Binary Classification of MNIST Digits 3 and 4

In this first exercise, we will implement the binary classification algorithm for the digits 3 and 4 of MNIST following the ideas explained above.

  • Load the MNIST dataset contained in ./data/MNIST.mat with the function scipy.io.loadmat. This dataset, which is loaded in the form of a matrix $X$, contains the flattened versions of 1707 grayscale handwritten digits. Moreover, from the same file it is possible to load a vector $Y$ of length 1707 such that the $i$-th element of $Y$ is the true digit represented by the $i$-th image of $X$.
  • Visualize a few data points of $X$ with the function plt.imshow.
  • Extract from $X$ the columns corresponding to the digits 3 and 4. Those digits represent the classes $C_1$ and $C_2$ defined above.
  • Split the obtained dataset into training and testing sets. From now on, we will only consider the training set. The test set will be used at the end of the exercise to test the algorithm.
  • Create the matrices $X_1$ and $X_2$ defined above from the training set.
  • Compute the SVD decompositions of $X_1$ and $X_2$ with np.linalg.svd(matrix, full_matrices=False) and denote the $U$-parts of the two decompositions as $U_1$ and $U_2$.
  • Take an unknown digit $x$ from the test set and compute $\pi_1(x) = U_1 U_1^T x$ and $\pi_2(x) = U_2 U_2^T x$.
  • Compute the distances $d_1 = \| x - \pi_1(x) \|_2$ and $d_2 = \| x - \pi_2(x) \|_2$ and classify $x$ to $C_1$ if $d_1 < d_2$ and to $C_2$ if $d_2 < d_1$.
  • Repeat the experiment for different digits $x$ in the test set. Compute the number of misclassifications of this algorithm.
  • Repeat the experiment for pairs of digits other than 3 and 4. Is there a relationship between the visual similarity of the digits and the classification error?
  • Comment on the obtained results.
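A possible sketch of the whole pipeline is given below. The keys 'X' and 'I' used to read MNIST.mat are hypothetical (inspect the dictionary returned by loadmat to find the actual ones), and the assumption that digits are stored as columns, as well as the 80/20 split ratio, are illustrative choices.

```python
# Sketch only: the .mat keys, the column layout, and the split ratio are assumptions.
import numpy as np
from scipy.io import loadmat

mat = loadmat('./data/MNIST.mat')
X = mat['X']              # hypothetical key: columns are flattened digits
Y = mat['I'].ravel()      # hypothetical key: labels of the 1707 digits

# Keep only the digits 3 and 4 (classes C1 and C2)
mask = (Y == 3) | (Y == 4)
X, Y = X[:, mask], Y[mask]

# Random train/test split (80/20 is an arbitrary choice)
rng = np.random.default_rng(0)
idx = rng.permutation(X.shape[1])
n_train = int(0.8 * X.shape[1])
train, test = idx[:n_train], idx[n_train:]

X1 = X[:, train][:, Y[train] == 3]   # class C1 (digit 3), training columns
X2 = X[:, train][:, Y[train] == 4]   # class C2 (digit 4), training columns

U1, _, _ = np.linalg.svd(X1, full_matrices=False)
U2, _, _ = np.linalg.svd(X2, full_matrices=False)

# Classify each test digit by its distance from the two projections
misclassified = 0
for j in test:
    x = X[:, j]
    d1 = np.linalg.norm(x - U1 @ (U1.T @ x))
    d2 = np.linalg.norm(x - U2 @ (U2.T @ x))
    pred = 3 if d1 < d2 else 4
    misclassified += pred != Y[j]

print('misclassified:', misclassified, 'out of', len(test))
```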

Extension to Multiple Classification

The extension of this idea to the multiple classification task is trivial. Indeed, if we have more than 2 classes (say, $k$ different classes $C_1, \dots, C_k$), we just need to repeat the same procedure as before for each matrix $X_1, \dots, X_k$ to obtain the distances $d_1, \dots, d_k$. Then, the new digit $x$ will be classified as $C_i$ if $d_i$ is lower than $d_j$ for each $j \neq i$. Repeat the exercise above with a 3-digit example. Comment on the differences.
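As a sketch, the decision rule generalizes to an argmin over the per-class distances. Here the list Us of per-class $U$ matrices and the list classes of the corresponding digit labels are assumed to have been built beforehand, one entry per class:

```python
import numpy as np

def classify(x, Us, classes):
    # Us: list of U-matrices, one per class; classes: the corresponding digit labels
    dists = [np.linalg.norm(x - U @ (U.T @ x)) for U in Us]
    return classes[int(np.argmin(dists))]
```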

Clustering with PCA

The task for this exercise is to verify the ability of PCA to cluster data by projecting very high-dimensional data points to 2 or 3 dimensions. In particular, consider the MNIST dataset provided on Virtuale. This dataset contains images of handwritten digits of dimension $28 \times 28$, together with a number from 0 to 9 representing the label. You are asked to:

  • Load the dataset in memory and explore its head and shape to understand how the information is arranged inside it;
  • Split the dataset into a matrix $X$ of dimension $d \times N$, with $d$ being the dimension of each datum and $N$ the number of datapoints, and a vector $Y$ containing the corresponding labels;
  • Choose a number of digits (for example, 0, 6, and 9) and extract from $X$ and $Y$ the sub-dataset containing only the considered digits. Call those datasets $X$ and $Y$ again, since the originals are not required anymore;
  • Set $N_{\mathrm{train}} < N$ and randomly sample a training set with $N_{\mathrm{train}}$ datapoints from $X$ (and the corresponding labels from $Y$). Call them $X_{\mathrm{train}}$ and $Y_{\mathrm{train}}$. Everything else is the test set; call it $X_{\mathrm{test}}$ and $Y_{\mathrm{test}}$.
  • Implement the algorithm computing the PCA of $X_{\mathrm{train}}$ with a fixed number $k$ of principal components. Visualize the results (for $k = 2$) and the position of the centroid of each cluster;
  • Compute, for each cluster, the average distance from the centroid. Comment on the result;
  • Compute, for each cluster, the average distance from the centroid on the test set. Comment on the results;
  • Define a classification algorithm in this way: given a new observation $x$, compute the distance between $x$ (projected onto the PCA space) and each cluster centroid, and assign $x$ to the class corresponding to the closest centroid. Compute the accuracy of this algorithm on the test set;
  • Repeat this experiment for different values of $k$ and different digits. What do you observe?
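A minimal sketch of the PCA steps is given below. It assumes that X_train/X_test store the digits as columns and Y_train/Y_test the labels (these names are assumptions), and it implements PCA through the SVD of the centered data.

```python
# Sketch only: the data layout and variable names are assumptions.
import numpy as np
import matplotlib.pyplot as plt

def pca_fit(X, k):
    # Center the data and take the first k left singular vectors as principal directions
    mean = X.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, U[:, :k]

def pca_transform(X, mean, Uk):
    # Project the (centered) data onto the k principal directions: k x N matrix
    return Uk.T @ (X - mean)

k = 2
mean, Uk = pca_fit(X_train, k)
Z_train = pca_transform(X_train, mean, Uk)

# Scatter plot of the projected training data and the centroid of each digit's cluster
centroids = {}
for digit in np.unique(Y_train):
    Z_d = Z_train[:, Y_train == digit]
    centroids[digit] = Z_d.mean(axis=1)
    plt.scatter(Z_d[0], Z_d[1], s=5, label=str(digit))
    plt.scatter(*centroids[digit], c='k', marker='x')
plt.legend(); plt.show()

# Average distance from the centroid, per cluster, on the training set
for digit, c in centroids.items():
    Z_d = Z_train[:, Y_train == digit]
    print(digit, np.mean(np.linalg.norm(Z_d - c[:, None], axis=0)))

# Nearest-centroid classification on the test set
Z_test = pca_transform(X_test, mean, Uk)
digits = list(centroids)
preds = np.array([digits[np.argmin([np.linalg.norm(z - centroids[d]) for d in digits])]
                  for z in Z_test.T])
print('accuracy:', np.mean(preds == Y_test))
```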

Other assignments: