Interpreting first few PCA components for handwritten digit recognition


So in MATLAB I perform PCA on handwritten digits. Essentially, I have, say, 30x30-pixel images, i.e. 900 pixels each, and after PCA I keep the components that capture most of the variance, say the first 80 principal components (PCs) based on some threshold. Now these 80 PCs are also 900-dimensional, and when I plot them using imshow, I get images that look like a 0, 6, 3, 5, etc. What is the interpretation of these first few PCs (out of the 80 I extracted)?
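For reference, here is a minimal sketch of my setup (assuming the images are stored one per row in an n-by-900 matrix X; variable names are illustrative):

    % PCA on flattened 30x30 digit images (Statistics Toolbox).
    [coeff, score, latent] = pca(X);

    % Keep enough components to capture, e.g., 95% of the variance.
    explained = cumsum(latent) / sum(latent);
    k = find(explained >= 0.95, 1);             % ~80 in my case

    % Each column of coeff is a 900-dimensional eigenvector; reshape and
    % display the first few as 30x30 images.
    for i = 1:4
        subplot(1, 4, i);
        imshow(reshape(coeff(:, i), 30, 30), []);   % [] rescales for display
        title(sprintf('component %d', i));
    end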


2 Answers

Konstantin

PCA extracts the most important information from the data set and compresses it by keeping only that information, in the form of principal components.

The first principal component is constructed in such a way that it has the largest possible variance. The second component is computed under the constraints of being orthogonal to the first and of having the largest possible variance.

In your case the data is a set of images. Let's say you have 1000 images and you compute the first five principal components (5 images, constructed by the PCA algorithm). You may represent any image either as 900 data points (30x30 pixels) or as a combination of the 5 images with corresponding multiplication coefficients.

The goal of the PCA algorithm is to construct these 5 images (principal components) in such a way that the images in your data set are represented as accurately as possible by a combination of the given number of principal components.
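As a sketch of what that combination looks like in MATLAB (assuming, hypothetically, that your images are rows of a matrix X and that coeff and score come from pca(X)):

    % Rebuild image j from its first 5 principal components.
    mu = mean(X, 1);          % pca centers the data, so add the mean back
    j  = 1;                   % any image index
    approx = mu + score(j, 1:5) * coeff(:, 1:5)';   % 5 numbers -> 900 pixels

    subplot(1, 2, 1); imshow(reshape(X(j, :), 30, 30), []); title('original');
    subplot(1, 2, 2); imshow(reshape(approx, 30, 30), []);  title('5-component approximation');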

UPDATE:

Consider the image below (from the amazing book by Kevin Murphy). It shows how points in 2 dimensions (red dots) are represented in 1 dimension (green crosses) by projecting them onto a vector (purple line). That vector is the first principal component. The purpose of PCA is to construct these vectors so as to minimize the reconstruction error. In your case these vectors can themselves be displayed as images.

[Figure: red 2-D points projected as green crosses onto the first principal component (purple line)]
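A tiny synthetic 2-D demo of the same idea (purely illustrative):

    % Project correlated 2-D points onto their first principal direction:
    % this is the 1-D representation with the smallest squared
    % reconstruction error, as in the figure.
    rng(0);
    X2 = randn(200, 2) * [2 0; 1.5 0.5];    % synthetic correlated points
    Xc = X2 - mean(X2, 1);
    w  = pca(X2);                           % first column: max-variance direction
    proj = (Xc * w(:, 1)) * w(:, 1)';       % 1-D reconstructions in 2-D space

    plot(Xc(:, 1), Xc(:, 2), 'r.'); hold on;
    plot(proj(:, 1), proj(:, 2), 'gx');     % green crosses, as in the figure
    axis equal;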

You may refer to this article for more details on using PCA for handwritten digit recognition.

A. Donda

First a bit of nomenclature: PCA finds the eigenvectors of the data covariance matrix and then transforms the data into the basis of those eigenvectors. The result of this operation has two parts: the transformed data, and the eigenvectors used for the transformation. Usually it is the first that is called principal components (PCs). There is no established name for the second, but I like the term principal modes (PMs). Your question is about the interpretation of the PMs.
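In MATLAB's pca output these two parts come back separately. A sketch of the mapping, assuming the same hypothetical matrix X of flattened images (one per row) as in your question:

    % [coeff, score] = pca(X)   (Statistics Toolbox)
    %   coeff : 900-by-k; each COLUMN is an eigenvector of the covariance
    %           matrix, i.e. a "principal mode" in the terms used here
    %   score : n-by-k; each ROW holds one image's coordinates in that
    %           basis, i.e. its "principal components"
    [coeff, score] = pca(X);

    % The modes form an orthonormal basis ...
    disp(norm(coeff' * coeff - eye(size(coeff, 2))));   % ~0
    % ... and the component variances decrease from left to right.
    disp(var(score(:, 1:5)));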

The interpretation of PMs is generally hard, if not impossible. Of course they have a simple technical interpretation: PCA looks for directions in the data space along which a maximum amount of variation occurs, and the PMs are those directions. For the first component this maximization is unconstrained, and it captures the main direction of variation. Later components are constrained to be orthogonal to the earlier ones, which often leads to increasingly complex, high-frequency patterns that are harder and harder to interpret.
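You can see this directly by displaying an early and a late mode side by side (a sketch, reusing the hypothetical coeff from the pca(X) call above):

    % Early modes tend to look like smooth, digit-like blobs; later modes
    % look like increasingly high-frequency patterns.
    subplot(1, 2, 1); imshow(reshape(coeff(:, 1),  30, 30), []); title('mode 1');
    subplot(1, 2, 2); imshow(reshape(coeff(:, 80), 30, 30), []); title('mode 80');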

In the case of your data set, the situation may be different, because it probably has some kind of cluster structure in a very high-dimensional space, where the clusters correspond to the digits 0–9. It has been observed that in such a case there is a weak correspondence between PCA and k-means clustering, such that the first PMs tend to recover the space spanned by the cluster centroids. In this case the first PMs will be mixtures of cluster centroid patterns, possibly even approximately coinciding with these patterns. I think this is what explains your observation.


More information in response to the OP's comments:

The Wikipedia snippet cited above refers to Ding and He (2004), K-means Clustering via Principal Component Analysis (link). They write in the abstract: "Here we prove that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering." To my understanding this means that the "component load", i.e. the value of the principal component for a given data point, is, or is at least related to, an indicator of whether that data point belongs to a given cluster.

They continue: "Equivalently, we show that the subspace spanned by the cluster centroids are given by spectral expansion of the data covariance matrix truncated at K − 1 terms." This means that the subspace of the data space (in your case 900-dimensional) spanned by the first K − 1 principal modes (eigenvectors) is, or is close to, the space spanned by the differences between the cluster centroids (the mean image for each digit). If that is the case, most of the between-cluster variance is captured by those first principal components.
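One way to check this on your own data is to compare the two subspaces directly. A sketch, assuming the hypothetical X and coeff from above plus an n-by-1 vector labels of digit classes (labels is not part of the original question):

    % Compare the span of the first K-1 principal modes with the span of
    % the differences between the 10 class-mean images.
    K = 10;
    centroids = zeros(K, size(X, 2));
    for d = 0:9
        centroids(d + 1, :) = mean(X(labels == d, :), 1);
    end
    % Differences between centroids span a (K-1)-dimensional subspace.
    Dc = centroids(2:end, :) - centroids(1, :);

    % Largest principal angle between the two subspaces: values near 0
    % mean the first K-1 modes nearly contain the centroid subspace.
    theta = subspace(coeff(:, 1:K-1), Dc');
    fprintf('largest principal angle: %.3f rad\n', theta);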

Think of it this way: PCA allows you to reduce the 900 dimensions of the data to about 10 dimensions, by reconstructing all 30x30 images from a set of 10 "typical" 30x30 images. That means each image can be approximately encoded by 10 numbers instead of 900. For this to be possible, the "typical" images have to resemble what a "0", a "1", etc. look like on average. In the simplest case, the 10 images could just be the average "0", the average "1", etc. themselves. That is not normally the case, but it can hold approximately. Does this help? That the "0" corresponds to the strongest PC is just coincidence, I think.
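A sketch of that encoding, again with the hypothetical X, coeff, and score from above:

    % Encode every image with 10 numbers, then decode and measure the error.
    mu   = mean(X, 1);                    % pca works on centered data
    code = score(:, 1:10);                % n-by-10: the compressed images
    Xhat = mu + code * coeff(:, 1:10)';   % decode back to n-by-900
    relerr = norm(X - Xhat, 'fro') / norm(X - mu, 'fro');
    fprintf('relative reconstruction error: %.3f\n', relerr);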