How do I interpret a scatter_3d plot?


I have a subset of the MNIST handwritten digits dataset. I'm reducing its dimensionality with PCA, kernel PCA, LLE, and t-SNE, and plotting each result with Plotly Express's scatter_3d. As a beginner, I don't know how to interpret the figures. Please guide me.

from sklearn.decomposition import PCA
import plotly.express as px

pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_train)        # project onto the first 3 principal components
X_pca_r = pca.inverse_transform(X_pca)    # reconstruction in pixel space (not used for the plot)

# pass the coordinate columns directly; tip: color=y_train.astype(str) gives a discrete legend
fig = px.scatter_3d(x=X_pca[:, 0], y=X_pca[:, 1], z=X_pca[:, 2], color=y_train)
fig.show()

I get the following figure:

[figure: pca plot]

Then, using KernelPCA:

from sklearn.decomposition import KernelPCA

# the default kernel is 'linear'; try e.g. kernel='rbf' for a nonlinear mapping
kpca = KernelPCA(n_components=3, fit_inverse_transform=True)
X_kpca = kpca.fit_transform(X_train)
X_kpca_r = kpca.inverse_transform(X_kpca)  # approximate reconstruction (not used for the plot)
px.scatter_3d(x=X_kpca[:, 0], y=X_kpca[:, 1], z=X_kpca[:, 2], color=y_train).show()

results in this figure:

[figure: kernel pca plot]

Similarly, using LocallyLinearEmbedding:

from sklearn.manifold import LocallyLinearEmbedding

lle = LocallyLinearEmbedding(n_components=3)  # default n_neighbors=5
X_lle = lle.fit_transform(X_train)
px.scatter_3d(x=X_lle[:, 0], y=X_lle[:, 1], z=X_lle[:, 2], color=y_train).show()

results in the following figure:

[figure: lle plot]

Lastly, using TSNE:

from sklearn.manifold import TSNE

tsne = TSNE(n_components=3)  # 3-D t-SNE; can be slow on larger subsets
X_tsne = tsne.fit_transform(X_train)
px.scatter_3d(x=X_tsne[:, 0], y=X_tsne[:, 1], z=X_tsne[:, 2], color=y_train).show()

results in the following figure:

[figure: tsne plot]

1 Answer

Please feel free to comment if I have misunderstood your question; if you tell me which specific part is troubling you, I will try to condense the answer.

In my experience, 3 dimensions will not be enough to classify handwritten digits very well, in the same way that a 3-pixel display cannot represent digits in a way that resembles how they look when written by hand. This is why the graphs might not make intuitive sense, although points of the same colour, corresponding to the same digit, are somewhat grouped in the graphs; for example, the yellow spheres are the digit 9.

In other datasets, where 3 features are enough to classify the data, you might see the data form distinct clusters. The larger the distance between clusters (the inter-cluster distance) and the smaller the distance between points within the same cluster (the intra-cluster distance), the better; a small sketch after the links below shows how to compute both. A much-used example is the Iris flower dataset:

Data: https://www.kaggle.com/datasets/arshid/iris-flower-dataset

Example, with visualisation: https://www.kaggle.com/code/imdevskp/plotly-express-3d-scatter-plot-iris-data/notebook

This page shows the concepts of cluster distances quite well: https://www.geeksforgeeks.org/ml-intercluster-and-intracluster-distance/

The figures are 2-dimensional, but the basic principles work in higher dimensions.
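
As a rough illustration of these two distances, here is a minimal sketch that computes them for one of the embeddings; it assumes X_pca and y_train from the question are NumPy arrays:

import numpy as np

def cluster_distances(X, y):
    # centroid of each digit's cluster in the embedding
    labels = np.unique(y)
    centroids = np.array([X[y == lab].mean(axis=0) for lab in labels])
    # intra-cluster: mean distance from each point to its own cluster centroid
    intra = np.mean([np.linalg.norm(X[y == lab] - c, axis=1).mean()
                     for lab, c in zip(labels, centroids)])
    # inter-cluster: mean pairwise distance between cluster centroids
    pair = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    inter = pair[np.triu_indices(len(labels), k=1)].mean()
    return intra, inter

intra, inter = cluster_distances(X_pca, y_train)
print(f"intra: {intra:.2f}  inter: {inter:.2f}")  # small intra / large inter = well separated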

I would recommend that you look into numerical indicators rather than figures, as most problems work best with more than 3 dimensions, which cannot be shown in a figure.
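
One such indicator is scikit-learn's silhouette score, which combines intra- and inter-cluster distances into a single value in [-1, 1], with higher meaning better-separated clusters. A minimal sketch, assuming the four embeddings and y_train from the question are available:

from sklearn.metrics import silhouette_score

# compare the four embeddings numerically instead of visually
for name, X_emb in [("PCA", X_pca), ("KernelPCA", X_kpca),
                    ("LLE", X_lle), ("t-SNE", X_tsne)]:
    print(name, silhouette_score(X_emb, y_train))  # higher = better separation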

In continuation of this, you should also look into how the packages report the significance of each principal component/dimension, to better determine how many features to include in the analysis.
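
For ordinary PCA, for example, scikit-learn exposes explained_variance_ratio_ on the fitted estimator; a common rule of thumb is to keep enough components to reach some cumulative threshold (the 95% used here is only an example):

import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA().fit(X_train)          # fit all components to see the full spectrum
cum = np.cumsum(pca_full.explained_variance_ratio_)
print(cum[2])                          # variance captured by the 3 plotted components
print(np.argmax(cum >= 0.95) + 1)      # components needed for 95% of the variance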

Lastly, I would recommend adjusting the size of the spheres in your graphs so that they do not overlap each other as much, although this is difficult with a large number of data points.
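
In Plotly Express, one way to do this is to shrink the markers and lower their opacity; a sketch for the PCA figure (the values 2 and 0.6 are just starting points):

fig = px.scatter_3d(x=X_pca[:, 0], y=X_pca[:, 1], z=X_pca[:, 2],
                    color=y_train, opacity=0.6)
fig.update_traces(marker_size=2)  # smaller markers overlap less
fig.show()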