I am working with a relatively big data set (only using about 1/32 of it, but this subset is approx. 50000x9000). In order to perform analysis on this, I have taken several steps to reduce the dimensionality, so that I can then apply some sort of clustering algorithm.
Take a look at the following data frame:
set.seed(340)
df = data.frame(replicate(10,sample(0:10,size = 10,replace = TRUE)))
> df
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 4 9 4 6 9 4 2 5 8 8
2 5 8 2 0 4 6 1 1 0 10
3 1 7 6 3 5 9 6 0 7 1
4 0 6 8 6 6 0 5 5 10 10
5 2 0 5 8 2 10 8 2 1 5
6 3 9 10 2 8 5 2 10 3 10
7 9 0 1 0 6 8 9 6 5 0
8 5 6 9 3 10 4 4 8 6 9
9 8 7 6 2 10 9 9 7 1 10
10 0 7 2 6 1 6 3 2 3 9
Each row represents a person, and each variable says how often that person exhibited that quality. Say I perform principal component analysis on this using princomp(), and collect the first four PCs to use for k-means.
pc = princomp(df)
new_df = cbind(pc$loadings[,1],pc$loadings[,2],pc$loadings[,3],pc$loadings[,4])
fit = kmeans(new_df,2)
From this I can deduce which cluster exhibits high values of which principal components, and I can use the loadings to see what each principal component is a general measure of. However, I would ultimately like to connect this information to my original data set. Is there a way to assign each person in the original data to a cluster created by the k-means on the principal component analysis? Or am I misunderstanding the concept of PCA?
pc$loadings gives the coordinates of the input variables, not those of the individuals. So kmeans(new_df,2) classifies variables, not individuals. To check this, try your code with a 10x5 data.frame instead of a 10x10 one: you only get 5 cluster assignments, one per variable.
If that is what you want to do, then you can just rbind fit$cluster to your original data.frame and you will have the cluster of each variable.
However, if you intended to cluster individuals, i.e. rows of your original data.frame, you need to perform the clustering on the row coordinates produced by the principal component analysis. I don't know how to access those in princomp, but other PCA methods allow this easily. FactoMineR::PCA outputs a list with row coordinates ($ind$coord) and column coordinates ($var$coord).
Those row coordinates can then be added to your original data.frame as extra columns.
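For what it's worth, princomp also keeps the row coordinates, in its scores component, so the same idea can be sketched in base R. The example below is only an illustration: it uses a made-up 10x5 data.frame (as in the check suggested above) so that variables and individuals are easy to tell apart, and the names var_fit, ind_fit, scores and result are my own.

```r
# Hypothetical 10x5 data: 10 individuals (rows), 5 variables (columns)
set.seed(340)
df <- data.frame(replicate(5, sample(0:10, size = 10, replace = TRUE)))

pc <- princomp(df)

# Loadings have one row per variable, so this clusters the 5 variables
var_fit <- kmeans(pc$loadings[, 1:4], centers = 2)
length(var_fit$cluster)  # 5

# Scores have one row per individual, so this clusters the 10 people
scores <- pc$scores[, 1:4]
ind_fit <- kmeans(scores, centers = 2)

# Attach the row coordinates and the cluster label to the original rows
result <- cbind(df, scores, cluster = ind_fit$cluster)
```

With FactoMineR, the analogous row coordinates would come from PCA(df, graph = FALSE)$ind$coord. Note that PCA scales the variables by default, so the coordinates will not be identical to those from princomp.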