I'm using R's built-in correlation matrix and hierarchical clustering methods to segment daily sales data into 10 clusters. Then, I'd like to create agglomerated daily sales data by cluster. I've got as far as creating a cutree() object, but am stumped on extracting only the column names in the cutree object where the cluster number is 1, for example.
For simplicity's sake, I'll use the EuStockMarkets data set and cut the tree into 2 segments; bear in mind that I'm working with thousands of columns here, so the solution needs to be scalable:
data=as.data.frame(EuStockMarkets)
corrMatrix<-cor(data)
dissimilarity<-round(((1-corrMatrix)/2), 3)
distSimilarity<-as.dist(dissimilarity)
hirearchicalCluster<-hclust(distSimilarity)
treecuts<-cutree(hirearchicalCluster, k=2)
Now I get stuck. I want to extract only the column names from treecuts where the cluster number is equal to 1, for example. But the object that cutree() makes is not a data frame, which makes subsetting difficult. I've tried to convert treecuts into a data frame, but R does not create a column for the row names; all it does is coerce the numbers into a row with the name treecuts.
I would want to do the following operations:
# ... code that converts treecuts into a data frame called "treeIDs"
# with the columns "Index" and "Cluster" ...
cluster1Columns <- treeIDs$Index[treeIDs$Cluster == 1]
cluster1DF <- data[, colnames(data) %in% cluster1Columns, drop = FALSE]
rowSums(cluster1DF)
...and voila, I'm done.
Thoughts/suggestions?
Here is the solution:
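A minimal sketch of one way to do this, assuming treecuts is the named integer vector that cutree() returns in the question (names are the column labels, values are the cluster numbers). The objects are rebuilt here only so the snippet runs on its own; the treeIDs data frame is optional and follows the column names asked for in the question.

```r
# Rebuild the objects from the question so the snippet is self-contained.
data <- as.data.frame(EuStockMarkets)
dissimilarity <- round((1 - cor(data)) / 2, 3)
treecuts <- cutree(hclust(as.dist(dissimilarity)), k = 2)

# treecuts is a named integer vector, so plain logical indexing
# on its names gives the columns that fall in cluster 1.
cluster1Columns <- names(treecuts)[treecuts == 1]

# Subset the original data by column name and sum across the cluster.
# drop = FALSE keeps a data frame even if the cluster has one column.
cluster1DF <- data[, cluster1Columns, drop = FALSE]
cluster1Sums <- rowSums(cluster1DF)

# Optional: the "Index"/"Cluster" data frame described in the question.
treeIDs <- data.frame(Index = names(treecuts), Cluster = unname(treecuts))
```

Because the lookup is a single vectorised comparison, it should behave the same way with thousands of columns as with the four in EuStockMarkets.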
If you also want the column names for, say, cluster 2 (or higher), you can use %in%.
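As a sketch of the %in% variant, assuming the same named vector from cutree(): the names wanted, selectedColumns, columnsByCluster and clusterSums are illustrative, not from the original post. The last two lines go one step further than the text above and build one summed series per cluster in a single pass, which may suit the "agglomerated sales by cluster" goal when there are many clusters.

```r
# Rebuild the objects from the question so the snippet is self-contained.
data <- as.data.frame(EuStockMarkets)
dissimilarity <- round((1 - cor(data)) / 2, 3)
treecuts <- cutree(hclust(as.dist(dissimilarity)), k = 2)

# Column names belonging to any of several clusters (here 1 and 2).
wanted <- c(1, 2)
selectedColumns <- names(treecuts)[treecuts %in% wanted]

# Illustrative extension: group column names by cluster number,
# then sum each group's columns row by row.
columnsByCluster <- split(names(treecuts), treecuts)
clusterSums <- sapply(columnsByCluster,
                      function(cols) rowSums(data[, cols, drop = FALSE]))
```

Here clusterSums is a matrix with one row per observation and one column per cluster, so no per-cluster copy-and-paste is needed as k grows.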