I would like to add new coordinates, taken from another matrix, to my scatterplot. I am using the fviz_cluster function to plot the clusters. I would like to add the coordinates stored in the matrix center_mass (the center of mass of each cluster) to the graph, since they are the best location in each cluster for installing a manure composting machine. So far I can generate the scatterplot only for the properties, as attached. The code is below:
> library(readxl)
> df <- read_excel('C:/Users/testbase.xlsx') #matrix containing waste production, latitude and longitude
> dim (df)
[1] 19 3
> d<-dist(df)
> fit.average<-hclust(d,method="average")
> clusters<-cutree(fit.average, k=6)
> df$cluster <- clusters # inserting column with determination of clusters
> df
Latitude Longitude Waste cluster
<dbl> <dbl> <dbl> <int>
1 -23.8 -49.6 526. 1
2 -23.8 -49.6 350. 2
3 -23.9 -49.6 526. 1
4 -23.9 -49.6 469. 3
5 -23.9 -49.6 285. 4
6 -23.9 -49.6 175. 5
7 -23.9 -49.6 175. 5
8 -23.9 -49.6 350. 2
9 -23.9 -49.6 350. 2
10 -23.9 -49.6 175. 5
11 -23.9 -49.7 350. 2
12 -23.9 -49.7 175. 5
13 -23.9 -49.7 175. 5
14 -23.9 -49.7 364. 2
15 -23.9 -49.7 175. 5
16 -23.9 -49.6 175. 5
17 -23.9 -49.6 350. 2
18 -23.9 -49.6 45.5 6
19 -23.9 -49.6 54.6 6
> ########Generate scatterplot
> library(factoextra)
> fviz_cluster(list(data = df, cluster = clusters))
>
>
> ##Center of mass, best location of each cluster for installation of manure composting machine
> center_mass<-matrix(nrow=6,ncol=2)
> for(i in 1:6){
+ center_mass[i,]<-c(weighted.mean(subset(df,cluster==i)$Latitude,subset(df,cluster==i)$Waste),
+ weighted.mean(subset(df,cluster==i)$Longitude,subset(df,cluster==i)$Waste))}
> center_mass<-cbind(center_mass,matrix(c(1:6),ncol=1)) #including the index of the clusters
> head (center_mass)
[,1] [,2] [,3]
[1,] -23.85075 -49.61419 1
[2,] -23.86098 -49.64558 2
[3,] -23.86075 -49.61350 3
[4,] -23.86658 -49.61991 4
[5,] -23.86757 -49.63968 5
[6,] -23.89749 -49.62372 6
New scatterplot
Scatterplot considering Longitude and Latitude
vars = c("Longitude", "Latitude")
gg <- fviz_cluster(list(data = df, cluster = df$cluster), choose.vars = vars)
gg
This answer shows the solution using the fviz_cluster() function of the factoextra package, instead of the mock example included in my previous answer.
Starting from the data frame posted by the OP, which already includes the clusters found by hclust() and cutree(), we start by generating the plot of the clusters using fviz_cluster():
which gives:
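The call that produces this plot could look like the following minimal sketch. It assumes the data frame is rebuilt from the values as printed in the question (these are rounded by the tibble display, so the result will differ slightly from a plot of the full-precision data), and it passes only the three measured columns as data so that the cluster column is not part of the data matrix. The name vars is an illustrative choice.

```r
library(factoextra)

# Values as printed in the question (rounded by the tibble display),
# so results will differ slightly from the OP's full-precision data.
df <- data.frame(
  Latitude  = c(-23.8, -23.8, rep(-23.9, 17)),
  Longitude = c(rep(-49.6, 10), rep(-49.7, 5), rep(-49.6, 4)),
  Waste     = c(526, 350, 526, 469, 285, 175, 175, 350, 350, 175,
                350, 175, 175, 364, 175, 175, 350, 45.5, 54.6),
  cluster   = c(1, 2, 1, 3, 4, 5, 5, 2, 2, 5, 2, 5, 5, 2, 5, 5, 2, 6, 6)
)

# Only the analyzed variables go into `data`; the cluster labels are
# passed separately so they do not take part in the PCA.
vars <- c("Latitude", "Longitude", "Waste")
gg <- fviz_cluster(list(data = df[, vars], cluster = df$cluster))
gg
```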
Note that this plot is different from the one shown by the OP. The reason is that the code used by the OP causes the cluster variable present in df to be included in the computation of the principal components on which the plot is based, since all variables in the input data frame are used to generate the plot. (This conclusion was reached by looking at the source code of fviz_cluster() and running it in debug mode.)
Now we compute the Waste-weighted center of each cluster as well as the per-cluster average of Waste (needed below to add the centers to the plot):
(note that the code is now generalized to any number of clusters found)
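One way to compute these per-cluster values in base R is sketched below. It assumes the rounded values printed in the question, so the centers agree with the OP's center_mass matrix only to about two decimals. The name centers is an illustrative choice.

```r
# Values as printed in the question (rounded by the tibble display).
df <- data.frame(
  Latitude  = c(-23.8, -23.8, rep(-23.9, 17)),
  Longitude = c(rep(-49.6, 10), rep(-49.7, 5), rep(-49.6, 4)),
  Waste     = c(526, 350, 526, 469, 285, 175, 175, 350, 350, 175,
                350, 175, 175, 364, 175, 175, 350, 45.5, 54.6),
  cluster   = c(1, 2, 1, 3, 4, 5, 5, 2, 2, 5, 2, 5, 5, 2, 5, 5, 2, 6, 6)
)

# Split the rows by cluster and summarise each group; this works for
# any number of clusters, not just 6.
centers <- do.call(rbind, lapply(split(df, df$cluster), function(g) {
  data.frame(
    cluster   = g$cluster[1],
    Latitude  = weighted.mean(g$Latitude,  g$Waste),  # Waste-weighted
    Longitude = weighted.mean(g$Longitude, g$Waste),  # Waste-weighted
    Waste     = mean(g$Waste)                         # plain average
  )
}))
centers
```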
which gives:
Now the most interesting part starts: adding the weighted centers to the plot. Since the plot is done on the principal component axes, we need to compute the principal component coordinates for the centers.
This is achieved by running the principal component analysis (PCA) on the full data and applying the PCA axis rotation to the coordinates of the centers. There are at least two functions in the stats package of R that can be used to run PCA: prcomp() and princomp(). The preferred method is prcomp() (as it uses Singular Value Decomposition to perform the eigenanalysis and uses the usual N-1 as divisor for the variance, as opposed to N, which is used by princomp()). In addition, prcomp() is the function used by fviz_cluster().
Therefore:
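A minimal sketch of this PCA step follows. It assumes the three analyzed variables are centered and scaled before the rotation, which is how the answer describes the behaviour of fviz_cluster(). Because the data frame here uses the rounded values printed in the question, the proportions of explained variance will not match the OP's figures exactly.

```r
# Values as printed in the question (rounded by the tibble display).
df <- data.frame(
  Latitude  = c(-23.8, -23.8, rep(-23.9, 17)),
  Longitude = c(rep(-49.6, 10), rep(-49.7, 5), rep(-49.6, 4)),
  Waste     = c(526, 350, 526, 469, 285, 175, 175, 350, 350, 175,
                350, 175, 175, 364, 175, 175, 350, 45.5, 54.6)
)

vars <- c("Latitude", "Longitude", "Waste")

# Center and scale each variable, then run the PCA via SVD.
pca <- prcomp(df[, vars], center = TRUE, scale. = TRUE)

# Standard deviations and proportion of variance per component.
summary(pca)
```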
which gives:
Observe that the proportions of variance explained by the first 2 PCs coincide with those shown in the initial plot of the clusters, namely 50.1% and 30.1%, respectively.
We now center and scale the weighted centers, using the same center and scaling operation performed on the full data (this is needed for plotting):
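This step can be sketched as follows, again assuming the rounded values printed in the question. The centers are shifted and scaled with the center and scale stored in the prcomp() result and then multiplied by its rotation matrix. The names centers, centers_scaled and centers_pc are illustrative.

```r
# Values as printed in the question (rounded by the tibble display).
df <- data.frame(
  Latitude  = c(-23.8, -23.8, rep(-23.9, 17)),
  Longitude = c(rep(-49.6, 10), rep(-49.7, 5), rep(-49.6, 4)),
  Waste     = c(526, 350, 526, 469, 285, 175, 175, 350, 350, 175,
                350, 175, 175, 364, 175, 175, 350, 45.5, 54.6),
  cluster   = c(1, 2, 1, 3, 4, 5, 5, 2, 2, 5, 2, 5, 5, 2, 5, 5, 2, 6, 6)
)
vars <- c("Latitude", "Longitude", "Waste")

# Waste-weighted centers (one row per cluster) plus mean Waste.
centers <- t(sapply(split(df, df$cluster), function(g) c(
  Latitude  = weighted.mean(g$Latitude,  g$Waste),
  Longitude = weighted.mean(g$Longitude, g$Waste),
  Waste     = mean(g$Waste)
)))

pca <- prcomp(df[, vars], center = TRUE, scale. = TRUE)

# Apply the SAME centering and scaling that was applied to the full data,
# then rotate onto the principal component axes.
centers_scaled <- scale(centers[, vars], center = pca$center, scale = pca$scale)
centers_pc <- centers_scaled %*% pca$rotation
centers_pc
```

predict(pca, newdata = centers) performs the same centering, scaling and rotation in a single call.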
Finally we add the Waste-weighted centers to the plot (as red filled points) and the Waste values as labels. Here we distinguish between the number of analyzed variables (nvars) being 2 or greater than 2, since fviz_cluster() only performs PCA when nvars > 2; when nvars = 2 it just scales the variables.
which gives (when nvars = 3):
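The step of adding the centers could be sketched as follows. This is a sketch under stated assumptions, not the original code: it uses the rounded values printed in the question, assumes fviz_cluster() scales the data and (for more than two variables) plots the first two prcomp() components as described above, and uses illustrative names such as centers_plot. The extra layers set inherit.aes = FALSE so they do not depend on the aesthetics of the base plot.

```r
library(factoextra)
library(ggplot2)

# Values as printed in the question (rounded by the tibble display).
df <- data.frame(
  Latitude  = c(-23.8, -23.8, rep(-23.9, 17)),
  Longitude = c(rep(-49.6, 10), rep(-49.7, 5), rep(-49.6, 4)),
  Waste     = c(526, 350, 526, 469, 285, 175, 175, 350, 350, 175,
                350, 175, 175, 364, 175, 175, 350, 45.5, 54.6),
  cluster   = c(1, 2, 1, 3, 4, 5, 5, 2, 2, 5, 2, 5, 5, 2, 5, 5, 2, 6, 6)
)
vars <- c("Latitude", "Longitude", "Waste")   # nvars = 3 here

# Waste-weighted centers per cluster plus mean Waste (used as label).
centers <- t(sapply(split(df, df$cluster), function(g) c(
  Latitude  = weighted.mean(g$Latitude,  g$Waste),
  Longitude = weighted.mean(g$Longitude, g$Waste),
  Waste     = mean(g$Waste)
)))

# Scale the centers with the mean and sd of the full data.
centers_scaled <- scale(centers[, vars, drop = FALSE],
                        center = colMeans(df[, vars]),
                        scale  = apply(df[, vars], 2, sd))

if (length(vars) > 2) {
  # nvars > 2: the plot axes are the first two principal components.
  pca <- prcomp(df[, vars], center = TRUE, scale. = TRUE)
  coords <- centers_scaled %*% pca$rotation[, 1:2]
} else {
  # nvars = 2: the plot axes are just the two scaled variables.
  coords <- centers_scaled
}

centers_plot <- data.frame(x = coords[, 1], y = coords[, 2],
                           label = round(centers[, "Waste"], 1))

gg <- fviz_cluster(list(data = df[, vars], cluster = df$cluster))
p <- gg +
  geom_point(data = centers_plot, aes(x = x, y = y),
             colour = "red", size = 3, inherit.aes = FALSE) +
  geom_text(data = centers_plot, aes(x = x, y = y, label = label),
            colour = "red", vjust = -1, inherit.aes = FALSE)
p
```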
Note, however, that the red points essentially coincide with the original cluster centers computed by fviz_cluster(). This is because the Waste-weighted averages of Latitude and Longitude are almost the same as their respective non-weighted averages (furthermore, the only center that slightly differs between the two calculation methods is the center for cluster 2, as seen by comparing the weighted and unweighted averages per cluster (not done here)).