Cluster Analysis Visualisation: Colouring the Clusters by a Categorical Variable

Hi folks! I'm still quite new to ggplot and trying to understand it, but I really need some help here.

Edit: Reproducible data from my dataset "Daten_ohne_Cluster_NA" (first 25 rows):

structure(list(ntaxa = c(2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 
6, 6, 6, 6, 6, 5, 8, 8, 7, 7, 6, 5, 5), mpd.obs.z = c(-1.779004391, 
-1.721014957, -1.77727283, -1.774642404, -1.789386039, -1.983401439, 
-0.875426386, -2.276052068, -2.340365105, -2.203126078, -2.394158227, 
-2.278173635, -1.269075471, -1.176760985, -1.313045215, -1.164289676, 
-1.247549961, -0.868174033, -2.057106804, -2.03154772, -1.691850922, 
-1.224391713, -0.93993654, -0.39315089, -0.418380361), mntd.obs.z = c(-1.759874454, 
-1.855202792, -1.866281778, -1.798439855, -1.739998395, -1.890847575, 
-0.920672112, -1.381541177, -1.382847758, -1.394870597, -1.339878669, 
-1.349541665, -0.516793786, -0.525476292, -0.557425575, -0.539534996, 
-0.521299478, -0.638951825, -1.06467985, -1.033009266, -0.758380203, 
-0.572401837, -0.166616844, 0.399510209, 0.314591018), pe = c(0.046370234, 
0.046370234, 0.046370234, 0.046370234, 0.046370234, 0.046370234, 
0.071665745, 0.118619482, 0.118619482, 0.118619482, 0.118619482, 
0.118619482, 0.205838414, 0.205838414, 0.205838414, 0.205838414, 
0.205838414, 0.179091659, 0.215719118, 0.215719118, 0.212092271, 
0.315391478, 0.312205596, 0.305510773, 0.305510773), ECO_NUM = c(1, 
6, 6, 1, 7, 6, 6, 6, 6, 6, 6, 7, 7, 6, 1, 6, 6, 6, 6, 6, 6, 7, 
7, 7, 6)), row.names = c(NA, -25L), class = c("tbl_df", "tbl", 
"data.frame"))

(1) I prepared my data frame like this:

Daten_Cluster <- Daten[, c("ntaxa", "mpd.obs.z", "mntd.obs.z", "pe", "ECO_NUM")]

(2) I removed all the NAs with na.omit. The result has 6 variables with 3811 objects each. The column ECO_NUM encodes the different ecoregions as a categorical variable stored as numbers.
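A minimal sketch of what "categorical but stored as numbers" means in R, using only the first seven ECO_NUM values from the rows posted above. The name `eco` is made up for this illustration. As far as I understand, ggplot2 treats a numeric column as continuous and a factor as discrete, so converting with `factor()` is usually the first step before colouring by group.

```r
# First seven ECO_NUM values from the posted rows (stored as plain numbers)
ECO_NUM <- c(1, 6, 6, 1, 7, 6, 6)

is.numeric(ECO_NUM)   # TRUE: R sees a continuous variable here

# Convert the codes to a factor so they are treated as categories
eco <- factor(ECO_NUM)

levels(eco)           # "1" "6" "7"
nlevels(eco)          # 3 distinct ecoregions in these seven rows
```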

(3) Then I ran a cluster analysis with kmeans. I used 31 groups because there are 31 ecoregions in my dataset and the aim is to colour the plot by ecoregion later on.

Biomes_Clus <- kmeans(Daten_Cluster_ohne_NA, 31, iter.max = 10, nstart = 25)

(4) Then I followed the online instructions from datanovia.com on how to visualise a k-means cluster analysis (I always just follow these how-tos, as I have no idea how to do it all by myself). I tried to change the arguments accordingly to colour by ecoregion.

fviz_cluster(Biomes_Clus, data = Daten_Cluster_ohne_NA,
             geom = "point",
             ellipse.type = "convex", 
             ggtheme = theme_bw(),
) +
  stat_mean(aes(color = Daten_Cluster_ohne_NA$ECO_NUM), size = 4)

I get more than 50 warnings here, I guess one for each object, saying:

In grid.Call.graphics(C_points, x$x, x$y, x$pch, x$size) : unimplemented pch value '30'

I know that there are not enough pch symbols for 31 groups, but I don't need them either. I would just like to have the points coloured.

I also tried the other function, ggscatter, and created my own colour palette (called P36) with more than 31 colours, so that there are enough colours for the ecoregions.

ggscatter(
  ind.coord, x = "Dim.1", y = "Dim.2", 
  color = "Species", palette = "P36", ellipse = TRUE, ellipse.type = "convex",
  legend = "right", ggtheme = theme_bw(),
  xlab = paste0("Dim 1 (", variance.percent[1], "% )" ),
  ylab = paste0("Dim 2 (", variance.percent[2], "% )" )
) +
  stat_mean(aes(color = cluster), size = 4)

The error here is that a discrete value was supplied to a continuous scale.

The question is: how can I easily colour the outcome of my kmeans run (which worked) not by the newly clustered groups but by the ecoregions (to visualise whether there is a difference between the clusters and the ecoregion groups)?
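One possible way to approach this, sketched with plain ggplot2 rather than fviz_cluster or ggscatter. Everything below is an assumption for illustration, not a confirmed fix: the data are the 25 posted rows rounded to three decimals, k is 3 instead of 31 because only 25 rows are available here, ECO_NUM is left out of the clustering so that the clusters can be compared against it, and the names `dat`, `num_vars`, `plot_df` and `centres` are made up.

```r
library(ggplot2)

# Stand-in data: the 25 rows posted above, rounded to three decimals
dat <- data.frame(
  ntaxa = c(2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 5, 8, 8, 7, 7, 6, 5, 5),
  mpd.obs.z = c(-1.779, -1.721, -1.777, -1.775, -1.789, -1.983, -0.875,
                -2.276, -2.340, -2.203, -2.394, -2.278, -1.269, -1.177,
                -1.313, -1.164, -1.248, -0.868, -2.057, -2.032, -1.692,
                -1.224, -0.940, -0.393, -0.418),
  mntd.obs.z = c(-1.760, -1.855, -1.866, -1.798, -1.740, -1.891, -0.921,
                 -1.382, -1.383, -1.395, -1.340, -1.350, -0.517, -0.525,
                 -0.557, -0.540, -0.521, -0.639, -1.065, -1.033, -0.758,
                 -0.572, -0.167, 0.400, 0.315),
  pe = c(0.046, 0.046, 0.046, 0.046, 0.046, 0.046, 0.072, 0.119, 0.119,
         0.119, 0.119, 0.119, 0.206, 0.206, 0.206, 0.206, 0.206, 0.179,
         0.216, 0.216, 0.212, 0.315, 0.312, 0.306, 0.306),
  ECO_NUM = c(1, 6, 6, 1, 7, 6, 6, 6, 6, 6, 6, 7, 7, 6, 1, 6, 6, 6, 6, 6,
              6, 7, 7, 7, 6)
)

# Cluster on the numeric variables only; ECO_NUM is kept aside as a label
num_vars <- c("ntaxa", "mpd.obs.z", "mntd.obs.z", "pe")
set.seed(1)  # kmeans uses random starts; fix the seed for repeatability
km <- kmeans(scale(dat[, num_vars]), centers = 3, iter.max = 10, nstart = 25)

# Cross-tabulate ecoregions against clusters as a numeric check
tab <- table(ecoregion = factor(dat$ECO_NUM), cluster = km$cluster)
print(tab)

# Project onto the first two principal components for plotting
pca <- prcomp(dat[, num_vars], scale. = TRUE)
plot_df <- data.frame(
  Dim.1 = pca$x[, 1],
  Dim.2 = pca$x[, 2],
  cluster = factor(km$cluster),
  ecoregion = factor(dat$ECO_NUM)  # factor, so ggplot2 uses a discrete scale
)

# Mean position of each cluster in the PCA plane, used for the labels
centres <- aggregate(cbind(Dim.1, Dim.2) ~ cluster, data = plot_df, FUN = mean)

# Points coloured by ecoregion; cluster numbers printed at the cluster means
p <- ggplot(plot_df, aes(Dim.1, Dim.2)) +
  geom_point(aes(colour = ecoregion), size = 2) +
  geom_text(data = centres, aes(label = cluster), fontface = "bold") +
  theme_bw()
print(p)
```

Since shape is never mapped to anything here, the number of available pch symbols should not matter. With all 31 ecoregions, a custom palette could presumably be supplied by adding `scale_colour_manual(values = P36)`, assuming P36 is a character vector with at least 31 colours.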

I appreciate your help, and my group partner and I would be very thankful!! :) Greetings, Evelyn
