I'm using hclust to perform a cluster analysis of plant species cover data across sampling sites.
My study recorded percent cover of 55 species at 100 sites. Plant cover at each site was measured in cover classes from 0 to 4, where 0 is absent, 1 is 1-25% cover ... and 4 is 76-100% cover.
I'm using Euclidean distance to measure dissimilarity in species cover between sites, and I want to know which plant species are driving the grouping at each branch of the dendrogram. See the sample data frame and code below; each row represents a site.
In the simplified example, I can see that sp1 is driving the association of sites 3 & 4. In my very large dataset, how could I determine which species is/are driving the associations at different levels of my dendrogram?
Please let me know if I can clarify. Thanks for your help!
library(tidyverse)
site <- c(1:10)
sp1 <- c(0,1,4,4,3,3,2,1,0,2)
sp2 <- c(4,3,0,0,2,2,3,2,1,3)
sp3 <- c(3,2,1,1,2,2,3,2,1,3)
sp4 <- c(2,4,1,0,1,2,3,4,3,1)
df <- data.frame(site, sp1, sp2, sp3, sp4)
species <- select(df, sp1:sp4)
dend <- species %>%
  dist(method = "euclidean") %>%
  hclust(method = "ward.D") %>%
  as.dendrogram()
plot(dend, ylab = "Euclidean Distance")
Following up: I ended up assigning the sites in each cluster to an arbitrary Association group and then running an indicator species analysis on those groups with the multipatt function from the indicspecies package. This let me identify the species that were significantly associated with each group, i.e. the ones driving the clustering.
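For anyone who finds this later, here is a minimal sketch of that follow-up on the toy data from the question. It is not my exact analysis: the number of groups (k = 3), the choice of func = "IndVal.g", and the 999 permutations are assumptions for illustration, and the object names hc, assoc and ind are made up here. It assumes the indicspecies package is installed.

```r
library(indicspecies)  # provides multipatt(); assumed to be installed

# Same toy data as in the question; each row is a site.
species <- data.frame(
  sp1 = c(0, 1, 4, 4, 3, 3, 2, 1, 0, 2),
  sp2 = c(4, 3, 0, 0, 2, 2, 3, 2, 1, 3),
  sp3 = c(3, 2, 1, 1, 2, 2, 3, 2, 1, 3),
  sp4 = c(2, 4, 1, 0, 1, 2, 3, 4, 3, 1)
)

# Same clustering as in the question, kept as an hclust object so it can be cut.
hc <- hclust(dist(species, method = "euclidean"), method = "ward.D")

# Assumption: cut the tree into k = 3 groups. Pick k (or a height h) to match
# the level of the dendrogram you want to interpret.
assoc <- cutree(hc, k = 3)

# Indicator species analysis on the group labels. The p-values come from a
# permutation test, so fix the seed to make the run repeatable.
set.seed(1)
ind <- multipatt(species, assoc, func = "IndVal.g",
                 control = how(nperm = 999))

# Lists the species significantly associated with each group or combination
# of groups (with only 10 toy sites, few or none may reach significance).
summary(ind)
```

To look at a different level of the dendrogram, change k in cutree (or pass h instead) and rerun multipatt on the new group labels.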