I need to plot my data's community composition, but I'm running into a problem when selecting the 20 most abundant taxa.
I have multiple unidentified taxa at various taxonomic ranks in my taxonomy table (all under the same "unidentified" label). So when I select the top 20 taxa, for example, I end up with only 15 different taxa in the plot, because 6 of the top 20 are "unidentified" ones that share a single legend entry.
To select my taxa I ran the following bit of code:
library(phyloseq)
ps # phyloseq object
# agglomerate taxa at the desired level: Family
ps_family <- tax_glom(ps, taxrank = "Family")
# then select the number of taxa to plot: n = 20
family20 <- names(sort(taxa_sums(ps_family), decreasing = TRUE)[1:20])
# subset the agglomerated object (not the original ps) to the selected taxa
ps_20 <- prune_taxa(family20, ps_family)
# check the selected taxa
taxatab <- as.data.frame(tax_table(ps_20))
The head of the taxatab data frame looks something like this:
As you can see, there are multiple unidentified taxa, but they don't have the same taxonomy, which is why they were not agglomerated together at the Family level.
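To make the cause concrete, here is a toy illustration in base R. It assumes that tax_glom() groups taxa on their whole lineage down to the chosen rank rather than on the Family label alone; the matrix and ASV names below are made up and are not my real data.

```r
# Made-up taxonomy table: three taxa labelled "unidentified" at Family,
# spread over two different phyla, plus one identified family.
tt <- matrix(
  c("Fungi", "Ascomycota",    "unidentified",
    "Fungi", "Basidiomycota", "unidentified",
    "Fungi", "Ascomycota",    "unidentified",
    "Fungi", "Ascomycota",    "Nectriaceae"),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("ASV", 1:4), c("Kingdom", "Phylum", "Family"))
)

# Grouping key built from every rank down to Family (assumed behaviour)
lineage <- apply(tt, 1, paste, collapse = ";")

# Count how many separate groups carry the "unidentified" Family label
is_unid <- tt[, "Family"] == "unidentified"
n_unidentified_groups <- length(unique(lineage[is_unid]))
print(n_unidentified_groups)
```

ASV1 and ASV3 share a lineage and collapse into one group, while ASV2 stays separate because its Phylum differs, even though all three show the same Family label.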
So when I plot the composition it gives only 15 families:
Don't mind the hundreds of samples haha, that's another problem I know how to solve.
My problem stems from the agglomeration step tax_glom(), because the unidentified Families may have different Order, Class or Phylum, and won't be agglomerated together at the Family level.
This means that when I plot my community composition at the Family level, those different unidentified taxa are grouped under a single "unidentified" entry in the legend.
To be clear, I don't want 6 different "unidentified" entries in the legend. I want all the taxa that are unidentified at the Family rank to be agglomerated into one taxon, so that I get 20 different taxa in my plot.
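One direction I can imagine, as a rough sketch only: overwrite the higher ranks of every taxon that is "unidentified" at Family so they all share one lineage before agglomerating. The helper collapse_unidentified() is a hypothetical name of my own, not a phyloseq function, and the toy matrix is made up. The phyloseq calls are left as comments and assume a taxonomy table with a "Family" column.

```r
# Hypothetical helper: give every taxon whose `rank` equals `label` the same
# lineage from the first rank down to `rank`, so they share one grouping key.
collapse_unidentified <- function(tt, rank = "Family", label = "unidentified") {
  hit <- !is.na(tt[, rank]) & tt[, rank] == label
  upper <- colnames(tt)[seq_len(match(rank, colnames(tt)))]
  tt[hit, upper] <- label
  tt
}

# Toy check on a made-up taxonomy matrix
tt <- matrix(
  c("Fungi", "Ascomycota",    "unidentified",
    "Fungi", "Basidiomycota", "unidentified",
    "Fungi", "Ascomycota",    "unidentified",
    "Fungi", "Ascomycota",    "Nectriaceae"),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("ASV", 1:4), c("Kingdom", "Phylum", "Family"))
)
tt_fixed <- collapse_unidentified(tt)
n_groups <- length(unique(apply(tt_fixed, 1, paste, collapse = ";")))
print(n_groups)

# With a real phyloseq object `ps` (not run here):
# library(phyloseq)
# tt <- as(tax_table(ps), "matrix")
# tax_table(ps) <- tax_table(collapse_unidentified(tt))
# ps_family <- tax_glom(ps, taxrank = "Family")
# family20  <- names(sort(taxa_sums(ps_family), decreasing = TRUE)[1:20])
# ps_20     <- prune_taxa(family20, ps_family)
```

The obvious cost is that the Kingdom, Phylum, Class and Order of those taxa are overwritten, so this would only make sense on a copy of the object used for plotting.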
I would welcome any help you could provide me :)