Selecting top taxa in phyloseq without having multiple unidentified


I need to plot my data's community composition but I'm having a bit of a problem when selecting the 20 most abundant taxa.

I have multiple unidentified taxa at various taxonomic ranks in my taxonomy table (all under the same "unidentified" label). So when I select the top 20 taxa, for example, I end up with only 15 different taxa in the plot, because 6 of the top 20 taxa are "unidentified" ones and get drawn as a single category.

To select my taxa I ran the following bit of code:

ps  # phyloseq object
# agglomerate taxa at the desired level: Family
ps_family <- tax_glom(ps, taxrank = "Family")
# then select the number of taxa to plot: n = 20
family20 <- names(sort(taxa_sums(ps_family), decreasing = TRUE)[1:20])
# subset the agglomerated object (not the original ps) to the selected taxa
ps_20 <- prune_taxa(family20, ps_family)
# check selected taxa
taxatab <- as.data.frame(tax_table(ps_20))

The head of the taxatab data frame looks something like this:

[Image: taxtab dataframe]

As you can see, there are multiple unidentified taxa, but they don't share the same taxonomy at the higher ranks, which is why they were not agglomerated together at the Family level.

So when I plot the composition it gives only 15 families:

[Image: community_composition]

Don't mind the hundreds of samples haha, that's another problem I know how to solve.

My problem stems from the agglomeration step, tax_glom(): the unidentified Families may have a different Order, Class or Phylum, so they are not agglomerated together at the Family level and each one takes up its own slot in the top 20. When I then plot the community composition at the Family level, those separate unidentified taxa are all grouped under a single "unidentified" entry in the legend.

To be clear, I don't want 6 different "unidentified" entries in the legend. I want all the taxa that are unidentified at the Family rank to be agglomerated into one, so that I get 20 different taxa in my plot.
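A minimal sketch of one direction that might get there, under the assumption that tax_glom() groups taxa on their full lineage up to the chosen rank: if every rank above Family is also overwritten with "unidentified" for the taxa whose Family is "unidentified", their lineages become identical and they should collapse into one taxon. The helper collapse_unidentified() and the toy taxonomy table are made up for illustration and are not part of phyloseq. Note that this throws away the Phylum/Class/Order information of those taxa.

```r
# Hypothetical helper (not a phyloseq function): for every taxon whose `rank`
# equals `label`, overwrite that rank and all higher ranks with `label`,
# so the lineages are identical up to `rank`.
collapse_unidentified <- function(tax, rank = "Family", label = "unidentified") {
  tax <- as.matrix(tax)
  rank_idx <- match(rank, colnames(tax))
  stopifnot(!is.na(rank_idx))
  hit <- !is.na(tax[, rank_idx]) & tax[, rank_idx] == label
  tax[hit, seq_len(rank_idx)] <- label
  tax
}

# Toy taxonomy table (made-up data) standing in for as(tax_table(ps), "matrix")
toy_tax <- matrix(
  c("Ascomycota",    "Sordariomycetes", "Hypocreales",  "Nectriaceae",
    "Ascomycota",    "Sordariomycetes", "Hypocreales",  "unidentified",
    "Basidiomycota", "Agaricomycetes",  "unidentified", "unidentified",
    "unidentified",  "unidentified",    "unidentified", "unidentified"),
  ncol = 4, byrow = TRUE,
  dimnames = list(paste0("OTU", 1:4), c("Phylum", "Class", "Order", "Family"))
)

fixed_tax <- collapse_unidentified(toy_tax)

# Lineage strings up to Family; identical strings are what would be merged
lineage  <- apply(fixed_tax, 1, paste, collapse = ";")
n_groups <- length(unique(lineage))  # 2 here, versus 4 distinct lineages in toy_tax

# With the real phyloseq object `ps` (needs the phyloseq package, so left commented):
# tax_table(ps) <- tax_table(collapse_unidentified(as(tax_table(ps), "matrix")))
# ps_family <- tax_glom(ps, taxrank = "Family")
# family20  <- names(sort(taxa_sums(ps_family), decreasing = TRUE))[1:20]
# ps_20     <- prune_taxa(family20, ps_family)
```

If "unidentified" then lands in the top 20, it would occupy a single slot, leaving 19 named families plus one unidentified group.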

I would welcome any help you could provide me :)
