I'm working with a dataset in tibble format with 714 rows. Each row is a sequence that is specific to a given virus, and multiple sequences can come from the same virus.
For example, there are 21 B19 sequences in the data.
I want to add a new column to my tibble that groups all virus strains with low counts (fewer than 50) into one group ("Others"), while every strain with a high count stays in its own group, so that CMV remains CMV. In other words, every time a low-count strain occurs, 'newID' should be "Others" (see fig 1). Until now I have used 'mutate(newID = case_when(Origin == "CMV" ~ "CMV"))' and grouped the strains manually based on their counts (see Data figure), but there should be an easier, less hard-coded option, right?
Data:
1 B19 21
2 BKPyV 8
3 CMV 161
4 Covid-19 68
5 EBV 204
6 FLU-A 22
7 HAdV-C 10
8 hCoV 84
9 HHV-1 27
10 HHV-2 3
11 HHV-6B 1
12 HIV-1 18
13 HMPV 3
14 HPV 37
15 JCPyV 4
16 NWV 12
17 unknown 9
18 VACV 9
19 VZV 13
I hope you can help!
You can use fct_lump() from the forcats package (tidyverse). I am using the top 4 viruses based on your count:
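The answer's original code is not shown here, so the following is only a minimal sketch of the approach it describes. It assumes the column is called Origin (as in the question) and rebuilds a stand-in tibble `df` from the posted counts; `counts`, `df`, `newID_top4` and `newID_min50` are illustrative names. It uses fct_lump_n() and fct_lump_min(), the more specific variants of fct_lump() available in forcats 0.5.0 and later.

```r
library(dplyr)
library(forcats)
library(tibble)

# Stand-in data: one row per sequence, rebuilt from the counts in the question.
counts <- tribble(
  ~Origin,    ~n,
  "B19",      21,
  "BKPyV",     8,
  "CMV",     161,
  "Covid-19", 68,
  "EBV",     204,
  "FLU-A",    22,
  "HAdV-C",   10,
  "hCoV",     84,
  "HHV-1",    27,
  "HHV-2",     3,
  "HHV-6B",    1,
  "HIV-1",    18,
  "HMPV",      3,
  "HPV",      37,
  "JCPyV",     4,
  "NWV",      12,
  "unknown",   9,
  "VACV",      9,
  "VZV",      13
)
df <- tibble(Origin = rep(counts$Origin, times = counts$n))

df <- df %>%
  mutate(
    # Keep the 4 most frequent strains, lump everything else into "Others".
    newID_top4 = fct_lump_n(Origin, n = 4, other_level = "Others"),
    # Alternative: lump every strain that appears fewer than 50 times.
    newID_min50 = fct_lump_min(Origin, min = 50, other_level = "Others")
  )

# Tabulate the new grouping.
count(df, newID_min50)
```

With these counts both variants keep the same four strains (EBV, CMV, hCoV and Covid-19), since those are exactly the ones with at least 50 sequences. The fct_lump_min() version is probably closer to the "lower than 50 counts" rule in the question, because it does not need to be adjusted if the number of frequent strains changes.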
Output is:
I used: