I have two dataframes, both derived from a third, larger dataset. I want to normalize the data in one dataframe according to the entries in the second dataframe. I'd prefer to use dplyr, but other packages/solutions are very welcome, too :)
In my first dataframe, I have the counts of different organs.
Dataframe organ_count:
# A tibble: 5 x 2
  organs  count
  <fctr>  <int>
1 Organ_A    23
2 Organ_B    29
3 Organ_C    24
4 Organ_D   145
5 Organ_E    97
In my second dataframe, I have the counts of the same organs, but split by the state in which they appear in the large dataset I used as a source.
Dataframe organ_state_count:
# A tibble: 15 x 3
   organs  hmm_state count
   <fctr>  <chr>     <int>
 1 Organ_A E1           12
 2 Organ_A E2            2
 3 Organ_A E3            9
 4 Organ_B E1           13
 5 Organ_B E2           10
 6 Organ_B E3            6
 7 Organ_C E1            7
 8 Organ_C E2            7
 9 Organ_C E3           10
10 Organ_D E1           72
11 Organ_D E2           23
12 Organ_D E3           50
13 Organ_E E1           90
14 Organ_E E2            2
15 Organ_E E3            5
What I want to do now is:
I want to divide organ_state_count$count by the total number of entries for this organ (given in organ_count$count), resulting in the percentage of this organ's entries that fall into the given state.
I already tried something like this:
organ_state_count %>%
  rowwise() %>%
  do(organ_total = filter(organ_count, organs == .$organs)) %>%
  mutate(organ_norm = .$count / organ_total)
But it throws this error message:
Error in mutate_impl(.data, dots) :
Evaluation error: arguments imply differing number of rows: 1, 0.
In addition: Warning messages:
1: Unknown or uninitialised column: 'count'.
2: In Ops.factor(left, right) : ‘/’ not meaningful for factors
I must admit I'm fairly new to R and to the whole dplyr/tidyverse thing as well, so I'm a bit overwhelmed.
I also suspect it is possible to use just the organ_state_count frame for this task and solve everything in one dataframe, but I'm not sure how.
Thanks for your answers and help!
You can try something like:
There's no need to use the first dataframe, as you have that information in the second dataframe already. Just select the columns you want to use for the final output.
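A minimal sketch of that idea, assuming the counts shown in the question: group by organ, take the per-organ total with sum(), and divide each state count by it. The column name organ_total and the object name result are only illustrative, and the data frame is re-typed here just so the snippet runs on its own.

```r
library(dplyr)

# Counts per organ and state, re-typed from the question
organ_state_count <- data.frame(
  organs    = rep(paste0("Organ_", LETTERS[1:5]), each = 3),
  hmm_state = rep(c("E1", "E2", "E3"), times = 5),
  count     = c(12, 2, 9, 13, 10, 6, 7, 7, 10, 72, 23, 50, 90, 2, 5)
)

result <- organ_state_count %>%
  group_by(organs) %>%                    # one group per organ
  mutate(
    organ_total = sum(count),             # total entries for this organ
    organ_norm  = count / organ_total     # share of this organ's entries in this state
  ) %>%
  ungroup() %>%
  select(organs, hmm_state, organ_norm)   # keep only the columns needed

result
```

organ_norm is a fraction between 0 and 1; multiply by 100 if you want an actual percentage. The rowwise()/do() attempt in the question is not needed for this, since a grouped mutate() already sees all rows of an organ at once.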
data:
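One way to rebuild both tables from the question, in case you want to reproduce this (a sketch; the values are copied from the printed tibbles above, the construction itself is my assumption about how they could be created). The check at the end confirms that the per-organ sums of organ_state_count reproduce organ_count, which is why the first dataframe is redundant. The helper name per_organ is only illustrative.

```r
library(tibble)

organ_count <- tibble(
  organs = factor(paste0("Organ_", LETTERS[1:5])),
  count  = c(23L, 29L, 24L, 145L, 97L)
)

organ_state_count <- tibble(
  organs    = factor(rep(paste0("Organ_", LETTERS[1:5]), each = 3)),
  hmm_state = rep(c("E1", "E2", "E3"), times = 5),
  count     = c(12L, 2L, 9L, 13L, 10L, 6L, 7L, 7L, 10L,
                72L, 23L, 50L, 90L, 2L, 5L)
)

# Sanity check: summing the state counts per organ gives back organ_count$count
per_organ <- tapply(organ_state_count$count, organ_state_count$organs, sum)
stopifnot(all(per_organ == organ_count$count))
```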
output: