I have over a hundred variables for which I'm trying to calculate frequency and percent. How can I maintain the factor order of each variable's values in the output? Please note that specifying the order for each variable by hand, outside the dataset, is not practical because there are more than 100 of them.
Example data:
df <- data.frame(gender=factor(c("male", "female", "male", NA), levels=c("male", "female")),
disease=factor(c("yes","yes","no", NA), levels=c("yes", "no")))
df
  gender disease
1   male     yes
2 female     yes
3   male      no
4   <NA>    <NA>
Attempt:
library(dplyr)
library(tidyr)
df %>%
  gather(key, value, factor_key = TRUE) %>%
  group_by(key, value) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  group_by(key) %>%
  mutate(percent = n / sum(n))
Output:
# A tibble: 6 x 4
# Groups:   key [2]
  key     value      n percent
  <fct>   <chr>  <int>   <dbl>
1 gender  female     1    0.25
2 gender  male       2    0.5
3 gender  NA         1    0.25
4 disease no         1    0.25
5 disease yes        2    0.5
6 disease NA         1    0.25
The desired output would order the values of gender as male, female and the values of disease as yes, no, matching the factor levels.
Update: if you use pivot_longer (the successor to gather), the value column stays a factor and keeps the factor levels. You can also fine-tune the column types with the names_transform and values_transform arguments of pivot_longer.
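A minimal sketch of the pivot_longer approach, using the example data from the question. It assumes tidyr 1.0.0 or later and that every column is a factor; the names `key`, `value` and `res` are only illustrative. The combined value column should carry the union of the levels in order of first appearance (here male, female, yes, no), so if two variables share a label in a different order, only the first variable's order is kept.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  gender  = factor(c("male", "female", "male", NA), levels = c("male", "female")),
  disease = factor(c("yes", "yes", "no", NA), levels = c("yes", "no"))
)

res <- df %>%
  # Stack all columns; the factor columns are combined into one factor
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  # Keep the variables in their original column order rather than alphabetical
  mutate(key = factor(key, levels = names(df))) %>%
  # count() sorts by factor level, with NA last within each key
  count(key, value) %>%
  group_by(key) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup()

res
```

Because the ordering comes from the factor levels already stored in the data, nothing has to be specified per variable, which is what makes this workable for 100+ columns.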
Created on 2020-10-16 by the reprex package (v0.3.0)
Because gather drops the factor class of the value variable (it becomes character), and summarise also appears to drop data frame attributes, you'll have to re-add the levels afterwards. You can do this in a semi-automated way by reading the levels from the original columns and combining them into a single set of levels.
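One way this could look with gather, as a sketch rather than the only option: collect the levels of every column in column order, then rebuild the factor after gathering. It assumes all columns are factors; `all_levels` and `res` are hypothetical names introduced here. gather will typically warn that attributes are not identical across the measure variables, which is the factor class being dropped.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  gender  = factor(c("male", "female", "male", NA), levels = c("male", "female")),
  disease = factor(c("yes", "yes", "no", NA), levels = c("yes", "no"))
)

# Read the levels of every column and combine them, preserving their order
all_levels <- unique(unlist(lapply(df, levels), use.names = FALSE))

res <- df %>%
  # value comes back as character; key is a factor in column order
  gather(key, value, factor_key = TRUE) %>%
  # Re-add the combined levels so that sorting follows the original order
  mutate(value = factor(value, levels = all_levels)) %>%
  count(key, value) %>%
  group_by(key) %>%
  mutate(percent = n / sum(n)) %>%
  ungroup()

res
```

As with any approach that merges levels into one set, a label shared by two variables takes the position it has in the first variable that defines it.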