I am trying to get grouped summary statistics of multiple variables conditional on other different columns. For example, I have three total difference variables (n.diff.total) and three variables indicating each variable should be excluded from the summary statistics or not (na.diff.total = 1, meaning that the observation should be excluded from the calculation). The summary statistics are grouped by id1 variable. Would there be a better and more efficient way to get these values than my current code below?
Sample data frame
set.seed(100)
df <-
data.frame(
id1 = c(rep('A', 10), rep('B', 10)),
id2 = stri_rand_strings(20, 1),
n.diff.total_rare = sample(0:30, 20, replace=TRUE),
n.diff.total_general = sample(0:30, 20, replace=TRUE),
n.diff.total_specialty = sample(0:30, 20, replace=TRUE),
na.diff.total_rare = sample(0:1, 20, replace=TRUE),
na.diff.total_general = sample(0:1, 20, replace=TRUE),
na.diff.total_specialty = sample(0:1, 20, replace=TRUE)
)
Current code and sample output
output_rare <-
df %>%
select(id1, id2, n.diff.total_rare, na.diff.total_rare) %>%
filter(na.diff.total_rare == 0) %>%
mutate(zero = ifelse(n.diff.total_rare == 0, 1, 0)) %>%
group_by(id1) %>%
summarise(
min = min(n.diff.total_rare, na.rm = T),
max = max(n.diff.total_rare, na.rm = T),
sd = sd(n.diff.total_rare, na.rm = T),
mean = mean(n.diff.total_rare, na.rm = T),
zeros = sum(zero, na.rm = T)
) %>%
ungroup %>%
mutate(variable = 'n.diff.total_rare')
output_specialty <-
df %>%
select(id1, id2, n.diff.total_specialty, na.diff.total_specialty) %>%
filter(na.diff.total_specialty == 0) %>%
mutate(zero = ifelse(n.diff.total_specialty == 0, 1, 0)) %>%
group_by(id1) %>%
summarise(
min = min(n.diff.total_specialty, na.rm = T),
max = max(n.diff.total_specialty, na.rm = T),
sd = sd(n.diff.total_specialty, na.rm = T),
mean = mean(n.diff.total_specialty, na.rm = T),
zeros = sum(zero, na.rm = T)
) %>%
ungroup %>%
mutate(variable = 'n.diff.total_specialty')
output_general <-
df %>%
select(id1, id2, n.diff.total_general, na.diff.total_general) %>%
filter(na.diff.total_general == 0) %>%
mutate(zero = ifelse(n.diff.total_general == 0, 1, 0)) %>%
group_by(id1) %>%
summarise(
min = min(n.diff.total_general, na.rm = T),
max = max(n.diff.total_general, na.rm = T),
sd = sd(n.diff.total_general, na.rm = T),
mean = mean(n.diff.total_general, na.rm = T),
zeros = sum(zero, na.rm = T)
) %>%
ungroup %>%
mutate(variable = 'n.diff.total_general')
output <-
output_rare %>%
rbind(
output_specialty
) %>%
rbind(
output_general
)
To combine the three steps into one and output a single dataframe in one go, you can use
pivot_s to group by your three variables:Which gives the same values as your bound dataframes above.