R: Grouped summary statistics of multiple variables conditional on other different columns

303 Views Asked by At

I am trying to get grouped summary statistics of multiple variables conditional on other different columns. For example, I have three total difference variables (n.diff.total) and three variables indicating each variable should be excluded from the summary statistics or not (na.diff.total = 1, meaning that the observation should be excluded from the calculation). The summary statistics are grouped by id1 variable. Would there be a better and more efficient way to get these values than my current code below?

Sample data frame

 set.seed(100)
  df <- 
    data.frame(
      id1 = c(rep('A', 10), rep('B', 10)),
      id2 = stri_rand_strings(20, 1),
      n.diff.total_rare = sample(0:30, 20, replace=TRUE),
      n.diff.total_general = sample(0:30, 20, replace=TRUE),
      n.diff.total_specialty = sample(0:30, 20, replace=TRUE),
      na.diff.total_rare = sample(0:1, 20, replace=TRUE),
      na.diff.total_general = sample(0:1, 20, replace=TRUE),
      na.diff.total_specialty = sample(0:1, 20, replace=TRUE)
    )
  

Current code and sample output

 output_rare <-
    df %>% 
    select(id1, id2, n.diff.total_rare, na.diff.total_rare) %>% 
    filter(na.diff.total_rare == 0) %>% 
    mutate(zero = ifelse(n.diff.total_rare == 0, 1, 0)) %>% 
    group_by(id1) %>% 
    summarise(
      min = min(n.diff.total_rare, na.rm = T),
      max = max(n.diff.total_rare, na.rm = T),
      sd = sd(n.diff.total_rare, na.rm = T),
      mean = mean(n.diff.total_rare, na.rm = T),
      zeros = sum(zero,  na.rm = T)
    ) %>% 
    ungroup %>% 
    mutate(variable = 'n.diff.total_rare')
  
  output_specialty <-
    df %>% 
    select(id1, id2, n.diff.total_specialty, na.diff.total_specialty) %>% 
    filter(na.diff.total_specialty == 0) %>% 
    mutate(zero = ifelse(n.diff.total_specialty == 0, 1, 0)) %>% 
    group_by(id1) %>% 
    summarise(
      min = min(n.diff.total_specialty, na.rm = T),
      max = max(n.diff.total_specialty, na.rm = T),
      sd = sd(n.diff.total_specialty, na.rm = T),
      mean = mean(n.diff.total_specialty, na.rm = T),
      zeros = sum(zero,  na.rm = T)
    ) %>% 
    ungroup %>% 
    mutate(variable = 'n.diff.total_specialty')
  
  output_general <-
    df %>% 
    select(id1, id2, n.diff.total_general, na.diff.total_general) %>% 
    filter(na.diff.total_general == 0) %>% 
    mutate(zero = ifelse(n.diff.total_general == 0, 1, 0)) %>% 
    group_by(id1) %>% 
    summarise(
      min = min(n.diff.total_general, na.rm = T),
      max = max(n.diff.total_general, na.rm = T),
      sd = sd(n.diff.total_general, na.rm = T),
      mean = mean(n.diff.total_general, na.rm = T),
      zeros = sum(zero,  na.rm = T)
    ) %>% 
    ungroup %>% 
    mutate(variable = 'n.diff.total_general')
  
  output <- 
    output_rare %>% 
    rbind(
      output_specialty
    ) %>% 
    rbind(
      output_general
    )
2

There are 2 best solutions below

0
Andy Baxter On

To combine the three steps into one and output a single dataframe in one go, you can use pivot_s to group by your three variables:

library(tidyverse)

set.seed(100)
df <- 
  tibble(
    id1 = c(rep('A', 10), rep('B', 10)),
    id2 = sample(letters, 20),
    n.diff.total_rare = sample(0:30, 20, replace=TRUE),
    n.diff.total_general = sample(0:30, 20, replace=TRUE),
    n.diff.total_specialty = sample(0:30, 20, replace=TRUE),
    na.diff.total_rare = sample(0:1, 20, replace=TRUE),
    na.diff.total_general = sample(0:1, 20, replace=TRUE),
    na.diff.total_specialty = sample(0:1, 20, replace=TRUE)
  )


df |> 
  pivot_longer(!starts_with("id"), names_to = c("metric", "variable"), 
               values_to = "val", names_sep = "_") |> 
  pivot_wider(names_from = metric, values_from = val) |> 
  filter(na.diff.total == 0) %>% 
  mutate(zero = ifelse(n.diff.total == 0, 1, 0)) %>% 
  group_by(id1, variable) %>% 
  summarise(
    min = min(n.diff.total, na.rm = T),
    max = max(n.diff.total, na.rm = T),
    sd = sd(n.diff.total, na.rm = T),
    mean = mean(n.diff.total, na.rm = T),
    zeros = sum(zero,  na.rm = T),
    .groups = "drop"
  ) |> 
  arrange(variable, id1)
#> # A tibble: 6 × 7
#>   id1   variable    min   max    sd  mean zeros
#>   <chr> <chr>     <int> <int> <dbl> <dbl> <dbl>
#> 1 A     general      22    30  3.42  26.5     0
#> 2 B     general       6    23  7.37  12.7     0
#> 3 A     rare          1    29 10.6   18.2     0
#> 4 B     rare          1    29 10.1   15.2     0
#> 5 A     specialty     8    23  6.76  15.5     0
#> 6 B     specialty     3    29 10.9   15.7     0

Which gives the same values as your bound dataframes above.

0
Ben On

This might be easier to work with in long form.

First, add a column to indicate the row number, since values in the same row are referred to in filtering.

Then, use pivot_longer to put into long form. The function extract can separate your labels to refer to "n" or "na" as well as type of total (e.g., "total_rare").

Hope this helps.

library(tidyverse)

df %>%
  mutate(rn = row_number()) %>%
  pivot_longer(cols = -c(id1, id2, rn)) %>%
  extract(col = name, into = c("diff", "total"), regex = "^(\\w+).diff.(\\w+)") %>%
  filter(any(value == 0 & diff == "na"), .by = c(id1, id2, total, rn)) %>%
  filter(diff == "n") %>%
  summarise(
    min = min(value, na.rm = T),
    max = max(value, na.rm = T),
    sd = sd(value, na.rm = T),
    mean = mean(value, na.rm = T),
    zeros = sum(value == 0, na.rm = T),
    .by = c(id1, total)
  )

Output

  id1   total             min   max    sd  mean zeros
  <chr> <chr>           <int> <int> <dbl> <dbl> <int>
1 A     total_rare          2    24  8.31  15.5     0
2 A     total_general       1    29 10.1   15.2     0
3 A     total_specialty     6    23  6.95  12       0
4 B     total_rare          1    29 11.6   16.8     0
5 B     total_general      22    30  3.42  26.5     0
6 B     total_specialty     3    23  8.77  14.2     0