calculate annual cases after reshaping dataframe in R

47 Views Asked by At

I am working in R and I want to calculate the last column Annual_Cases based on the available years for each surgeon. The variable Years was reported per each 3-year time interval. Some surgeons were working in more than 1 hospital.

Here is my code that needs edits and my data

library(dplyr);library(tidyr)
df %>% drop_na(Cases) %>% group_by(Surgeon)%>% summarise(Annaul_Cases= sum(Cases))/n,  na.rm=T)-> df # I need to calculate n column (number of years the surgeon were in service)


structure(list(Serial.ID = c(215, 522, 903, 1210, 1591, 1898, 
2279, 2586, 2967, 3274, 3655, 3962, 4343, 4650, 5031, 5338, 330, 
1018, 1706, 2394, 3082, 3770, 4458, 5146), Surgeon = c("A", "A", 
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "B", "B", "B", "B", "B", "B", "B", "B"), Center = c("Maimonides Medical Center", 
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center", 
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center", 
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center", 
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center", 
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center", 
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center", 
"St. Vincents Hospital and Medical Center", "Maimonides Medical Center", 
"St. Vincents Hospital and Medical Center", "Mount Sinai Hospital", 
"Mount Sinai Hospital", "Mount Sinai Hospital", "Mount Sinai Hospital", 
"Mount Sinai Hospital", "Mount Sinai Hospital", "Mount Sinai Hospital", 
"Mount Sinai Hospital"), Years = c("1996-1998", "1996-1998", 
"1999-2001", "1999-2001", "2002-2004", "2002-2004", "2005-2007", 
"2005-2007", "2008-2010", "2008-2010", "2011-2013", "2011-2013", 
"2014-2016", "2014-2016", "2017-2019", "2017-2019", "1996-1998", 
"1999-2001", "2002-2004", "2005-2007", "2008-2010", "2011-2013", 
"2014-2016", "2017-2019"), Cases = c(377, 19, 223, NA, 27, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 110, 18, 13, 
17, 4, 1), Deaths = c("10", "1", "6", "NA", "0", "NA", "NA", 
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", 
"1", "0", "0", "0", "0", "0"), Annual_Cases = c("NA", "NA", "NA", 
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", 
"NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA")), row.names = c(NA, 
-24L), class = c("tbl_df", "tbl", "data.frame"))

Here is my desired output and my used manual formula for that.
enter image description here

Any advice will be greatly appreciated

3

There are 3 best solutions below

0
zephryl On BEST ANSWER

First summarize to get sums for each year span, then summarize again to aggregate over all year spans.

library(dplyr)
library(tidyr)

df %>% 
  drop_na(Cases) %>% 
  summarize(Cases = sum(Cases), .by = c(Surgeon, Years)) %>%
  summarize(Annual_Cases = sum(Cases) / (n() * 3), .by = Surgeon)

Result:

# A tibble: 2 × 2
  Surgeon Annual_Cases
  <chr>          <dbl>
1 A              71.8 
2 B               9.06

This solution assumes each year span is always 3 years. If not, you’ll need to use an approach like @JonSpring’s using separate().

0
Adriano Mello On

You can use dplyr::n_distinct() to get the number of unique triennium. Since there's already answers with summarise, I´ll use mutate, so the output will look more like your spreadsheet.

library(tidyverse)

df1 <- df %>% 
  filter(!is.na(Cases)) %>% 
  mutate(
    .by = c(Surgeon),
    Annual_Cases = sum(Cases, na.rm = TRUE)/(3 * n_distinct(Years)))

# ---------------   
> df1 %>% 
    select(1:2, last_col()) %>% 
    slice_head(by = Surgeon, n = 3)

# A tibble: 6 × 3
  Serial.ID Surgeon Annual_Cases
      <dbl> <chr>          <dbl>
1       215 A              71.8 
2       522 A              71.8 
3       903 A              71.8 
4      1706 B               9.06
5      2394 B               9.06
6      3082 B               9.06
0
Friede On

If Years always spans over 3 years, try:

> X = aggregate(Cases ~ Surgeon + Years, data = df1, sum) 
> tapply(X$Cases, X$Surgeon, \(x) round(sum(x) / (length(x) * 3L)))

 A  B 
72  9