Collapse rows that match in more than one column

64 Views Asked by At

I have a data frame where in the first column I have a Patient ID and in the second column I have a Date (date time) when a specific test was taken. The columns thereafter are columns of the specific tests (b, a, c, d). At the moment, for every Patient ID and every Date I have a row with the results of the specific test and NA in the columns of all other tests.

enter image description here

I want to collapse the data frame in a way that whenever Patient ID and Date match, the test results are collapsed into one row only.

I have tried aggregate(), collapse(), and group_by(), but did not manage to do this. The functions are only activated when Patient ID and Date match, not just one of the columns.

2

There are 2 best solutions below

0
Friede On

Data copied from picture:

X = read.table(text = "PatientID Date b a c d
               1 2024-03-0420:16:11 0.4 NA NA NA 
               1 2024-03-0422:40:11 NA TRUE NA NA 
               1 2024-03-0522:40:11 NA NA FALSE NA
               1 2024-03-0622:40:11 NA NA NA Strong
               2 2024-03-0422:40:11 NA NA NA Weak
               2 2024-03-0522:40:11 NA NA FALSE NA
               3 2024-03-0622:40:11 0.6 NA NA NA
               3 2024-03-0622:40:11 NA FALSE NA NA", header = TRUE)

Extract Date (please specify tz (time zone)). You probably need to adapt format. This step is needed due the bad data structure (I needed to copy the data from picture). Then aggregate:

X$Date = as.Date(as.POSIXct(X$Date, format = "%Y-%m-%d%H:%M:%S"), tz = "")
aggregate(. ~ PatientID + Date, data = X, FUN = na.omit, na.action = "na.pass")

gives

  PatientID       Date   b     a     c      d
1         1 2024-03-04 0.4  TRUE             
2         2 2024-03-04                   Weak
3         1 2024-03-05           FALSE       
4         2 2024-03-05           FALSE       
5         1 2024-03-06                 Strong
6         3 2024-03-06 0.6 FALSE             

If you don't like the order, change it. E.g. PatientID in ascending order as the raw data (X) is organised:

> Y = aggregate(. ~ PatientID + Date, data = X, FUN = na.omit, na.action = "na.pass")
> Y[order(Y$PatientID), ]
  PatientID       Date   b     a     c      d
1         1 2024-03-04 0.4  TRUE             
3         1 2024-03-05           FALSE       
5         1 2024-03-06                 Strong
2         2 2024-03-04                   Weak
4         2 2024-03-05           FALSE       
6         3 2024-03-06 0.6 FALSE     
2
Chris Ruehlemann On

Not exactly sure what you need; here's a tidyverse solution:

X %>%
  group_by(PatientID, Date) %>%
  summarise(
            a = str_c(a[!is.na(a)], collapse = ","),
            b = str_c(b[!is.na(b)], collapse = ","),
            c = str_c(c[!is.na(c)], collapse = ","),
            d = str_c(d[!is.na(d)], collapse = ",")) %>%
  mutate(across(a:d, ~na_if(., "")))
# A tibble: 7 × 6
# Groups:   PatientID [3]
  PatientID Date               a     b     c     d     
      <int> <chr>              <chr> <chr> <chr> <chr> 
1         1 2024-03-0420:16:11 NA    0.4   NA    NA    
2         1 2024-03-0422:40:11 TRUE  NA    NA    NA    
3         1 2024-03-0522:40:11 NA    NA    FALSE NA    
4         1 2024-03-0622:40:11 NA    NA    NA    Strong
5         2 2024-03-0422:40:11 NA    NA    NA    Weak  
6         2 2024-03-0522:40:11 NA    NA    FALSE NA    
7         3 2024-03-0622:40:11 FALSE 0.6   NA    NA 

Data (thks to @Friede):

X = read.table(text = "PatientID Date b a c d
               1 2024-03-0420:16:11 0.4 NA NA NA 
               1 2024-03-0422:40:11 NA TRUE NA NA 
               1 2024-03-0522:40:11 NA NA FALSE NA
               1 2024-03-0622:40:11 NA NA NA Strong
               2 2024-03-0422:40:11 NA NA NA Weak
               2 2024-03-0522:40:11 NA NA FALSE NA
               3 2024-03-0622:40:11 0.6 NA NA NA
               3 2024-03-0622:40:11 NA FALSE NA NA", header = TRUE)