Identifying and consolidating duplicated observations

46 Views Asked by At

I'm working on the data for a clinical trial. In this data, there are multiple observations that refer to the same patient and they have varying degrees of missing data. I have four variables which I'm using as identifiers and I've decided that, if a patient has identical values for three out of the four variables, that he/she should be considered a duplicate.

The degree of missing data in all of these variables is extremely low, so there's not an issue of a patient having three NA's and being considered a duplicate of another one with three NA's.

I've been able to remove the duplicates by creating the combinations of variables and using the dplyr::distinct() . Unfortunately, this leads to one of the observations being kept, which might have missing data that an observation for the same patient had, but was excluded. I'll demonstrate with some fake data:

library(tidyverse)
library(knitr)

tibble(Name = c("Pedro", "Pedro","Pedro","Cristina","Walter","Sara","Mateus"),
       Birth = dmy(c("29/07/1994","29/07/1994","29/07/1994","01/04/1960", "22/12/1956", "20/02/1997","25/07/1994")),
       CNS = c("700",NA,"700","701","702","703","704"),
       Document = c("104","104",NA,"105","106","107","108"),
       SystolicBP1 = c(NA,NA,120,160,152,114,NA),
       DiastolicBP1 = c(NA,NA,80,100,100,92,NA),
       SystolicBP2 = c(NA,NA,NA,148,NA,NA,100),
       DiastolicBP2= c(NA,NA,NA,90,NA,NA,82),
       HBA1c = c(7,7,7,8.2,8,9,6.5)) -> dt

kable(dt)
Name Birth CNS Document SystolicBP1 DiastolicBP1 SystolicBP2 DiastolicBP2 HBA1c
Pedro 1994-07-29 700 104 NA NA NA NA 7.0
Pedro 1994-07-29 NA 104 NA NA NA NA 7.0
Pedro 1994-07-29 700 NA 120 80 NA NA 7.0
Cristina 1960-04-01 701 105 160 100 148 90 8.2
Walter 1956-12-22 702 106 152 100 NA NA 8.0
Sara 1997-02-20 703 107 114 92 NA NA 9.0
Mateus 1994-07-25 704 108 NA NA 100 82 6.5

I can then run the following to get rid of the duplicates, but I end up losing some data:

identifiers <- c("Name","Birth","CNS","Document")

combn(identifiers, 3) -> comb

kable(comb)
Name Name Name Birth
Birth Birth CNS CNS
CNS Document Document Document
## this takes the result of distinct() for each 
## combination of the 4 identifiers, taken 3 at a time
dt %>% 
        distinct (!!! syms(comb[,1]), .keep_all = TRUE) %>%
        distinct (!!! syms(comb[,2]), .keep_all = TRUE) %>%
        distinct (!!! syms(comb[,3]), .keep_all = TRUE) %>%
        distinct (!!! syms(comb[,4]), .keep_all = TRUE) %>%
        kable()
Name Birth CNS Document SystolicBP1 DiastolicBP1 SystolicBP2 DiastolicBP2 HBA1c
Pedro 1994-07-29 700 104 NA NA NA NA 7.0
Cristina 1960-04-01 701 105 160 100 148 90 8.2
Walter 1956-12-22 702 106 152 100 NA NA 8.0
Sara 1997-02-20 703 107 114 92 NA NA 9.0
Mateus 1994-07-25 704 108 NA NA 100 82 6.5

My goal is, when faced with missing data, to try to get the missing data from the duplicate observations. If I could do this, in the example above the first observation would've gotten the values for "SystolicBP1" and "DiastolicBP1" from the third observation in the original tibble.

2

There are 2 best solutions below

2
jay.sf On

We could create strings out of the first four ident-columns and run hclust based on string distances from adist to get ids. Next, using by, we split the data along these ids and reduce to one row, leaving either first value or NA.

> cl_fun <- \(data, h., ident=c("Name", "Birth", "CNS", 'Document')) {
+   strings <- Reduce(paste0, data[ident])  ## create ident strings
+   ld  <- adist(tolower(strings))  ## distance matrix
+   hc <- hclust(as.dist(ld))  ## cluster analysis
+   id <- cutree(hc, h=h.)  ## cut tree to groups
+   within(data, id <- id)  ## add id to data
+ }
> na_fun <- \(.) {
+   o <- vapply(., \(x) {
+     o <- if (all(is.na(x))) {
+       NA_character_
+     } else {
+       na.omit(as.character(x))[[1]]
+     }
+   }, FUN.VALUE=character(1))
+   o <- type.convert(as.data.frame(t(o)), as.is=TRUE)
+   within(o, {Birth <- as.Date(Birth)})
+ }
> by(cl_fun(df, h.=8.0), ~id, na_fun) |> do.call(what='rbind')
      Name      Birth CNS Document SystolicBP1 DiastolicBP1 SystolicBP2 DiastolicBP2 HBA1c id
1    Pedro 1994-07-29 700      104         120           80         100           NA   7.0  1
2 Cristina 1960-04-01 701      105         160          100         148           90   8.2  2
3   Walter 1956-12-22 702      106         152          100          NA           NA   8.0  3
4     Sara 1997-02-20 703      107         114           92          NA           NA   9.0  4
5   Mateus 1994-07-25 704      108          NA           NA         100           82   6.5  5
6      Ana       <NA>  NA       NA         130           94         100           89    NA  6

Ordep is considered as a duplicate of Pedro, and Laura is considered as a duplicate of Ana, which is fine according to your statements. As tuning parameter in cl_fun() I use h from cutree(), which specifies the height if you plot hc, you could also try k instead, which would specify the number of clusters with the disadvantage, though, that k is an integer and less flexible. You could also specify more arguments of adist() such as costs to fine-tune. In this h=8.0 produced the wanted result.


Data:

> dput(df)
structure(list(Name = c("Pedro", "Pedro", "Pedro", "Ordep", "Cristina", 
"Walter", "Sara", "Mateus", "Ana", "Ana", "Laura"), Birth = structure(c(8975, 
8975, 8975, 8975, -3562, -4758, 9912, 8971, NA, NA, NA), class = "Date"), 
    CNS = c("700", "<NA>", "700", "700", "701", "702", "703", 
    "704", NA, NA, NA), Document = c("104", "104", "<NA>", "104", 
    "105", "106", "107", "108", NA, NA, NA), SystolicBP1 = c(NA, 
    NA, 120L, NA, 160L, 152L, 114L, NA, NA, 130L, 120L), DiastolicBP1 = c(NA, 
    NA, 80L, NA, 100L, 100L, 92L, NA, 94L, NA, 96L), SystolicBP2 = c(NA, 
    NA, NA, 100L, 148L, NA, NA, 100L, 100L, 110L, 110L), DiastolicBP2 = c(NA, 
    NA, NA, NA, 90L, NA, NA, 82L, 89L, NA, 91L), HBA1c = c(7, 
    7, 7, NA, 8.2, 8, 9, 6.5, NA, NA, NA)), row.names = c(NA, 
-11L), class = "data.frame")

Looks like:

> df
       Name      Birth  CNS Document SystolicBP1 DiastolicBP1 SystolicBP2 DiastolicBP2 HBA1c
1     Pedro 1994-07-29  700      104          NA           NA          NA           NA   7.0
2     Pedro 1994-07-29 <NA>      104          NA           NA          NA           NA   7.0
3     Pedro 1994-07-29  700     <NA>         120           80          NA           NA   7.0
4     Ordep 1994-07-29  700      104          NA           NA         100           NA    NA
5  Cristina 1960-04-01  701      105         160          100         148           90   8.2
6    Walter 1956-12-22  702      106         152          100          NA           NA   8.0
7      Sara 1997-02-20  703      107         114           92          NA           NA   9.0
8    Mateus 1994-07-25  704      108          NA           NA         100           82   6.5
9       Ana       <NA> <NA>     <NA>          NA           94         100           89    NA
10      Ana       <NA> <NA>     <NA>         130           NA         110           NA    NA
11    Laura       <NA> <NA>     <NA>         120           96         110           91    NA
3
r2evans On

If what you're trying to do is take the first non-NA value from each group, then we can do

library(dplyr)
dt |>
  summarize(
    .by = c(Name, Birth),
    across(everything(), ~ na.omit(.x)[1])
  )
# # A tibble: 5 × 9
#   Name     Birth      CNS   Document SystolicBP1 DiastolicBP1 SystolicBP2 DiastolicBP2 HBA1c
#   <chr>    <date>     <chr> <chr>          <dbl>        <dbl>       <dbl>        <dbl> <dbl>
# 1 Pedro    1994-07-29 700   104              120           80          NA           NA   7  
# 2 Cristina 1960-04-01 701   105              160          100         148           90   8.2
# 3 Walter   1956-12-22 702   106              152          100          NA           NA   8  
# 4 Sara     1997-02-20 703   107              114           92          NA           NA   9  
# 5 Mateus   1994-07-25 704   108               NA           NA         100           82   6.5

This (silently) discards multiple rows for a Name, Birth pair that may have differing values (not present in the sample data). We can show this by changing one value and seeing that it is dropped:

dt2 <- mutate(dt, Document = replace(Document, 1, "999"))
head(dt2, 3)
# # A tibble: 3 × 9
#   Name  Birth      CNS   Document SystolicBP1 DiastolicBP1 SystolicBP2 DiastolicBP2 HBA1c
#   <chr> <date>     <chr> <chr>          <dbl>        <dbl>       <dbl>        <dbl> <dbl>
# 1 Pedro 1994-07-29 700   999               NA           NA          NA           NA     7
# 2 Pedro 1994-07-29 NA    104               NA           NA          NA           NA     7
# 3 Pedro 1994-07-29 700   NA               120           80          NA           NA     7

dt2 |>
  summarize(
    .by = c(Name, Birth),
    across(everything(), ~ na.omit(.x)[1])
  )
# # A tibble: 5 × 9
#   Name     Birth      CNS   Document SystolicBP1 DiastolicBP1 SystolicBP2 DiastolicBP2 HBA1c
#   <chr>    <date>     <chr> <chr>          <dbl>        <dbl>       <dbl>        <dbl> <dbl>
# 1 Pedro    1994-07-29 700   999              120           80          NA           NA   7  
# 2 Cristina 1960-04-01 701   105              160          100         148           90   8.2
# 3 Walter   1956-12-22 702   106              152          100          NA           NA   8  
# 4 Sara     1997-02-20 703   107              114           92          NA           NA   9  
# 5 Mateus   1994-07-25 704   108               NA           NA         100           82   6.5

If this is the case, we can make the process a little smarter by imputing values (I'll use random sampling of non-NA values) and then deduplicating.

dt2 |>
  mutate(
    .by = c(Name, Birth),
    across(everything(), ~ if_else(is.na(.x), na.omit(.x)[1], .x))
  ) |>
  distinct()
# # A tibble: 6 × 9
#   Name     Birth      CNS   Document SystolicBP1 DiastolicBP1 SystolicBP2 DiastolicBP2 HBA1c
#   <chr>    <date>     <chr> <chr>          <dbl>        <dbl>       <dbl>        <dbl> <dbl>
# 1 Pedro    1994-07-29 700   999              120           80          NA           NA   7  
# 2 Pedro    1994-07-29 700   104              120           80          NA           NA   7  
# 3 Cristina 1960-04-01 701   105              160          100         148           90   8.2
# 4 Walter   1956-12-22 702   106              152          100          NA           NA   8  
# 5 Sara     1997-02-20 703   107              114           92          NA           NA   9  
# 6 Mateus   1994-07-25 704   108               NA           NA         100           82   6.5

One side-effect of doing na.omit(.x)[1] is that if all values (in that group for that column) are NA, then while na.omit(.x) will return c(), the [1] will force it to return NA.

FYI, the use of .by= requires dplyr_1.1.0 or newer; if you have an older version, change from mutate(.by=c(..), stuff) to group_by(..) |> mutate(stuff) |> ungroup().