Computing number of participants by timepoints

89 Views Asked by At

I have a short question about computing the number of participants by timepoints. Consider the sample long format data:

data<-data.frame(id=c(1,1,1,1,2,2,2,3,3,3,3),survey_date=c("01/12/2020","02/12/2020","03/12/2020","04/12/2020","01/31/2020","03/12/2020","04/05/2020","02/12/2020","04/12/2020","05/12/2020","06/12/2020"),last_seen=c('05/25/2020','05/25/2020','05/25/2020','05/25/2020','04/06/2020','04/06/2020','04/06/2020','','','',''))

Survey date indicates when the survey took place. \Some subjects in the data became lost to follow-up, we do have that last-seen date recorded in the data for those subjects (it appears in all rows for that subject).

I would like to add a column 'num_N' to the existing data indicating the number of participants who were in the study at the timepoint survey_date.

For example, on 01/12/2020, only subject id=1 was in the study, so num_N for that row would be 1.
On 06/12/2020, only subject id=3 was in the study, so num_N for that row would be 1.

Any help would be appreciated. Thanks!

3

There are 3 best solutions below

0
On BEST ANSWER

This is a good case for iv_count_between from ivs:

library(dplyr)
library(ivs)

#Convert your dates to date format
data <- data |> 
  mutate(across(-id, lubridate::mdy)) 

#Create intervals
data_ivs <- 
  data |> 
  summarise(min = min(survey_date), 
            max = max(survey_date, last_seen, na.rm = TRUE), 
            .by = id) |> 
  mutate(ivs = iv(min, max + 1))

#Count intervals with `iv_count_between`:
data |> 
  mutate(num_N = iv_count_between(survey_date, data_ivs$ivs))

#    id survey_date  last_seen num_N
# 1   1  2020-01-12 2020-05-25     1
# 2   1  2020-02-12 2020-05-25     3
# 3   1  2020-03-12 2020-05-25     3
# 4   1  2020-04-12 2020-05-25     2
# 5   2  2020-01-31 2020-04-06     2
# 6   2  2020-03-12 2020-04-06     3
# 7   2  2020-04-05 2020-04-06     3
# 8   3  2020-02-12       <NA>     3
# 9   3  2020-04-12       <NA>     2
# 10  3  2020-05-12       <NA>     2
# 11  3  2020-06-12       <NA>     1
0
On

First things first. When dealing with strings representing dates in R, you need to convert these to actual date objects in order to perform arithmetic or logical tests on them. For your case, I would also convert the NA values in the last column into some date in the future of the last survey date (for example today's date):

library(tidyverse)

data2 <- data %>%
  mutate(across(-id, ~as.Date(.x, format = '%m/%d/%Y'))) %>%
  mutate(last_seen = if_else(is.na(last_seen), Sys.Date(), last_seen))

Now your data looks like this:

data2
#>    id survey_date  last_seen
#> 1   1  2020-01-12 2020-05-25
#> 2   1  2020-02-12 2020-05-25
#> 3   1  2020-03-12 2020-05-25
#> 4   1  2020-04-12 2020-05-25
#> 5   2  2020-01-31 2020-04-06
#> 6   2  2020-03-12 2020-04-06
#> 7   2  2020-04-05 2020-04-06
#> 8   3  2020-02-12 2023-10-24
#> 9   3  2020-04-12 2023-10-24
#> 10  3  2020-05-12 2023-10-24
#> 11  3  2020-06-12 2023-10-24

From this we can construct a little lookup table

lookup <- data2 %>%
  group_by(id) %>%
  summarise(first_date = min(survey_date), 
            last_date  = first(last_seen))

And we can get the number of participants at each survey date using simple logical tests:

data2 <- data2 %>%
  rowwise() %>%
  mutate(num_N = sum(survey_date >= lookup$first_date &
                     survey_date <= lookup$last_date))

So your data looks like this:

data2
#> # A tibble: 11 x 4
#> # Rowwise: 
#>       id survey_date last_seen  num_N
#>    <dbl> <date>      <date>     <int>
#>  1     1 2020-01-12  2020-05-25     1
#>  2     1 2020-02-12  2020-05-25     3
#>  3     1 2020-03-12  2020-05-25     3
#>  4     1 2020-04-12  2020-05-25     2
#>  5     2 2020-01-31  2020-04-06     2
#>  6     2 2020-03-12  2020-04-06     3
#>  7     2 2020-04-05  2020-04-06     3
#>  8     3 2020-02-12  2023-10-24     3
#>  9     3 2020-04-12  2023-10-24     2
#> 10     3 2020-05-12  2023-10-24     2
#> 11     3 2020-06-12  2023-10-24     1

If you really wanted to, you could add this final column back into your original data:

within(data, num_N <- data2$num_N)
#>    id survey_date  last_seen num_N
#> 1   1  01/12/2020 05/25/2020     1
#> 2   1  02/12/2020 05/25/2020     3
#> 3   1  03/12/2020 05/25/2020     3
#> 4   1  04/12/2020 05/25/2020     2
#> 5   2  01/31/2020 04/06/2020     2
#> 6   2  03/12/2020 04/06/2020     3
#> 7   2  04/05/2020 04/06/2020     3
#> 8   3  02/12/2020                3
#> 9   3  04/12/2020                2
#> 10  3  05/12/2020                2
#> 11  3  06/12/2020                1
0
On

Here's a ridiculously convoluted method.

Prepare the provided data:

library(tidyverse)
data<-data.frame(id=c(1,1,1,1,2,2,2,3,3,3,3),survey_date=c("01/12/2020","02/12/2020","03/12/2020","04/12/2020","01/31/2020","03/12/2020","04/05/2020","02/12/2020","04/12/2020","05/12/2020","06/12/2020"),last_seen=c('05/25/2020','05/25/2020','05/25/2020','05/25/2020','04/06/2020','04/06/2020','04/06/2020','','','',''))

data <- data |> mutate(across(2:3, \(x) as.Date(x,format = "%m/%d/%Y")))
# set the date formats

Next, prepare a table of participant start/end dates:

data_ranges <- data |> mutate(first_seen = min(survey_date),
               last_seen = if_else(!is.na(last_seen),
                                   last_seen,
                                   max(survey_date)),
               .by = id) |> 
  select(-survey_date) |> 
  distinct()

Next, prepare a helper table with every date between the earliest start and the latest departure from the study, and count how many of the participants bracket that date:

data2 <- tibble(
  date = seq(from = min(data$survey_date),
             to = max(c(data$survey_date, data$last_seen), na.rm = TRUE),
             by = "days"),
  n = sapply(date, 
             \(date) nrow(filter(data_ranges, first_seen <= date & last_seen >= date)))
)

and combine the datasets:

left_join(data, data2, by = c("survey_date" = "date"))  

   id survey_date  last_seen n
1   1  2020-01-12 2020-05-25 1
2   1  2020-02-12 2020-05-25 3
3   1  2020-03-12 2020-05-25 3
4   1  2020-04-12 2020-05-25 2
5   2  2020-01-31 2020-04-06 2
6   2  2020-03-12 2020-04-06 3
7   2  2020-04-05 2020-04-06 3
8   3  2020-02-12       <NA> 3
9   3  2020-04-12       <NA> 2
10  3  2020-05-12       <NA> 2
11  3  2020-06-12       <NA> 1

I am not happy with this as a 'final' version - in particular, the sapply seems very slow and unweildy, but I'm out of time I'm afraid!