Calculate person-time per calendar month using two date columns as references

173 Views Asked by At

I have a dataframe like the one below in R:

### Packages
library(tidyverse)
library(Epi)
library(survival)
library(lubridate)

### Create data:
End_Date <- as.Date("1968-01-01") + days(sample (c(250:365), size=500, replace =T))
Example_DF <- as.data.frame(End_Date)
Example_DF$Start_Date <- as.Date("1968-01-01")
Example_DF$Exposure <- Example_DF$End_Date - days(sample (c(1:249), size=500, replace =T))
Example_DF$ID <- seq(1,500,1)

What I want do is to for each calendar month from 1968-01 until and including 1969-05 create two new columns per calendar month that is summing up the number of days of person-time each person (ID) is providing as unexposed and exposed, respectively. These columns can for example be called 1968_01_Unexposed, 1968_01_Exposed etc.

The date of exposure is found in the column Exposure. What I want in the end is thus a dataframe with 41 columns (4 in the original dataframe plus 34 columns (2 per 17 calendar month between 1968-01 and 1969-05)). For example ID 1 is having 31 person days as unexposed for 1968-01, 0 days as exposed for 1968-01 etc until 1968-07, where ID 1 has 10 days of unexposed and 21 days as exposed.

Anyone knows how this can be done in a convenient way?

1

There are 1 best solutions below

0
Ray On BEST ANSWER

The following should get you going. In fact, you have developed part of the "algorithm" already yourself with the last para of your problem description.

Working with {tidyverse} and tibbles/data frames try to think in vectors/columns before presenting the result in a more human-readable wide-way.

I demonstrate the initial part of how you can go about it with your first 2 entries and solving the logical conditions on the number of days.

I leave it to you to apply this approach then to the exposed days and read up on {tidyr}'s pivot_wider() to spread your results across the desired columns.

While you provide some sample data and thus a reproducible example, the sample seems not to operate on 17 months. I did not check the example for further consistency.

library(tidyverse)
library(lubridate)

# first problem - each ID needs a month entry for our time horizon ---------------
## define the  time horizon
Month_Bin <- seq(from = min(Example_DF$Start_Date)
                 , to = max(Example_DF$End_Date)
                 , by = "month")

## expand your (here first 2 entries) over the time horizon
Example_DF[1:2,] %>%        # with [1:2,] the df is truncated to the first 2 rows - remove for full example
  expand(ID, Month_Bin)  

# combine with original data set to calculate conditions -----------------------

Example_DF[1:2,] %>% 
    expand(ID, Month_Bin) %>% 
    left_join(Example_DF, by = "ID") 

# with this data we can now work on the conditions and --------------------------
# determine the days
Example_DF[1:2,] %>% 
    expand(ID, Month_Bin) %>% 
    left_join(Example_DF, by = "ID") %>% 

## --------------- let's define whether the Month_Bin is before Exposure
## --------------- lubridate let's you work with "floored" dates ~ first of month 
mutate(
  Unexposed = floor_date( Exposure, "month") > floor_date(Month_Bin, "month")
, Exposed = floor_date(Exposure, "month")    < floor_date(Month_Bin, "month")) %>%

## -------------- now you can detemine the days per month based on the condition
## -------------- multiple if-else() conditions are nicely packed into case_when
 mutate(
    Unexposed_Days = case_when(
         Unexposed  & !Exposed ~ days_in_month(Month_Bin)
        ,!Unexposed & !Exposed ~ as.integer(difftime(Exposure, Month_Bin, "days"))
        ,TRUE ~ as.integer(NA)    # case_when() requires type consistency for default
        )
    ) %>% 
#--------------- for presentation I force the first 20 rows (ignore this)
head(20)

This yields:

# A tibble: 20 x 8
      ID Month_Bin  End_Date   Start_Date Exposure   Unexposed Exposed Unexposed_Days
   <dbl> <date>     <date>     <date>     <date>     <lgl>     <lgl>            <int>
 1     1 1968-01-01 1968-09-21 1968-01-01 1968-02-25 TRUE      FALSE               31
 2     1 1968-02-01 1968-09-21 1968-01-01 1968-02-25 FALSE     FALSE               24
 3     1 1968-03-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
 4     1 1968-04-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
 5     1 1968-05-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
 6     1 1968-06-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
 7     1 1968-07-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
 8     1 1968-08-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
 9     1 1968-09-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
10     1 1968-10-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
11     1 1968-11-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
12     1 1968-12-01 1968-09-21 1968-01-01 1968-02-25 FALSE     TRUE                NA
13     2 1968-01-01 1968-12-11 1968-01-01 1968-06-21 TRUE      FALSE               31
14     2 1968-02-01 1968-12-11 1968-01-01 1968-06-21 TRUE      FALSE               29
15     2 1968-03-01 1968-12-11 1968-01-01 1968-06-21 TRUE      FALSE               31
16     2 1968-04-01 1968-12-11 1968-01-01 1968-06-21 TRUE      FALSE               30
17     2 1968-05-01 1968-12-11 1968-01-01 1968-06-21 TRUE      FALSE               31
18     2 1968-06-01 1968-12-11 1968-01-01 1968-06-21 FALSE     FALSE               20
19     2 1968-07-01 1968-12-11 1968-01-01 1968-06-21 FALSE     TRUE                NA
20     2 1968-08-01 1968-12-11 1968-01-01 1968-06-21 FALSE     TRUE                NA

You should be able to construct the required number of days for the exposed case.

Then read up on {tidyr} and pivot_longer to spread your long table to a wide format that you want to have.