I have a dataframe like the example below:
### Packages needed for reproducible example
library(lubridate)
library(dplyr)
### Create data frame:
Person_IDs <- seq(1,1000000,1)
Example_DF <- as.data.frame(Person_IDs)
### Sex and age for matching:
set.seed(2021)
Example_DF$Sex <- sample(c("Male", "Female"), size = 1000000, replace = TRUE)
set.seed(2021)
Example_DF$Age <- sample(1:100, size = 1000000, replace = TRUE)
### Study start and end date (just for clarity):
Example_DF$Start_Date <- as.Date("2020-01-01")
Example_DF$End_Date <- as.Date("2021-05-01")
### Study outcome (85% not experiencing the outcome, 15% experiencing the outcome):
set.seed(2021)
Example_DF$Outcome <- sample(c(0, 1), size = 1000000, replace = TRUE, prob = c(0.85, 0.15))
### Timestamp for outcome (either as exposed (Outcome = 1) or censored (Outcome = 0)):
Example_DF$Timestamp_Outcome <- as.Date("1900-01-01")
set.seed(2021)
Example_DF$Timestamp_Outcome[Example_DF$Outcome == 1] <- Example_DF$Start_Date[Example_DF$Outcome == 1] + days(sample(45:295, size = sum(Example_DF$Outcome == 1), replace = TRUE))
set.seed(2021)
Example_DF$Timestamp_Outcome[Example_DF$Outcome == 0] <- Example_DF$Start_Date[Example_DF$Outcome == 0] + days(sample(275:340, size = sum(Example_DF$Outcome == 0), replace = TRUE))
### Arrange data by timestamp outcome:
Example_DF <- Example_DF %>% arrange(Timestamp_Outcome)
### Show first rows of data frame:
head(Example_DF)
As you can see, there are:
1000000 unique individuals (Person_IDs) with a common start date of "2020-01-01" (i.e. the column Start_Date is set to "2020-01-01" for all individuals) and a common end date (End_Date) of "2021-05-01".
Information on sex and age is available, which will be used to "match" IDs where Outcome == 1 with controls.
All individuals have a date of an outcome (either with Outcome == 0 or Outcome == 1).
**What I want to perform now is something referred to as risk set sampling (or incidence density sampling). The dataframe is arranged by timestamp of outcome and now:
Each time the "algorithm" encounters a row where Outcome == 1, it should randomly select three (3) Person_IDs who have the same sex, the same age AND a later timestamp (i.e. Timestamp_Outcome is at least one day later, regardless of whether Outcome == 0 or Outcome == 1).
These 4 individuals (the 1 exposed individual and the 3 unexposed individuals) should then be removed from the dataframe (i.e. replace = FALSE) and can thus NOT be selected again (referred to as sampling without replacement).**
To make this clearer, consider the following example:
head(Example_DF)
As you can see, Person_IDs 1030, 1269, 3180, 4245, etc. all experience the outcome on 2020-02-15. Taking Person_ID 1030 as an example, this is an 86-year-old female. She should thus be matched against three 86-year-old females NOT exposed on 2020-02-15 (they can become exposed on 2020-02-16, 2020-02-20 or any time after that). If this is not possible, as many matched individuals as possible should be selected (ranging from 0 to 3 matched individuals).
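To spell out which rows count as eligible controls for a single case, here is a small sketch in base R. The data frame `toy` and its six rows are made up for illustration (they are not rows of Example_DF); only the column names are taken from the example above.

```r
# Hypothetical toy data, using the same column names as Example_DF
toy <- data.frame(
  Person_IDs = 1:6,
  Sex = c("Female", "Female", "Female", "Female", "Male", "Female"),
  Age = c(86, 86, 86, 86, 86, 40),
  Outcome = c(1, 1, 0, 1, 0, 0),
  Timestamp_Outcome = as.Date(c("2020-02-15", "2020-02-15", "2020-10-02",
                                "2020-02-16", "2020-10-05", "2020-10-07"))
)

# The case to be matched: an 86-year-old female with the outcome on 2020-02-15
case <- toy[toy$Person_IDs == 1, ]

# Eligible controls: same sex, same age and a strictly later timestamp,
# whether or not they experience the outcome later on
risk_set <- toy[toy$Sex == case$Sex &
                toy$Age == case$Age &
                toy$Timestamp_Outcome > case$Timestamp_Outcome, ]

risk_set$Person_IDs
```

Person 2 is excluded because the timestamp is the same day, person 5 because of sex and person 6 because of age, so only persons 3 and 4 remain, and both would be selected since fewer than three candidates are available.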
Any idea of how this can be performed?
Here's a potential solution using data.table and recursion:
dtSamples now has 166588 sample sets of 6 persons each, with the first in each set being the exposed person.
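As a rough illustration of the matching rules in the question, here is a minimal base R sketch. It uses a plain greedy loop rather than data.table and recursion, so it is not the code that produced dtSamples, and it is too slow for 1000000 rows as written. The function name `risk_set_sample`, the `Set_ID` and `Role` columns, and the small simulated data frame `small` are assumptions made for the example.

```r
# Hypothetical helper: greedy risk set sampling without replacement.
# Walks through the cases in time order and draws up to n_controls
# still-available persons of the same sex and age with a later timestamp.
risk_set_sample <- function(df, n_controls = 3) {
  df <- df[order(df$Timestamp_Outcome), ]
  available <- rep(TRUE, nrow(df))
  df$Set_ID <- NA_integer_
  df$Role <- NA_character_
  next_set <- 0L
  for (i in which(df$Outcome == 1)) {
    if (!available[i]) next  # already used as a control for an earlier case
    candidates <- which(available &
                        df$Sex == df$Sex[i] &
                        df$Age == df$Age[i] &
                        df$Timestamp_Outcome > df$Timestamp_Outcome[i])
    # Index via sample.int(): sample(x) with a length-one x samples from 1:x
    n_pick <- min(n_controls, length(candidates))
    picked <- candidates[sample.int(length(candidates), n_pick)]
    next_set <- next_set + 1L
    df$Set_ID[c(i, picked)] <- next_set
    df$Role[i] <- "case"
    df$Role[picked] <- "control"
    available[c(i, picked)] <- FALSE  # sampling without replacement
  }
  df[!is.na(df$Set_ID), ]
}

# Small simulated data set (an assumption, far smaller than Example_DF)
set.seed(2021)
n <- 2000
small <- data.frame(
  Person_IDs = seq_len(n),
  Sex = sample(c("Male", "Female"), n, replace = TRUE),
  Age = sample(1:5, n, replace = TRUE),
  Outcome = sample(c(0, 1), n, replace = TRUE, prob = c(0.85, 0.15))
)
small$Timestamp_Outcome <- as.Date("2020-01-01") +
  ifelse(small$Outcome == 1,
         sample(45:295, n, replace = TRUE),
         sample(275:340, n, replace = TRUE))

matched <- risk_set_sample(small)
head(matched[order(matched$Set_ID), ])
```

Each `Set_ID` groups one case with between 0 and 3 controls. For the full data, the same logic would need to be restricted to one sex and age stratum at a time (or moved to data.table) to run in reasonable time.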