I have two data frames:
1.df1 has information about exams done in a cohort, and has the following structure:
PersonID | ExamDate | ExamResult |
---|---|---|
001 | 02/03/2023 | Positive |
001 | 08/09/2019 | Negative |
002 | 01/30/2004 | Positive |
003 | 12/11/2022 | Positive |
003 | 01/02/2015 | Negative |
003 | 11/13/2014 | Negative |
... | ... | ... |
where one unique PersonID can have multiple ExamDates and ExamResults (e.g. ID 001 has 2 exams, ID 002 1 exam, etc.).
- df2 has the following structure:
PersonID | EventDate |
---|---|
001 | 01/01/2020 |
003 | 11/16/2016 |
... | ... |
where there can only be one EventDate per PersonID, and some PersonIDs can not have an *EventDate *(e.g. ID 002 does not appear in the table).
I want to cross-reference the EventDates on df2 so that I have a new dataframe with the same structure as df1, but that only includes rows with ExamDates that precede the EventDate.
For example, for the example above, the new dataframe would be:
PersonID | ExamDate | ExamResult |
---|---|---|
001 | 08/09/2019 | Negative |
002 | 01/30/2004 | Positive |
003 | 01/02/2015 | Negative |
003 | 11/13/2014 | Negative |
where all rows for PersonID 001 with ExamDate after 01/01/2020 are removed, all rows for PersonID 002 are included (since they don't appear in df2), and all rows for PersonID 003 with ExamDate after 11/16/2016 are excluded.
df2 has around 1,000,000 rows and df2 has around 20,000 rows, so I'm trying to figure out what would be the most efficient way to achieve this on R? I tried using a for loop but it took a long time and didn't work well. Thanks in advance for your help.
To efficiently join those dataframes I suggest making them tables instead, so you need to load library(data.table), and, optionally, library(dplyr) which is commonly associated to data.table. I expect this will be faster than a loop however I have not tested, I suggest you try it small scale first.
nb the "rolling join" (roll = TRUE) finds the nearest EventDate for each ExamDate in df1 that is less than or equal to the EventDate in df2. If there is no EventDate for a PersonID that PersonID is ratained in the result (the EventDate is treated as NA).