Cross-referencing data frames to select rows within specific date ranges

51 Views Asked by At

I have two data frames:

1.df1 has information about exams done in a cohort, and has the following structure:

PersonID ExamDate ExamResult
001 02/03/2023 Positive
001 08/09/2019 Negative
002 01/30/2004 Positive
003 12/11/2022 Positive
003 01/02/2015 Negative
003 11/13/2014 Negative
... ... ...

where one unique PersonID can have multiple ExamDates and ExamResults (e.g. ID 001 has 2 exams, ID 002 1 exam, etc.).

  1. df2 has the following structure:
PersonID EventDate
001 01/01/2020
003 11/16/2016
... ...

where there can only be one EventDate per PersonID, and some PersonIDs can not have an *EventDate *(e.g. ID 002 does not appear in the table).

I want to cross-reference the EventDates on df2 so that I have a new dataframe with the same structure as df1, but that only includes rows with ExamDates that precede the EventDate.

For example, for the example above, the new dataframe would be:

PersonID ExamDate ExamResult
001 08/09/2019 Negative
002 01/30/2004 Positive
003 01/02/2015 Negative
003 11/13/2014 Negative

where all rows for PersonID 001 with ExamDate after 01/01/2020 are removed, all rows for PersonID 002 are included (since they don't appear in df2), and all rows for PersonID 003 with ExamDate after 11/16/2016 are excluded.

df2 has around 1,000,000 rows and df2 has around 20,000 rows, so I'm trying to figure out what would be the most efficient way to achieve this on R? I tried using a for loop but it took a long time and didn't work well. Thanks in advance for your help.

1

There are 1 best solutions below

0
On

To efficiently join those dataframes I suggest making them tables instead, so you need to load library(data.table), and, optionally, library(dplyr) which is commonly associated to data.table. I expect this will be faster than a loop however I have not tested, I suggest you try it small scale first.

# Load packages
library(data.table)
library(dplyr)

# Convert data frames to data tables
df1 <- as.data.table(df1)
df2 <- as.data.table(df2)

# Convert date strings columns to Date (for date comparisons)
df1$ExamDate <- as.Date(df1$ExamDate, format="%m/%d/%Y")
df2$EventDate <- as.Date(df2$EventDate, format="%m/%d/%Y")

# Set keys for joining
setkey(df1, PersonID, ExamDate)
setkey(df2, PersonID, EventDate)

# Perform a "rolling join" (df1 rows will be returned even if there is no match to df2)
# and include df2 columns into df1 for filtering
df1 <- df1[df2, .(PersonID, ExamDate, ExamResult, EventDate = i.EventDate), roll = TRUE]

# Filter rows where ExamDate is greater than EventDate
df1 <- df1[ExamDate <= EventDate]

nb the "rolling join" (roll = TRUE) finds the nearest EventDate for each ExamDate in df1 that is less than or equal to the EventDate in df2. If there is no EventDate for a PersonID that PersonID is ratained in the result (the EventDate is treated as NA).