Can I match 2 data frames by a col, even if the data frames are of different length?


I am trying to combine 2 datasets of unequal length in a similar way to the example below for a longitudinal study. Dataset 1 includes each participant only once, with the row of data from their first weekly survey. Dataset 2 includes all surveys from all participants. I am trying to create a third dataset that accounts for missing weekly surveys. For example, if participant 2 missed their survey on the 17th of Jan, it will still show week 2, participant id and date with the rest of the cols blank. Any ideas on how to accomplish this are much appreciated as I am very new to R.

#dataframe 1 (many more value cols)

ID   date       value  Weeknumber
1    March 1      8       1
2    Jan 10       9       1
3    April 12     12      1
4    Dec 9        6       1




#Dataframe 2

ID     date      value
1      March 1    8
1      March 8    3
1      March 15   9
1      March 22   11
1      March 29   5
2      Jan 10     9
2      Jan 24     5
2      Jan 31     12
2      Feb 7      7
3      April 12   12
3      April 19   3
3      April 26   10
3      May 2      6
4      Dec 9      6
4      Dec 30     7
4      Jan 6      11

#Desired output:
ID     Date       Value   Week number
1      March 1    8           1
1      March 8    3           2
1      March 15   9           3
1      March 22   11          4
1      March 29   5           5
2      Jan 10     9           1
2      Jan 17                 2
2      Jan 24     5           3
2      Jan 31     12          4
2      Feb 7      7           5
3      April 12   12          1
3      April 19   3           2
3      April 26   10          3
3      May 2      6           4
3      May 9                  5
4      Dec 9      6           1
4      Dec 16                 2
4      Dec 23                 3
4      Dec 30     7           4
4      Jan 6      11          5
      

There are 2 best solutions below


Here is another approach to consider using tidyverse.

First, I would consider including years in your dates. With the year included, you can account for leap years when working out the dates of missing weeks. As you mention being very new to R, let me know if you want me to add details on converting the dates.
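As a rough sketch of that conversion: the strings below are the question's dates with an assumed year appended (the question does not give one). The format codes rely on English month names, so the result depends on your locale.

```r
# Dates from the question with an ASSUMED year (2020) appended
raw <- c("March 1 2020", "Jan 10 2020", "April 12 2020", "Dec 9 2020")

# %B = month name (abbreviated names are also accepted on input),
# %d = day of month, %Y = four-digit year
dates <- as.Date(raw, format = "%B %d %Y")
format(dates)
# "2020-03-01" "2020-01-10" "2020-04-12" "2020-12-09"

# Why the year matters: one week after Feb 24 depends on whether it is a leap year
leap     <- seq(as.Date("2020-02-24"), by = "week", length.out = 2)[2]
non_leap <- seq(as.Date("2021-02-24"), by = "week", length.out = 2)[2]
format(c(leap, non_leap))
# "2020-03-02" "2021-03-03"
```

Once the column is of class Date, date arithmetic such as `seq()` and `+ 7` works as expected.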

Next, selecting ID and date from your first data frame df1, you can group_by ID, where subsequent procedures are done within each ID. Using mutate and map you can add rows with a sequence of 5 weeks starting with the original date.

After that, you can merge with left_join the other data frame df2. The missing weeks will have NA for value. Finally, we can add the row_number() within each ID to be the Weeknumber.

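One final concern with the example data: the dates April 26 and May 2 are only 6 days apart. The join matches on exact dates, so a survey that is not exactly one week after the previous one will not be matched (in the output below, the May 2 value of 6 is lost and week 4 shows NA on 2020-05-03). There could be alternative approaches if the dates are not exactly one week apart.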
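One such alternative, as a sketch only: instead of joining on exact dates, assign each survey to the nearest weekly slot counted from the participant's first survey, then fill in the missing slots with `tidyr::complete()`. This assumes every survey falls within a few days of its scheduled date and that no two surveys round to the same week. The toy data and the years are assumptions based on the question.

```r
library(dplyr)
library(tidyr)

# Toy data for participants 3 and 4 from the question; the years are assumed
df2 <- data.frame(
  ID    = c(3, 3, 3, 3, 4, 4, 4),
  date  = as.Date(c("2020-04-12", "2020-04-19", "2020-04-26", "2020-05-02",
                    "2020-12-09", "2020-12-30", "2021-01-06")),
  value = c(12, 3, 10, 6, 6, 7, 11)
)

result <- df2 %>%
  group_by(ID) %>%
  # Week slot = days since the participant's first survey, rounded to whole weeks
  mutate(Weeknumber = as.integer(round(as.numeric(date - min(date)) / 7)) + 1L) %>%
  # Add a row for every week 1..5 that has no survey (other columns become NA)
  complete(Weeknumber = 1:5) %>%
  group_by(ID) %>%
  # Fill in the scheduled date for the rows that were just added
  mutate(date = coalesce(date, min(date, na.rm = TRUE) + 7 * (Weeknumber - 1L))) %>%
  ungroup()

result
```

Here the May 2 survey (20 days after April 12) lands in week 4 and keeps its value, while the genuinely missing weeks still appear with NA.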

library(tidyverse)

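# Assumes df1$date and df2$date are already of class Date
df1[,c("ID", "date")] %>%
  group_by(ID) %>%
  # list-column holding 5 weekly dates per participant, starting at the first survey
  mutate(date = map(date, seq.Date, length.out = 5, by = "week")) %>%
  # expand the list-column to one row per week
  unnest(cols = c(date)) %>%
  # attach the survey values; weeks without a survey get NA
  left_join(df2, by = c("ID", "date")) %>%
  # number the weeks within each participant
  mutate(Weeknumber = row_number())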

Output

      ID date       value Weeknumber
   <dbl> <date>     <dbl>      <int>
 1     1 2020-03-01     8          1
 2     1 2020-03-08     3          2
 3     1 2020-03-15     9          3
 4     1 2020-03-22    11          4
 5     1 2020-03-29     5          5
 6     2 2020-01-10     9          1
 7     2 2020-01-17    NA          2
 8     2 2020-01-24     5          3
 9     2 2020-01-31    12          4
10     2 2020-02-07     7          5
11     3 2020-04-12    12          1
12     3 2020-04-19     3          2
13     3 2020-04-26    10          3
14     3 2020-05-03    NA          4
15     3 2020-05-10    NA          5
16     4 2020-12-09     6          1
17     4 2020-12-16    NA          2
18     4 2020-12-23    NA          3
19     4 2020-12-30     7          4
20     4 2021-01-06    11          5
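If you would rather not load tidyverse, the same steps can be sketched in base R: build the weekly schedule, then `merge()` with `all.x = TRUE` so unmatched weeks are kept with NA. The small df1 and df2 below are assumptions that mirror participants 2 and 4 from the question, with assumed years.

```r
# Toy versions of the question's data (years are assumed)
df1 <- data.frame(ID = c(2, 4),
                  date = as.Date(c("2020-01-10", "2020-12-09")))
df2 <- data.frame(ID = c(2, 2, 2, 2, 4, 4, 4),
                  date = as.Date(c("2020-01-10", "2020-01-24", "2020-01-31",
                                   "2020-02-07", "2020-12-09", "2020-12-30",
                                   "2021-01-06")),
                  value = c(9, 5, 12, 7, 6, 7, 11))

# One row per participant per scheduled week (5 weeks each)
schedule <- do.call(rbind, lapply(seq_len(nrow(df1)), function(i) {
  data.frame(ID = df1$ID[i],
             date = seq(df1$date[i], by = "week", length.out = 5))
}))

# Keep every scheduled week; weeks with no survey get NA for value
out <- merge(schedule, df2, by = c("ID", "date"), all.x = TRUE)
out <- out[order(out$ID, out$date), ]

# Week number = position of the row within its participant
out$Weeknumber <- ave(seq_along(out$ID), out$ID, FUN = seq_along)
out
```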

A possible way to do this is with the function "match", but you need a proper map that maps one value to the other. Let's make an example. I generate a data frame with random letters in one column and numbers in the other:

adf=data.frame(a_lett=sample(letters, 10), a_num=1:10)

   a_lett a_num
1       o     1
2       b     2
3       t     3
4       v     4
5       a     5
6       x     6
7       u     7
8       e     8
9       h     9
10      c    10

Now I want to add another column by using the match function. So I generate my "map", which is another data frame that says which letters are vowels.

adf2=data.frame(voc_letter=c("a","e", "i", "o", "u"), is_vocal=paste0("vocal", 1:5))

  voc_letter is_vocal
1          a   vocal1
2          e   vocal2
3          i   vocal3
4          o   vocal4
5          u   vocal5

Keep in mind that this map is NOT complete: it doesn't map the consonants.

I can then use "match". "match" returns, for each element of the first argument, its position in the second argument (or NA if it is not found). We can, therefore, use these positions to pick the elements of the column adf2$is_vocal and assign them to a new column in adf.

adf$is_vocal=adf2$is_vocal[match(adf$a_lett, adf2$voc_letter)]


   a_lett a_num is_vocal
1       o     1   vocal4
2       b     2     <NA>
3       t     3     <NA>
4       v     4     <NA>
5       a     5   vocal1
6       x     6     <NA>
7       u     7   vocal5
8       e     8   vocal2
9       h     9     <NA>
10      c    10     <NA>

The many "NA" values are due to the fact that the consonants have no corresponding entry in the adf2 data frame.
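To connect this back to the question, here is a sketch of the same lookup on survey data. Because a row is identified by both ID and date, the two are pasted into a single key before calling match. The data below is a toy version of participant 2 with an assumed year, and the names n_weeks, grid, key_grid and key_df2 are only for illustration.

```r
# Toy data for participant 2 from the question (year is assumed)
df1 <- data.frame(ID = 2, date = as.Date("2020-01-10"))
df2 <- data.frame(ID = c(2, 2, 2, 2),
                  date = as.Date(c("2020-01-10", "2020-01-24",
                                   "2020-01-31", "2020-02-07")),
                  value = c(9, 5, 12, 7))

# Full schedule: every participant gets weeks 1..n_weeks
n_weeks <- 5
grid <- data.frame(ID = rep(df1$ID, each = n_weeks),
                   Weeknumber = rep(seq_len(n_weeks), times = nrow(df1)))
grid$date <- rep(df1$date, each = n_weeks) + 7 * (grid$Weeknumber - 1)

# Composite key "ID date" on both sides, then look up the value by position
key_grid <- paste(grid$ID, grid$date)
key_df2  <- paste(df2$ID, df2$date)
grid$value <- df2$value[match(key_grid, key_df2)]
grid
```

The week of Jan 17 has no key in df2, so match returns NA there and the value stays NA, exactly like the consonants above.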