edited - fixed the typos that Frank and ycw pointed out
I am having trouble identifying pairs of values in a dataframe that exist in another data frame. Because my data is in long format and each column contains the same value many times, I'm getting overly broad results when I try to identify matches. I've tried a number of ways and can't get the exact results I need. Given how simple the operation is, I expect I'm missing something very obvious.
For my example, I have a set of person records in long format, df ppl
. I also have another set of person observations for specific time periods, in this example, it's absences on certain days dfabs
. I want to mutate ppl
to create a new column, that will show when a person-day row matches a person-day observation in df dfabs
. I keep getting a problem when I try to mutate however, as it's not clear to me how to subset based on two columns at once.
Example below:
name <- c("Bob", "Bob", "Ana","Ana", "Jorge","Jorge")
day <- c(1,2,1,2,1,2)
eval <- c(4,3,5,5,2,5)
ppl <- as.data.frame(cbind(name,day, eval), names=TRUE)
name day eval
1 Bob 1 4
2 Bob 2 3
3 Ana 1 5
4 Ana 2 5
5 Jorge 1 2
6 Jorge 2 5
nameabs <- c("Bob","Jorge","Cindy")
dayabs <- c(1,2,1)
dfabs <- as.data.frame(cbind(nameabs, dayabs), names=TRUE)
dfabs
nameabs dayabs
1 Bob 1
2 Jorge 2
3 Cindy 1
The result I want to see is:
name day eval absent
1 Bob 1 4 1
2 Bob 2 3 0
3 Ana 1 5 0
4 Ana 2 5 0
5 Jorge 1 2 0
6 Jorge 2 5 1
To do this, I've tried using the mutate function:
mutate(ppl, absent = ifelse(((name %in% dfabs$nameabs) & (day %in% dfabs$dayabs)),1,0))
Result:
name day eval absent
1 Bob 1 4 1
2 Bob 2 3 1
3 Ana 1 5 0
4 Ana 2 5 0
5 Jorge 1 2 1
6 Jorge 2 5 1
This is not right. It lists Bob and Jorge as absent on both days. It seems that in my ifelse() statement, it evaluated the two logical expressions (name %in% dfabs$nameabs)
and (day %in% dfabs$dayabs)
on separate rows.
Using other functions such as filter()
gives me the same problem. I can't figure out how to make it evaluate the entire expression on one row. In general, with and without dplyr, I've been having a lot of trouble with this general problem: how to subset one df based on a pair's (across two columns) presence in another df? I'm working with a much larger and more complex data set, and having a generalizeable answer would be very helpful.
Thanks in advance for the help!