I'm trying to join two datasets on based on the values of two variables. Both datasets have the same variable names/number of columns but may have a different number of rows. I want to join them based on a grouping variable ("SampleID") and a variable that contains text strings ("PrimConstruct"). I want to use fuzzy matching to account for slight misspellings and differences in capitalizations in "PrimConstruct". There's also the chance that one dataset has a row with a text value that the other dataset does not, in which case I would like it to leave that as a separate row.
df1 <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201",
"bcd0201"), PrimConstruct_a = c("cohesion", "cognition",
"cohesion", "cognition"))
df2 <- data.frame(SampleID_b = c("abc0101", "abc0101", "bcd0201", "bcd0201",
"bcd0201"), PrimConstruct_b = c("cohesion", "cognition",
"commitment", "Cohesion", "cognitiion"))
# df2 has misspelling, different capitalization,
# and entry with no close match
I would like the fuzzy matching to only occur within the same SampleID, so "cohesion" for abc0101 will not be matched with "cohesion" for bcd0201. I am eventually hoping to have a dataset that keeps both PrimConstruct columns and all the values, but the matched values are in the same row:
desireddf <- data.frame(SampleID_a = c("abc0101", "abc0101", "bcd0201", "bcd0201", "bcd0201"), PrimConstruct_a = c("cohesion", "cognition", "cohesion", "cognition", "NA"), PrimConstruct_a = c("cohesion", "cognition", "Cohesion", "cognitiion", "commitment")
I have tried stringdist_join but it ends up duplicating and adding a bunch of rows somehow.
joined <- stringdist_join(df1,
df2,
by = c("PrimConstruct_a" = "PrimConstruct_b",
"SampleID_a" = "SampleID_b"),
mode = "full",
method = "jw",
max_dist = 2,
ignore_case = T)
I'm a little confused by the different matching methods, but I don't expect differences in text strings to be much more than minor spelling mistakes and differences in capitalization, which I believe should be covered by ignore_case.
string distance with multiple colum matches and distinct join function (in your case join equi and one fuzzy join) can be done like this with the
fuzzyjoinpackage:As you might have noticed the match can return more than one column as possibly there are more matches satisfying the condition. So you could work the data now group wise, which is why I introduced "rn", as you might want onyl the closest match. To show you how you can get down to this, I will leave an optional approach that does not use the
fuzzyjoinbut thestringdistpackge (which fuzzyjoin is based one), as it might help you to understand the workings better:Depending on your use case there are some possible tweaks:
You could use tolower() or toupper() (result is the same) in the matching function of the fuzzyjoin or prior to the join or distance calculation on PrimConstruct_a and PrimConstruct_b, which will solve all stringdistances caused by upper vs lower case and leave only issues of letter diferences to be counted.
Read into the distinct stringdistance functions the
stringdistand thereforefuzzyjoinpackage have implemented. There are some diference which might be helpfull depending on the usecaseEDIT
you are looking for something like a full join, though it is not that easy. One way to solve it is assigning the data from the prio step to a new variable (called it "res") and identify non used cases from df2 to union them back to the result like this: