Dropping observations after Fuzzy Match

48 Views Asked by At

I am doing some fuzzy matching using the 'matchit' command in Stata. After the fuzzy match, my data looks something like this

Identifier Variable B Variable C Similarity Score
1 A X 0.4
1 A Y 0.6
1 A Z 1
1 B Y 0.2
1 B X 0.7
1 B Z 0.8

For each unique Variable B, I want to keep the row with highest similarity score. However, I have an exception to make. If two unique observations in Variable B matches the best to the same entry in Variable C and one has similarity score of 1, then I want to keep the row with second highest similarity score. So, the final table should look like this:

Identifier Variable B Variable C Similarity Score
1 A Z 1
1 B X .7
1

There are 1 best solutions below

0
On

First, we find the rows with similarity score of 1. Call them perfect matches.

// Input example data
input Identifier str1 Variable_B str1 Variable_C Similarity_Score
               1             "A"             "X"              0.4
               1             "A"             "Y"              0.6
               1             "A"             "Z"                1
               1             "B"             "Y"              0.2
               1             "B"             "X"              0.7
               1             "B"             "Z"              0.8
end

// Find rows with a similarity score of 1
preserve
keep if Similarity_Score == 1
tempfile perfect_match
save "`perfect_match'", replace

"`perfect_match'" looks like:

Identifier Variable_B Variable_C Similarity_Score
1 A Z 1

Then we get all values of Variable C in those perfect matches.

// Find which `Variable C` has a similarity score of 1
keep Variable_C
duplicates drop
tempfile temp
save "`temp'", replace

"`temp'" looks like:

Variable_C
Z

We don't want those values in "`temp'" to be matched with anything else, so drop them in the rawdata.

// Drop those `Variable C` above from the rawdata, so a perfectly matched
// `Variable C` will no longer be there
restore
merge m:1 Variable_C using "`temp'"
drop if _merge == 3
drop _merge

The remaining data look like:

Identifier Variable_B Variable_C Similarity_Score
1 B X .7
1 A X .4
1 A Y .6
1 B Y .2

In the rest of data, find the best match.

// Find the best match in the remaining data
gsort Variable_B -Similarity_Score
collapse (firstnm) Identifier Variable_C Similarity_Score, by(Variable_B)

Now the data looks like:

Variable_B Identifier Variable_C Similarity_Score
A 1 Y .6
B 1 X .7

Please note that the current match for Variable_B == "A" is wrong! This is expected, as we've removed perfect matches in the first step. Now merge them back, and use them replace the wrong matches.

// Merge the perfect matches back into the data
merge 1:1 Variable_B using "`perfect_match'", replace update nogen
order Identifier Variable_B Variable_C Similarity_Score
sort Variable_B

Here is the final output:

Identifier Variable_B Variable_C Similarity_Score
1 A Z 1
1 B X .7