I am doing some fuzzy matching using the matchit command in Stata. After the fuzzy match, my data look something like this:
Identifier | Variable B | Variable C | Similarity Score |
---|---|---|---|
1 | A | X | 0.4 |
1 | A | Y | 0.6 |
1 | A | Z | 1 |
1 | B | Y | 0.2 |
1 | B | X | 0.7 |
1 | B | Z | 0.8 |
For each unique value of Variable B, I want to keep the row with the highest similarity score, with one exception. If two distinct values of Variable B both match best to the same entry in Variable C, and one of them has a similarity score of 1, then for the other one I want to keep the row with the second-highest similarity score. So the final table should look like this:
Identifier | Variable B | Variable C | Similarity Score |
---|---|---|---|
1 | A | Z | 1 |
1 | B | X | 0.7 |
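First, find the rows with a similarity score of 1 and call them perfect matches. They are saved in the tempfile "`perfect_match'". In the example, that is the single row pairing A with Z.

Then collect all values of Variable C that appear in those perfect matches and save them in "`temp'". In the example, "`temp'" contains only Z.

We don't want the values in "`temp'" to be matched with anything else, so drop every row with those values from the raw data. The remaining data are the four rows that pair A and B with X and Y.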
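A minimal sketch of these first three steps, rebuilding the example data from the question. The variable names Identifier and Similarity_Score are assumptions (only Variable_B appears in code form in this answer), and the merge key uses Variable_C alone; if values of Variable C should only be reserved within the same Identifier, keep Identifier in "`temp'" and add it to the key.

```stata
clear
* Example data from the question (variable names partly assumed)
input byte Identifier str1 Variable_B str1 Variable_C double Similarity_Score
1 "A" "X" 0.4
1 "A" "Y" 0.6
1 "A" "Z" 1
1 "B" "Y" 0.2
1 "B" "X" 0.7
1 "B" "Z" 0.8
end

tempfile perfect_match temp

preserve
    * Step 1: rows with a score of exactly 1 are the perfect matches
    keep if Similarity_Score == 1
    save "`perfect_match'"

    * Step 2: distinct Variable_C values used by a perfect match
    keep Variable_C
    duplicates drop
    save "`temp'"
restore

* Step 3: keep only rows whose Variable_C is NOT in `temp'
merge m:1 Variable_C using "`temp'", keep(master) nogenerate
list, sepby(Variable_B)
```

Run this as one do-file: the tempfile locals do not survive if the lines are executed one at a time.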
In the rest of the data, find the best match for each value of Variable B. Now A is paired with Y (score 0.6) and B is paired with X (score 0.7).
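One way to pick the best remaining match is to sort by score within each group and keep the last row. The sketch below starts from the four rows left at this stage so that it runs on its own; the variable names Identifier and Similarity_Score are assumed as before.

```stata
clear
* Rows remaining once the perfectly matched Variable_C values are removed
input byte Identifier str1 Variable_B str1 Variable_C double Similarity_Score
1 "A" "X" 0.4
1 "A" "Y" 0.6
1 "B" "Y" 0.2
1 "B" "X" 0.7
end

* Within each (Identifier, Variable_B) group, sort ascending by score
* and keep the last observation, i.e. the highest score.
* Caveat: missing scores sort last in Stata, and ties are broken arbitrarily.
bysort Identifier Variable_B (Similarity_Score): keep if _n == _N
list
```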
Please note that the current match for Variable_B == "A" is wrong! This is expected, since we removed the perfect matches in the first step. Now merge them back and use them to replace the wrong matches. The final output is the table requested in the question: A is matched with Z (score 1) and B with X (score 0.7).
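The merge-back step can be sketched with merge's update and replace options, which overwrite master values with the values from the using file when the keys match. To keep the block runnable on its own, it recreates the perfect-match file and the best-match rows from the example. It assumes each (Identifier, Variable_B) pair has at most one perfect match; otherwise the 1:1 merge will stop with an error.

```stata
clear
* Perfect matches saved in the first step (recreated here for the example)
input byte Identifier str1 Variable_B str1 Variable_C double Similarity_Score
1 "A" "Z" 1
end
tempfile perfect_match
save "`perfect_match'"

clear
* Best matches found among the remaining rows
input byte Identifier str1 Variable_B str1 Variable_C double Similarity_Score
1 "A" "Y" 0.6
1 "B" "X" 0.7
end

* Overwrite Variable_C and Similarity_Score with the perfect match where one exists.
* Perfect matches with no remaining row in the master data are appended as new rows.
merge 1:1 Identifier Variable_B using "`perfect_match'", update replace nogenerate
sort Identifier Variable_B
list
```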