Dropping observations after Fuzzy Match

Question

Dropping observations after Fuzzy Match

57 Views Asked by Samira At 28 June 2025 at 07:47

I am doing some fuzzy matching using the 'matchit' command in Stata. After the fuzzy match, my data looks something like this

Identifier	Variable B	Variable C	Similarity Score
1	A	X	0.4
1	A	Y	0.6
1	A	Z	1
1	B	Y	0.2
1	B	X	0.7
1	B	Z	0.8

For each unique Variable B, I want to keep the row with highest similarity score. However, I have an exception to make. If two unique observations in Variable B matches the best to the same entry in Variable C and one has similarity score of 1, then I want to keep the row with second highest similarity score. So, the final table should look like this:

Identifier	Variable B	Variable C	Similarity Score
1	A	Z	1
1	B	X	.7

Original Q&A

There are 1 best solutions below

**harning** · Answer 1

First, we find the rows with similarity score of 1. Call them perfect matches.

// Input example data
input Identifier str1 Variable_B str1 Variable_C Similarity_Score
               1             "A"             "X"              0.4
               1             "A"             "Y"              0.6
               1             "A"             "Z"                1
               1             "B"             "Y"              0.2
               1             "B"             "X"              0.7
               1             "B"             "Z"              0.8
end

// Find rows with a similarity score of 1
preserve
keep if Similarity_Score == 1
tempfile perfect_match
save "`perfect_match'", replace

"`perfect_match'" looks like:

Identifier	Variable_B	Variable_C	Similarity_Score
1	A	Z	1

Then we get all values of Variable C in those perfect matches.

// Find which `Variable C` has a similarity score of 1
keep Variable_C
duplicates drop
tempfile temp
save "`temp'", replace

"`temp'" looks like:

Variable_C
Z

We don't want those values in "`temp'" to be matched with anything else, so drop them in the rawdata.

// Drop those `Variable C` above from the rawdata, so a perfectly matched
// `Variable C` will no longer be there
restore
merge m:1 Variable_C using "`temp'"
drop if _merge == 3
drop _merge

The remaining data look like:

Identifier	Variable_B	Variable_C	Similarity_Score
1	B	X	.7
1	A	X	.4
1	A	Y	.6
1	B	Y	.2

In the rest of data, find the best match.

// Find the best match in the remaining data
gsort Variable_B -Similarity_Score
collapse (firstnm) Identifier Variable_C Similarity_Score, by(Variable_B)

Now the data looks like:

Variable_B	Identifier	Variable_C	Similarity_Score
A	1	Y	.6
B	1	X	.7

Please note that the current match for Variable_B == "A" is wrong! This is expected, as we've removed perfect matches in the first step. Now merge them back, and use them replace the wrong matches.

// Merge the perfect matches back into the data
merge 1:1 Variable_B using "`perfect_match'", replace update nogen
order Identifier Variable_B Variable_C Similarity_Score
sort Variable_B

Here is the final output:

Identifier	Variable_B	Variable_C	Similarity_Score
1	A	Z	1
1	B	X	.7

Dropping observations after Fuzzy Match

There are 1 best solutions below

Related Questions in STATA

Related Questions in MATCHIT

Trending Questions

Popular # Hahtags

Popular Questions