I have a dataset that contains a field with individuals' names. Some of the names are similar, with minute differences, such as 'CANON INDIA PVT. LTD' and 'CANON INDIA PVT. LTD.', 'Antila,Thomas' and 'ANTILA THOMAS', or 'Z_SANDSTONE COOLING LTD' and 'SANDSTONE COOLING LTD'. I need to identify such fuzzy duplicates and create a new subset containing these records. The actual table is huge, so I'm only showing a sample:
| Name | City |
|-------------------------|:-------:|
| CANON PVT. LTD | Georgia |
| Antila,Thomas | Georgia |
| Greg | Georgia |
| St.Luke's Hospital | Georgia |
| Z_SANDSTONE COOLING LTD | Georgia |
| St.Luke's Hospital | Georgia |
| CANON PVT. LTD. | Georgia |
| SANDSTONE COOLING LTD | Georgia |
| Greg | Georgia |
| ANTILA,THOMAS | Georgia |
I want the output to be:
| Name | City |
|-------------------------|:-------:|
| CANON PVT. LTD | Georgia |
| CANON PVT. LTD. | Georgia |
| Antila,Thomas | Georgia |
| ANTILA,THOMAS | Georgia |
| Z_SANDSTONE COOLING LTD | Georgia |
| SANDSTONE COOLING LTD | Georgia |
I tried using RecordLinkage and agrep, but they just return the original data:
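```r
library(RecordLinkage)

# Return the element(s) of stringVector most similar to string
ClosestMatch2 = function(string, stringVector){
  # levenshteinSim() gives a similarity in [0, 1]; 1 means identical
  distance = levenshteinSim(string, stringVector)
  stringVector[distance == max(distance)]
}
Fuzzy_duplicate = ClosestMatch2(df$Name, df$Name)
```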
The other method was:
```r
lapply(df$Name, agrep, df$Name, value = TRUE)
```
Using agrep this way returns a list with one vector of matches per name (or vector indices, without `value = TRUE`). However, I want to extract only the records whose names are similar to another record's name.
First, the easy case: you know for certain that all records are duplicated, either perfectly or approximately (as in the example `df`).
In this case getting just the approximate duplicates is easy: you keep all the records that aren't perfect duplicates, which can be done with `dplyr::filter()` and `duplicated()`.
Now assume the harder case, where you don't know for certain that all the records are duplicates (either exact matches or close matches), and you want to extract only the approximate duplicates.
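Before turning to the harder case, here is a minimal sketch of the easy one. It assumes `df` is the sample table from the question, rebuilt by hand here, and that every record really does have an exact or approximate partner. The name `approx_dups` is just illustrative.

```r
library(dplyr)

# Sample data from the question, rebuilt by hand (an assumption about its shape)
df <- data.frame(
  Name = c("CANON PVT. LTD", "Antila,Thomas", "Greg", "St.Luke's Hospital",
           "Z_SANDSTONE COOLING LTD", "St.Luke's Hospital", "CANON PVT. LTD.",
           "SANDSTONE COOLING LTD", "Greg", "ANTILA,THOMAS"),
  City = "Georgia",
  stringsAsFactors = FALSE
)

# duplicated() only flags the second and later copies, so it is combined
# with fromLast = TRUE to flag every member of an exact-duplicate group
approx_dups <- df %>%
  filter(!(duplicated(Name) | duplicated(Name, fromLast = TRUE)))

approx_dups
```

On the sample this drops both copies of "Greg" and "St.Luke's Hospital" and keeps the other six rows. It only gives the right answer under the stated assumption: a name with no partner at all would also be kept.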
In the harder case you've got to pick some approximate string matching algorithm and compare each item against all the other items. I've used the `stringdist` package because I'm familiar with it.
Another thing to be careful of, though, is that approximate string matching will also match the perfect duplicates, which we don't want to include in our output, so we need to remove perfect duplicates before returning.
To check that this works, `dd` is the example data with some extra names added that are not duplicates at all (either approximate or perfect). Only the approximate duplicates should come back.
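The harder case could be sketched roughly as follows. This is not necessarily the original code, just one way to do it under some assumptions: names are lower-cased before comparison, Levenshtein distance (`method = "lv"`) is the metric, and `max_dist <- 2` is an arbitrary threshold you would tune for your data. The two extra names in `dd` ("Microsoft", "Jane Doe") are made up for the check.

```r
library(stringdist)

# Question's sample plus two made-up names that have no duplicate at all
dd <- data.frame(
  Name = c("CANON PVT. LTD", "Antila,Thomas", "Greg", "St.Luke's Hospital",
           "Z_SANDSTONE COOLING LTD", "St.Luke's Hospital", "CANON PVT. LTD.",
           "SANDSTONE COOLING LTD", "Greg", "ANTILA,THOMAS",
           "Microsoft", "Jane Doe"),
  City = "Georgia",
  stringsAsFactors = FALSE
)

max_dist <- 2                # assumed threshold, tune for your data
clean <- tolower(dd$Name)    # ignore case when measuring distance

# Pairwise Levenshtein distances between all (lower-cased) names
d <- stringdistmatrix(clean, clean, method = "lv")

# TRUE where the original spellings differ, so exact duplicates
# (and each name compared with itself) never count as a match
different <- outer(dd$Name, dd$Name, "!=")

# Keep a row if some differently-spelled name lies within the threshold
is_fuzzy_dup <- rowSums(d <= max_dist & different) > 0
fuzzy_dups <- dd[is_fuzzy_dup, ]

fuzzy_dups
```

With this threshold the sketch keeps the two CANON rows, the two Antila rows and the two SANDSTONE rows, and drops "Greg", "St.Luke's Hospital" and the two made-up names. Note that the full distance matrix grows with the square of the number of rows, so on a huge table you would probably need to compare within groups (for example per City) rather than all pairs at once.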