Identify fuzzy duplicates from a single column and create a subset containing records of fuzzy duplicates using R

Question

Identify fuzzy duplicates from a single column and create a subset containing records of fuzzy duplicates using R

561 Views Asked by Jazz At 12 July 2019 at 12:52

I have a dataset which contains a field with individual's name. Some of the names are similar with minute differences like 'CANON INDIA PVT. LTD' and 'CANON INDIA PVT. LTD.', 'Antila,Thomas' and 'ANTILA THOMAS', 'Z_SANDSTONE COOLING LTD' and 'SANDSTONE COOLING LTD' etc. I need to identify such fuzzy duplicates and create a new subset containing these records.I have a huge table containing such records,so, I'm just producing a sample.

| Name                    |   City  |
|-------------------------|:-------:|
| CANON PVT. LTD          | Georgia |
| Antila,Thomas           | Georgia |
| Greg                    | Georgia |
| St.Luke's Hospital      | Georgia |
| Z_SANDSTONE COOLING LTD | Georgia |
| St.Luke's Hospital      | Georgia |
| CANON PVT. LTD.         | Georgia |
| SANDSTONE COOLING LTD   | Georgia |
| Greg                    | Georgia |
| ANTILA,THOMAS           | Georgia |

I want the output to be:

| Name                    |   City  |
|-------------------------|:-------:|
| CANON PVT. LTD          | Georgia |
| CANON PVT. LTD.         | Georgia |
| Antila,Thomas           | Georgia |
| ANTILA,THOMAS           | Georgia |
| Z_SANDSTONE COOLING LTD | Georgia |
| SANDSTONE COOLING LTD   | Georgia |

I tried using RecordLinkage and agrep, but they give out the original data as output.

library(RecordLinkage)
ClosestMatch2 = function(string, stringVector){
  distance = levenshteinSim(string, stringVector);
  stringVector[distance == max(distance)]
  }
Fuzzy_duplicate=ClosestMatch2(df$Name, df$Name)

The other method was:

lapply(df$Name, agrep, df$Name, value = TRUE)

Using agrep gives the output as vector indices. However, I want to extract all the records belonging to only those whose names are similar?

Original Q&A

There are 3 best solutions below

**Josh White** · Answer 1 · 2022-11-14T06:59:19.573000

If you know for certain that all records are duplicated, either perfectly or approximately (as in example df).

In this case getting just the approximate duplicates is easy -- you just get all the ones that aren't perfect duplicates. Below is an example using the dplyr::filter() and duplicated()

library(dplyr)
library(tibble)

d <- tibble::tribble(
  ~Name, ~ City,
  "CANON PVT. LTD"           , "Georgia",
  'Antila,Thomas'            , "Georgia",
  "Greg"                     , "Georgia",
  "St.Luke's Hospital"       , "Georgia",
  "Z_SANDSTONE COOLING LTD"  , "Georgia",
  "St.Luke's Hospital"       , "Georgia",
  "CANON PVT. LTD."          , "Georgia",
  "SANDSTONE COOLING LTD"    , "Georgia",
  "Greg"                     , "Georgia",
  "ANTILA,THOMAS"            , "Georgia"  
)

d %>% 
  filter(!Name %in% Name[duplicated(Name)]) #%>%
  # arrange(Name) #arrange together if you want

# A tibble: 6 × 2
  Name                    City   
  <chr>                   <chr>  
1 CANON PVT. LTD          Georgia
2 Antila,Thomas           Georgia
3 Z_SANDSTONE COOLING LTD Georgia
4 CANON PVT. LTD.         Georgia
5 SANDSTONE COOLING LTD   Georgia
6 ANTILA,THOMAS           Georgia

Assuming the harder case where you don't know for certain that all the records are duplicates (either exact matches or close matches), and you want to extract only the approximate duplicates.

In this case you've got to find some approximate string matching algorithm and apply each item iteratively to all the other items. I've used the stringdist package, cause I'm familiar with it.

Another thing to be careful of though, is that approximate string matching will also match with the perfect duplicates, which we don't want to include in our output, so we need to remove perfect duplicates while returning.

In the example below, dd includes some extra Names that are not duplicates at all (either approximate or perfect), to check that this works.

library(stringdist)

dd <- tibble::tribble(
  ~Name, ~ City,
  "CANON PVT. LTD"           , "Georgia",
  'Antila,Thomas'            , "Georgia",
  "Greg"                     , "Georgia",
  "St.Luke's Hospital"       , "Georgia",
  "Z_SANDSTONE COOLING LTD"  , "Georgia",
  "St.Luke's Hospital"       , "Georgia",
  "CANON PVT. LTD."          , "Georgia",
  "SANDSTONE COOLING LTD"    , "Georgia",
  "Greg"                     , "Georgia",
  "ANTILA,THOMAS"            , "Georgia",
  "MRS BOTHAM"               , "Georgia",
  "BIG COMPANY LTD"          , "Georgia"
)

#function to return indices of approximate matches
get_amatches <- function(x, method = "osa", maxDist = 3) {
  res <- lapply(x, \(y) {
    am <- stringdist::ain(tolower(x), tolower(y), method = method, maxDist = maxDist) #approx match (incl. perf match)
    pm <- x %in% y # is perfect match
    which(am & !pm) # index of am but not pm
  })
  unique(unlist(res))
}

#return approximate matches in df
dd %>% 
  slice(get_amatches(Name)) %>% 
  arrange(Name)

Which gives the following output:


# A tibble: 6 × 2
  Name                    City   
  <chr>                   <chr>  
1 Antila,Thomas           Georgia
2 ANTILA,THOMAS           Georgia
3 CANON PVT. LTD          Georgia
4 CANON PVT. LTD.         Georgia
5 SANDSTONE COOLING LTD   Georgia
6 Z_SANDSTONE COOLING LTD Georgia

**Benoît Fayolle** · Answer 2 · 2022-11-14T09:31:29.063000

Using RecordLinkage::compare.dedup and enabling the string comparison with the strcmp argument gives good results in your case.

library(RecordLinkage)
library(tibble)

df <- tribble(~Name,~City,
  "CANON PVT. LTD"          , "Georgia" ,
  "Antila,Thomas"           , "Georgia" ,
  "Greg"                    , "Georgia" ,
  "St.Luke's Hospital"      , "Georgia" ,
  "Z_SANDSTONE COOLING LTD" , "Georgia" ,
  "St.Luke's Hospital"      , "Georgia" ,
  "CANON PVT. LTD."         , "Georgia" ,
  "SANDSTONE COOLING LTD"   , "Georgia" ,
  "Greg"                    , "Georgia" ,
  "ANTILA,THOMAS"           , "Georgia")
df_comp <- compare.dedup(df,exclude = c(2),strcmp = T)
weight_list <- epiWeights(df_comp) 
getPairs(weight_list,single.rows=T,min.weight=0.90)

	id1	Name.1	City.1	id2	Name.2	City.2	Weight
3	3	Greg	Georgia	9	Greg	Georgia	1.0000000
4	4	St.Luke's Hospital	Georgia	6	St.Luke's Hospital	Georgia	1.0000000
1	1	CANON PVT. LTD	Georgia	7	CANON PVT. LTD.	Georgia	0.9866667
5	5	Z_SANDSTONE COOLING LTD	Georgia	8	SANDSTONE COOLING LTD	Georgia	0.9710145

I used a min.weight threshold of 0.9 but first false positive appear at 0.63 which gives a good margin.

**MrSmithGoesToWashington** · Answer 3 · 2022-11-18T14:44:50.727000

For very heavy data, I would use something like this (but don't know how this solution behave compared to others solutions provided ) :

rm(list=ls())
gc()

#  packages needed
for (package in c('data.table', 'stringr', 'stringdist')) {
  if (!require(package, character.only = TRUE, quietly = TRUE)) {
    install.packages(package, dependencies = TRUE)
    library(package, character.only = TRUE)
  }
}

# Data
tbl.data <- data.table(name = c("CANON PVT. LTD", "Antila,Thomas", "Greg", "St.Luke's Hospital", 
                              "Z_SANDSTONE COOLING LTD", "St.Luke's Hospital", 
                              "CANON PVT. LTD.",
                              "SANDSTONE COOLING LTD",
                              "Greg", "ANTILA,THOMAS"))

# A string with just letters
tbl.data[, string_ := str_replace_all(tolower(str_replace_all(name, "[^[:graph:]]", "")),"[[:punct:][:digit:]]", "")]

# Create a table for comprisons ----
################################ -
tbl.a <- tbl.data[, .(string_1 = string_, names1 = name, one = 1, id1 = .I)]
tbl.b <- tbl.data[, .(string_2 = string_,  names2 =name, one = 1, id2 = .I)]
  # avoid unneeded lines :
tbl.a <- unique(tbl.a)
tbl.b <- unique(tbl.b)
  # the table of all couples :
tbl.dedoubl <- merge(tbl.a, tbl.b, by = "one",  allow.cartesian = TRUE)
  # avoid unneeded comparisons :
tbl.dedoubl <- tbl.dedoubl[id1 < id2]

# Distance calculations : ----
################################ -
tbl.dedoubl[, distance := stringdist(string_1, string_2, method = "lv")]

# Lines suspected to be doubles : ----
################################### -
tbl.dedoubl[distance <= 2]

Identify fuzzy duplicates from a single column and create a subset containing records of fuzzy duplicates using R

There are 3 best solutions below

If you know for certain that all records are duplicated, either perfectly or approximately (as in example df).

Assuming the harder case where you don't know for certain that all the records are duplicates (either exact matches or close matches), and you want to extract only the approximate duplicates.

Related Questions in R

Related Questions in DUPLICATES

Related Questions in LEVENSHTEIN-DISTANCE

Related Questions in FUZZY-COMPARISON

Related Questions in AGREP

Trending Questions

Popular # Hahtags

Popular Questions