Fuzzy logic for a string in R

I have two data frames. DF1:

ID   Address
AB1  VILL +PO CHAPAR TAPUKADA  ALWAR
AB2  VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POST BADANA  0 SIROHI
AB3  RAMKUMAR YADAV VILL  KANSL   0 JAIPUR
AB4  VILL KHERKI MUKKER  POSTPANIYA PUTLI   JAIPUR

and DF2:

Name
CHHAPPAR
CHHAPAR
KANSAL
KANSIL
KANSOL
KHERK
KHERKIA
PAR
UR
WAR
RIYA
DAV
LI

I want to fuzzy-match the words in each DF1 address against the names in DF2. Where a word in a DF1 address approximately matches a name in DF2, I want the matching DF2 name(s) returned.

The output should look like this:

ID   Address                                                                 Name
AB1  VILL +PO CHAPAR TAPUKADA  ALWAR                                         CHHAPPAR, CHHAPAR
AB2  VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POST BADANA  0 SIROHI
AB3  RAMKUMAR YADAV VILL  KANSL   0 JAIPUR                                   KANSAL, KANSIL, KANSOL
AB4  VILL KHERKI MUKKER  POSTPANIYA PUTLI   JAIPUR                           KHERK, KHERKIA

I tried the fuzzywuzzyR package, but it gave an error.

I also tried agrep, but it only gives me TRUE/FALSE results.

Please help me out with this. Also, should I try other packages for fuzzy matching?

Answer by JBGruber:

I would use the fuzzyjoin package for this, together with tidytext to split each address into individual words before matching:

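library(tidytext)   # unnest_tokens()
library(fuzzyjoin)  # stringdist_left_join()
library(tidyverse)  # dplyr verbs and the %>% pipe

df1 %>% 
  # one row per word of each address, keeping the original upper case
  unnest_tokens(word, Address, to_lower = FALSE) %>% 
  # attach every df2 name within string distance 1 of the word;
  # a word with several matches is repeated once per match
  fuzzyjoin::stringdist_left_join(df2, by = c("word" = "Name"), max_dist = 1) %>% 
  group_by(ID) %>% # collapse unnested tokens back to text if you want
  summarise(text = paste(word, collapse = " "),
            Name = toString(na.omit(Name)))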
#> # A tibble: 4 x 3
#>   ID    text                                                 Name               
#>   <chr> <chr>                                                <chr>              
#> 1 AB1   VILL PO CHAPAR TAPUKADA ALWAR                        "CHHAPAR"          
#> 2 AB2   VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POS~ ""                 
#> 3 AB3   RAMKUMAR YADAV VILL KANSL KANSL KANSL 0 JAIPUR       "KANSAL, KANSIL, K~
#> 4 AB4   VILL KHERKI KHERKI MUKKER POSTPANIYA PUTLI JAIPUR    "KHERK, KHERKIA"
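
Note that AB1 only gets CHHAPAR, not CHHAPPAR as in the requested output. A quick way to see why is to look at the edit distances directly. The sketch below uses base R's adist() (Levenshtein distance) as a stand-in for the distance fuzzyjoin computes; as far as I know the two agree for these strings, but treat that as an assumption. The vectors tokens and candidates are just illustrative names.

```r
# Words taken from the addresses and the candidate names from df2
tokens     <- c("CHAPAR", "KANSL", "KHERKI")
candidates <- c("CHHAPPAR", "CHHAPAR", "KANSAL", "KANSIL",
                "KANSOL", "KHERK", "KHERKIA")

# Matrix of edit distances: one row per token, one column per candidate
d <- adist(tokens, candidates)
dimnames(d) <- list(tokens, candidates)

# CHAPAR needs two insertions (H and P) to become CHHAPPAR,
# but only one (H) to become CHHAPAR
d["CHAPAR", c("CHHAPPAR", "CHHAPAR")]

# Which candidates fall within max_dist = 1 of each token?
apply(d <= 1, 1, function(hit) toString(candidates[hit]))
```

So with max_dist = 1 CHHAPPAR is out of reach. Setting max_dist = 2 should bring it in, probably at the cost of more false positives on short words.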

Data:

df1 <- read.csv(text = "ID,Address
AB1,VILL +PO CHAPAR TAPUKADA  ALWAR
AB2,VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POST BADANA  0 SIROHI
AB3,RAMKUMAR YADAV VILL  KANSL   0 JAIPUR
AB4,VILL KHERKI MUKKER  POSTPANIYA PUTLI   JAIPUR", stringsAsFactors = FALSE)

df2 <- read.csv(text = "Name
CHHAPPAR
CHHAPAR
KANSAL
KANSIL
KANSOL
KHERK
KHERKIA", stringsAsFactors = FALSE)
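
If you would rather keep the original Address column untouched and avoid the repeated words in the collapsed text, roughly the same matching can be sketched in base R with adist(). This is a different technique from the fuzzyjoin pipeline, not a drop-in equivalent. The helper fuzzy_names() and its arguments candidates and max_dist are hypothetical names made up for this example.

```r
df1 <- data.frame(
  ID = c("AB1", "AB2", "AB3", "AB4"),
  Address = c(
    "VILL +PO CHAPAR TAPUKADA  ALWAR",
    "VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POST BADANA  0 SIROHI",
    "RAMKUMAR YADAV VILL  KANSL   0 JAIPUR",
    "VILL KHERKI MUKKER  POSTPANIYA PUTLI   JAIPUR"
  ),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Name = c("CHHAPPAR", "CHHAPAR", "KANSAL", "KANSIL",
           "KANSOL", "KHERK", "KHERKIA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the candidates that lie within max_dist
# edits of at least one word of the address, as one comma-separated string
fuzzy_names <- function(address, candidates, max_dist = 1) {
  # split on runs of non-alphanumeric characters and drop empty pieces
  tokens <- unique(strsplit(address, "[^[:alnum:]]+")[[1]])
  tokens <- tokens[nzchar(tokens)]
  # edit distances: one row per token, one column per candidate
  d <- adist(tokens, candidates)
  toString(candidates[colSums(d <= max_dist) > 0])
}

df1$Name <- vapply(df1$Address, fuzzy_names, character(1),
                   candidates = df2$Name, USE.NAMES = FALSE)
df1[, c("ID", "Name")]
```

Rows without any match get an empty string in Name, which mirrors the blank cell for AB2 in the requested output.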