Return vector of words matched with fuzzy matching

Question

816 Views Asked by Jaccar At 12 July 2019 at 11:02

I am using agrepl() to filter a data.table by fuzzy matching a word. This is working fine for me, using something like this:

 library(data.table)
 data <- as.data.table(iris)
 pattern <- "setosh"
 dt <- data[, lapply(.SD, function(x) agrepl(paste0("\\b(", pattern, ")\\b"), x, fixed = FALSE, ignore.case = TRUE))] 
 data <- data[rowSums(dt) > 0]
 head(data)

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:          5.1         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa
4:          4.6         3.1          1.5         0.2  setosa
5:          5.0         3.6          1.4         0.2  setosa
6:          5.4         3.9          1.7         0.4  setosa

Looking at this, you can see that "setosh" has been fuzzy matched to "setosa". What I want is a vector of the words that were matched to "setosh". It is not relevant in this example, but if the data had included another category like "seposh", that would have matched too, so the vector would be c("setosa", "seposh").

EDIT:

Thanks for the answer below - I can see how it's possible to isolate the values where the fuzzy matching occurs when just looking at a vector, but my issues are:

I only want the string that has matched, not the entire value.
I'm having trouble replicating this over my data.table.

For example, if I change a value to make this point clearer:

data <- as.data.table(iris)
data[Species == "versicolor", Species := "setosh species"] # changing a value so it would match
pattern <- "setosh"

dt <- data[, lapply(.SD, function(x) agrep(paste0("\\b(", pattern, ")\\b"), x, value = TRUE, fixed = FALSE, ignore.case = TRUE))] 
Warning messages:
1: In as.data.table.list(jval) :
  Item 1 is of size 0 but maximum size is 100, therefore recycled with 'NA'
2: In as.data.table.list(jval) :
  Item 2 is of size 0 but maximum size is 100, therefore recycled with 'NA'
3: In as.data.table.list(jval) :
  Item 3 is of size 0 but maximum size is 100, therefore recycled with 'NA'
4: In as.data.table.list(jval) :
  Item 4 is of size 0 but maximum size is 100, therefore recycled with 'NA'

unique(dt)
          Species
1:         setosa
2: setosh species

You can see that I haven't got the result in a vector, and that the result includes the full value "setosh species" rather than just "setosh" (as the part that matched).

Hope that's more helpful!

There are 2 best solutions below

**January** · Answer 1 · 2019-07-12T11:45:30.920000

Just use the output of agrep as an index for a character vector you are grepping.

vec <- c("setosh", "setosz", "sethosz", "etosh", "ethos", "seosh")
idx <- agrep("setosh", vec) # agrepl works as well (it returns a logical index)
vec[idx]

result:

[1] "setosh" "setosz" "etosh"  "seosh"
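As a small sketch of the indexing idea above, here are three equivalent ways to get the matching elements from the same `vec`. The names `by_index`, `by_logical` and `by_value` are only illustrative; `value = TRUE` is a documented argument of `agrep`.

```r
vec <- c("setosh", "setosz", "sethosz", "etosh", "ethos", "seosh")

# Integer positions of the fuzzy matches, used as an index
by_index <- vec[agrep("setosh", vec)]

# Logical vector (one TRUE/FALSE per element), used as an index
by_logical <- vec[agrepl("setosh", vec)]

# Let agrep return the matching elements directly
by_value <- agrep("setosh", vec, value = TRUE)

identical(by_index, by_logical)
identical(by_index, by_value)
```

All three return whole elements of `vec`, not the part of each string that matched.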

EDIT: OK, but what if we want as result only the matched string? Not the whole thing, but just the part that was matched? Then we are in for a bit of fun, because grep/grepl and agrep/agrepl don't work that way. Luckily, there is the aregexec function.

vec <- c("setosh is my name", "setosz", "sethosz who", 
         "what etosh", "ethos", "seosh", "funk setos brother")
matches <- aregexec("setosh", vec)

matches now contains a list with one element for each element of vec. Each element of this list contains a single number – start of the match – with an attribute match.length:

> matches[[1]]
[1] 1
attr(,"match.length")
[1] 6

We can use these numbers to extract the matched strings.

library(purrr)
starts <- unlist(matches)
ends <- starts - 1 + map_int(matches, ~ attr(., "match.length"))
res <- substr(vec, starts, ends)
res[ starts < 0 ] <- NA
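If you would rather not depend on purrr, the same steps can be wrapped in a base R helper. This is only a sketch: `extract_fuzzy` is a made-up name, and it assumes the pattern contains no parenthesised groups, so that each element returned by `aregexec` holds a single start position.

```r
# Hypothetical helper: return the fuzzy-matched substring for each element
# of x, or NA where nothing matched. Extra arguments go to aregexec().
extract_fuzzy <- function(pattern, x, ...) {
  x <- as.character(x)  # also handles factors
  m <- aregexec(pattern, x, ...)

  # Start of each match (-1 where there is no match)
  starts <- vapply(m, function(el) as.integer(el[1]), integer(1))
  # Length of each match, stored in the "match.length" attribute
  lens <- vapply(m, function(el) as.integer(attr(el, "match.length")[1]),
                 integer(1))

  res <- substr(x, starts, starts + lens - 1L)
  res[starts < 0] <- NA_character_
  res
}

vec <- c("setosh is my name", "setosz", "sethosz who",
         "what etosh", "ethos", "seosh", "funk setos brother")
res <- extract_fuzzy("setosh", vec)
res
```

The result has one entry per element of `vec`, so it can be lined up against the original strings before dropping the NAs.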

FINAL EDIT:

I am not sure what this business with grepping all columns of iris is about, but to get a vector of matched substrings in the Species column I would do the following:

vec <- as.character(data$Species) # Species is a factor in iris
matches <- aregexec("setosh", vec)
starts <- unlist(matches)
ends <- starts - 1 + map_int(matches, ~ attr(., "match.length"))
res <- substr(vec, starts, ends)
res[ starts < 0 ] <- NA

With res, we can do Stuff. We can remove the NAs and take a look at the unique values:

res <- res[ !is.na(res) ]
unique(res)

Result:

[1] "setosa" "setosh"

FINAL FINAL EDIT: It appears that the example chosen by the OP was not exactly what they had in mind. Thus, we are going to make another example.

vec <- c("setosh is my name", "setosz", "sethosz who", 
         "what etosh", "ethos", "seosh", "funk setos brother")
data <- data.table(matrix(sample(vec, 100, replace = TRUE), ncol = 5))

data is now a data.table and in each column there are numerous things to match. If we only want to know what kind of matches are there, and we don't need to know in which columns and rows these matches were found, and we want to search through all columns, then we don't need it to be a two-dimensional object. Better make it a vector:

vec <- unlist(data)

OK, but if all that you want is to get the unique matches, we can simplify it even further:

vec <- unique(vec)

Now we have a character vector. If you use aregexec to find the matches and extract them as described above, you will end up with a character vector in which:

the values are unique
the values are the substrings that were actually matched, not the whole strings
only the matched substrings are returned

The output will be:

[1] "setosh" "setosz" "setos " "seosh"  " etosh"

**AudioBubble** · Answer 2 · 2019-07-13T08:04:10.563000

If I understand you correctly you really just want to extract a fuzzy match from strings. It sounds like there is also some issue with doing this with a dataframe and returning a vector, but I think it becomes much simpler once you've successfully extracted the matching substrings.

I'll use the following toy data:

library(data.table)
set.seed(123)
data <-
    as.data.table(matrix(sample(c("setosa", "blah seposa", "blah setosh blah",
                                  "bleh versicolor", "bluh sep", "bloh"),
                                15, replace = TRUE),
                         ncol = 3))

This gives the following data.table:

                 V1               V2               V3
1: blah setosh blah             bloh             bloh
2:             bloh blah setosh blah           setosa
3: blah setosh blah         bluh sep      blah seposa
4:      blah seposa  bleh versicolor blah setosh blah
5:      blah seposa             bloh         bluh sep

January has already pointed out that you can use aregexec to get the position of a fuzzy match in a character string. You can extract the match by passing aregexec's output into regmatches. We can do this for each column of our data using lapply:

data[, lapply(.SD, function(colu) {
    regmatches(colu, aregexec("setosh", colu, max.distance = 2))
})]

This returns a data.table of list columns, in which each cell holds either the extracted fuzzy-matched substring or nothing (an empty character vector) if there was no match. Depending on the results you get with your real data, you may need to adjust max.distance to tweak the fuzziness of the match:

       V1     V2     V3
1: setosh              
2:        setosh setosa
3: setosh        seposa
4: seposa        setosh
5: seposa
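Since the question asks for a vector rather than a table, the per-column results can be flattened and de-duplicated. Below is a minimal sketch under some assumptions: `fuzzy_words` and `toy` are hypothetical names, and a plain data.frame is used so the example needs no packages. `lapply` should behave the same way on a data.table, because both are lists of columns.

```r
# Hypothetical helper: collect every fuzzy-matched substring across all
# columns into one de-duplicated character vector.
fuzzy_words <- function(data, pattern, max.distance = 0.1) {
  per_column <- lapply(data, function(colu) {
    colu <- as.character(colu)  # factors and numbers become strings
    regmatches(colu, aregexec(pattern, colu, max.distance = max.distance))
  })
  # Cells without a match are character(0), so unlist() drops them
  unique(as.character(unlist(per_column, use.names = FALSE)))
}

toy <- data.frame(
  V1 = c("blah setosh blah", "bloh", "blah seposa"),
  V2 = c("bloh", "setosa", "bleh versicolor"),
  stringsAsFactors = FALSE
)

# Default of 0.1 allows at most 1 edit for a 6-letter pattern
strict <- fuzzy_words(toy, "setosh")
# An integer max.distance is read as an absolute number of edits
loose <- fuzzy_words(toy, "setosh", max.distance = 2)

strict
loose
```

Comparing `strict` and `loose` on your own data is a quick way to see how much max.distance changes what counts as a match.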
