I am using agrepl() to filter a data.table by fuzzy matching a word. This is working fine for me, using something like this:
library(data.table)
data <- as.data.table(iris)
pattern <- "setosh"
dt <- data[, lapply(.SD, function(x) agrepl(paste0("\\b(", pattern, ")\\b"), x, fixed = FALSE, ignore.case = TRUE))]
data <- data[rowSums(dt) > 0]
head(data)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1:          5.1         3.5          1.4         0.2  setosa
2:          4.9         3.0          1.4         0.2  setosa
3:          4.7         3.2          1.3         0.2  setosa
4:          4.6         3.1          1.5         0.2  setosa
5:          5.0         3.6          1.4         0.2  setosa
6:          5.4         3.9          1.7         0.4  setosa
Looking at this, you can see that "setosh" has been fuzzy matched to "setosa". What I want is a vector of the words that were matched to "setosh". It isn't relevant in this example, but if the data had included another category such as "seposh", that would have matched too, and the vector would be c("setosa", "seposh").
EDIT:
Thanks for the answer below - I can see how it's possible to isolate the values where the fuzzy matching occurs when just looking at a vector, but my issues are:
- I only want the string that has matched, not the entire value.
- I'm having trouble replicating this over my data.table.
For example, if I change a value to make this point clearer:
data <- as.data.table(iris)
data[Species == "versicolor", Species := "setosh species"] # changing a value so it would match
pattern <- "setosh"
dt <- data[, lapply(.SD, function(x) agrep(paste0("\\b(", pattern, ")\\b"), x, value = TRUE, fixed = FALSE, ignore.case = TRUE))]
Warning messages:
1: In as.data.table.list(jval) :
Item 1 is of size 0 but maximum size is 100, therefore recycled with 'NA'
2: In as.data.table.list(jval) :
Item 2 is of size 0 but maximum size is 100, therefore recycled with 'NA'
3: In as.data.table.list(jval) :
Item 3 is of size 0 but maximum size is 100, therefore recycled with 'NA'
4: In as.data.table.list(jval) :
Item 4 is of size 0 but maximum size is 100, therefore recycled with 'NA'
unique(dt)
Species
1: setosa
2: setosh species
You can see that I haven't got the result in a vector, and that the result includes the full value "setosh species" rather than just "setosh" (as the part that matched).
Hope that's more helpful!
Just use the output of agrep as an index into the character vector you are grepping.
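As a rough sketch of that indexing idea (the vector `vec` below is a made-up example, not data from the question, and the default `max.distance` is assumed to be acceptable):

```r
# Hypothetical example vector; not taken from the original post
vec <- c("setosa", "versicolor", "seposh", "virginica")
pattern <- "setosh"

# agrep() returns the positions of the elements that fuzzily match the pattern
idx <- agrep(pattern, vec, ignore.case = TRUE)

# Use those positions to pull out the matching values
matched_values <- vec[idx]
print(matched_values)

# Shortcut: value = TRUE makes agrep() return the values directly
print(agrep(pattern, vec, ignore.case = TRUE, value = TRUE))
```

Note that this returns the whole matching elements, not just the part of each element that matched.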
EDIT: OK, but what if we want as a result only the matched string? Not the whole thing, but just the part that was matched? Then we are in for a bit of fun, because grep/grepl and agrep/agrepl don't work that way. Luckily, there is the aregexec function.
matches now contains a list with one element for each element of vec. Each element of this list contains a single number (the start of the match) with an attribute match.length.
We can use these numbers to extract the matched strings.
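A minimal sketch of this step, assuming a made-up `vec` and a plain pattern without parenthesized groups (groups would add extra entries per match). The names `found` and `hits` are only illustrative:

```r
# Hypothetical example vector; not taken from the original post
vec <- c("setosa", "versicolor", "setosh species")
pattern <- "setosh"

# aregexec() returns a list with one entry per element of vec.
# Each entry holds the start position of the match and carries
# the length of the match in its "match.length" attribute.
matches <- aregexec(pattern, vec, ignore.case = TRUE)
str(matches)

# regmatches() uses the start positions and lengths to cut out
# the matched substrings; elements without a match give character(0)
found <- regmatches(vec, matches)

# Flatten to a plain character vector of matched substrings
hits <- unlist(found)
print(hits)
```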
FINAL EDIT:
I am not sure what this business with grepping all columns of iris is about, but to get a vector of matched substrings in the Species column I would do the following:
With res, we can do Stuff. We can remove the NA's and take a look at unique values:
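One way this could look, as a sketch rather than the exact code of the answer. It uses the base `iris` data frame so that it runs without extra packages; the same calls should work on the Species column of the data.table from the question. `res` gets one entry per row, with NA where nothing matched:

```r
pattern <- "setosh"

# Work on a plain character vector (Species is a factor in iris)
species <- as.character(iris$Species)

# Locate the approximate matches in every element
m <- aregexec(pattern, species, ignore.case = TRUE)

# One entry per row: the matched substring, or NA if there was no match
res <- vapply(
  regmatches(species, m),
  function(s) if (length(s) > 0) s[1] else NA_character_,
  character(1)
)

# Remove the NAs and look at the unique matched substrings
print(unique(res[!is.na(res)]))
```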
Result:
FINAL FINAL EDIT: It appears that the example chosen by the OP was not exactly what they had in mind. Thus, we are going to make another example.
data is now a data.table, and in each column there are numerous things to match. If we only want to know what kind of matches are there, and we don't need to know in which columns and rows these matches were found, and we want to search through all columns, then we don't need it to be a two-dimensional object. Better make it a vector:
OK, but if all that you want is to get the unique matches, we can simplify it even further:
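The flattening described here might be sketched as follows. The table `df` is a hypothetical stand-in (a base data frame, so no packages are needed); since a data.table is also a list of columns, the same `lapply`/`unlist` calls should apply to it:

```r
# Hypothetical stand-in for the example table; the values are made up
df <- data.frame(
  a = c("setosh species", "versicolor", "virginica"),
  b = c("some seposh thing", "setosa", "versicolor"),
  stringsAsFactors = FALSE
)

# Collapse all columns into a single character vector;
# row and column positions are deliberately discarded
all_values <- unlist(lapply(df, as.character), use.names = FALSE)

# If only the distinct matches matter, drop duplicates before searching
unique_values <- unique(all_values)
print(unique_values)
```

Deduplicating first also means the fuzzy search runs on fewer strings.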
Now we have a character vector. If you now use aregexec to find your matches and extract the matches as described above, you will end up with a character vector which
The output will be:
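Putting the pieces together, a hedged sketch of the whole extraction as a small helper. The function name `fuzzy_matches` and the input vector `x` are hypothetical, and which exact substring is reported for an approximate match depends on the matching engine, so no output is shown here:

```r
# Hypothetical helper: returns the unique substrings of x that
# approximately match pattern (default max.distance of aregexec)
fuzzy_matches <- function(pattern, x, ...) {
  x <- as.character(x)
  m <- aregexec(pattern, x, ignore.case = TRUE, ...)
  # Elements without a match contribute nothing after unlist()
  unique(unlist(regmatches(x, m)))
}

# Made-up character vector, e.g. the flattened values of a table
x <- c("setosh species", "versicolor", "some seposh thing", "setosa")

result <- fuzzy_matches("setosh", x)
print(result)
```

Extra arguments such as `max.distance` can be passed through `...` to loosen or tighten the matching.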