How to match only whole words with Aho-Corasick?


Our Ruby on Rails app uses the aho_corasick gem to check whether a given text contains any word from a predefined list of bad words (these are read from a static config when the app loads).

However, this gives a few false positives. For example, if the bad word in my config is "abc", then a text containing "habcd" is also flagged, which is not the intent.

So I tried changing the config word from "abc" to " abc " (a space added before and after the word). However, this has another drawback: a text like "abc is xyz" is no longer flagged, whereas it is supposed to be. So I would have to add two more entries, "abc " and " abc", to my config, and similarly "-abc", "abc-", ":abc", etc. That makes the config pretty big, as there are many such words besides abc.

So I was wondering whether there is some kind of regular expression that I can put in my config, like [",-" "]abc[",-" "], so that all the above cases are covered and no false positives are found.

We use gem 'aho_corasick', '0.1.0', with Ruby 1.9.3 and Rails 3.2.8.

Any help is greatly appreciated. Thanks in advance!! :)


There is 1 best solution below


The simplest way to solve this problem is to use the standard implementation to get all the matches, then discard any match that does not have a word delimiter (or the start or end of the text) immediately before its first character and immediately after its last character. In the average case there won't be a significant performance hit, because there will be few matches to check.
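
As a rough sketch of that filtering step in Ruby: the gem's own API is not shown in the question, so `all_matches` below is a hypothetical stand-in that finds every occurrence with plain `String#index`. In the real app it would be replaced by whatever the Aho-Corasick matcher returns, assuming the start offset of each match can be obtained. The `WORD_CHAR` set and the helper names are likewise assumptions made for illustration.

```ruby
# Characters treated as part of a word. Anything else (space, "-", ":", ...)
# or the start/end of the text counts as a delimiter. Assumed set; adjust it.
WORD_CHAR = /[[:alnum:]_]/

# Hypothetical stand-in for the Aho-Corasick matcher: returns
# [start_offset, word] pairs for every occurrence of every word,
# including occurrences embedded inside longer words.
def all_matches(text, words)
  words.flat_map do |word|
    found = []
    pos = text.index(word)
    while pos
      found << [pos, word]
      pos = text.index(word, pos + 1)
    end
    found
  end
end

# True when the match is not directly adjacent to a word character
# on either side.
def whole_word?(text, start, length)
  before = start.zero? ? nil : text[start - 1]
  after  = text[start + length] # nil when the match ends the text
  (before.nil? || before !~ WORD_CHAR) && (after.nil? || after !~ WORD_CHAR)
end

# Keep only the matches that stand alone as whole words.
def whole_word_matches(text, words)
  all_matches(text, words).select do |start, word|
    whole_word?(text, start, word.length)
  end
end
```

With this, `whole_word_matches("habcd", ["abc"])` returns an empty array, while "abc is xyz" and "this is-abc: bad" are still flagged, without adding " abc", "abc-", ":abc" and so on to the config.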