Our Ruby on Rails app uses the aho_corasick gem to check whether a given text contains any word from a predefined list of bad words (the list is read from a static config when the app loads).
However, this produces some false positives. For example, if a bad word in my config is "abc", then text containing "habcd" is also flagged, which is not the intent.
So I tried changing the config word from "abc" to " abc " (a space added before and after the word). However, this has another drawback: a text like "abc is xyz" is no longer flagged, whereas it should be. So I would have to add two more entries, "abc " and " abc", to my config. Similarly, I would need to add "-abc", "abc-", ":abc", and so on. That makes the config pretty big, because there are many such words besides "abc".
So I was wondering whether there is some kind of regular expression I could put in my config, something like [",-" "]abc[",-" "], so that all of the above cases are covered and no false positives are reported.
We use gem 'aho_corasick', '0.1.0', with Ruby 1.9.3 and Rails 3.2.8.
Any help is greatly appreciated. Thanks in advance!! :)
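The simplest way to solve this problem is to use the standard implementation to get all the matches, then discard every match that is not delimited on both sides, that is, any match whose first character is not preceded by a word delimiter (or the start of the text) or whose last character is not followed by one (or the end of the text). In the average case, there won't be a significant performance hit, because you will have few matches to re-check.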
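As a rough sketch of that filtering step, something like the following could work. It is not tied to the gem's API: I am assuming you already have the list of words the Aho-Corasick matcher reported for the text, and `whole_word_matches` is a hypothetical helper name of my own. It also treats any non-alphanumeric character as a delimiter, which is my assumption about what you want (so "abc_def" would still be flagged for "abc").

```ruby
# Hypothetical helper (not part of the aho_corasick gem).
# text       - the string that was scanned
# candidates - words the matcher reported as found in text (assumed input)
# Returns only the candidates that occur in text as a whole word.
def whole_word_matches(text, candidates)
  candidates.uniq.select do |word|
    # The word must not be preceded or followed by a letter or digit.
    # Start and end of the text count as delimiters automatically.
    pattern = /(?<![[:alnum:]])#{Regexp.escape(word)}(?![[:alnum:]])/
    text =~ pattern
  end
end

bad_words = ["abc"]
whole_word_matches("habcd", bad_words)       # => []
whole_word_matches("abc is xyz", bad_words)  # => ["abc"]
whole_word_matches("foo-abc: bar", bad_words) # => ["abc"]
```

Because the regular expression only runs for words the matcher has already reported, the bulk of the work is still done by the Aho-Corasick scan, and the config can keep plain words such as "abc" without any delimiter variants.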