Adding regex into a censor system for C++

I've been trying to create a censor system for the WoW emulator TrinityCore for a while now. I fill a database table (chat_filter) with 'bad words' and load them into a vector on startup; every chat line a player sends is then checked against the contents of that vector. If the line contains a bad word, the word is replaced by ** (the number of *'s will eventually be taken from a column in the database table, which is still a todo) and the player gets a punishment (a mute, for example).

Now what I'm having trouble with is how to make a proper filter. Right now you'd have to add every possible variation of a word you can think of; for example, 'a.s.s.' should also be read as 'ass', and I have no idea how to do this!

Here's the important part of the current code. I left out the DB loading because it isn't relevant here (and it lives in a different file, so including it would only make things less clear).

char* msg3 = strdup(msg.c_str());
char* words = strtok(msg3, " ,.-()&^%$#@!{}'<>/?|\\=+-_1234567890"); // This splits the sentence into separate words and removes the symbols
ObjectMgr::ChatFilterContainer const& censoredWords = sObjectMgr->GetCensoredWords();

while (words != NULL && !censoredWords.empty())
{
    for (uint32 i = 0; i < censoredWords.size(); ++i)
    {  
        if (!stricmp(censoredWords[i].c_str(), words))
        {
            sLog->outString("%s", words);
            //msg.replace(msg.begin(), msg.end(), msg.c_str(), "***");
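            // Overwrite the matched word in msg with the same number of '*'s; the
            // offset within msg3 equals the offset within msg because lengths match.
            msg.replace(words - msg3, strlen(words), strlen(words), '*');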
        }
        //msg.replace(msg.begin(), msg.end(), censoredWords[i].c_str(), /*replacement*/ "***");
        //msg.replace(msg.find(censoredWords[i].c_str()), censoredWords.size(), 
    }

    words = strtok(NULL, " ,.-()&^%$#@!{}'<>/?|\\=+-_1234567890");
}

Thanks in advance,

Jasper

P.S. 'GetCensoredWords' returns the vector.

P.P.S. 'msg' is a std::string - it's the ACTUAL message the player sent.

There is 1 answer below

I would use std::string rather than char*, so that memory management is automatic. That would fix the memory leak in your example code (the buffer returned by strdup is never freed). Boost.Algorithm provides a powerful boost::algorithm::split function, which is much better than strtok.
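As a rough sketch of the std::string approach, here is one way the splitting could look using only the standard library instead of Boost. The helper name splitWords is made up for illustration, and you would pass whatever delimiter set suits your filter:

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split `msg` into words at any character in `delims`.
// Runs of consecutive delimiters produce no empty tokens, similar to strtok.
std::vector<std::string> splitWords(const std::string& msg,
                                    const std::string& delims)
{
    std::vector<std::string> words;
    std::string::size_type start = msg.find_first_not_of(delims);
    while (start != std::string::npos)
    {
        // `end` is npos when the last word runs to the end of the message;
        // substr clamps the count, so that case needs no special handling.
        std::string::size_type end = msg.find_first_of(delims, start);
        words.push_back(msg.substr(start, end - start));
        start = msg.find_first_not_of(delims, end);
    }
    return words;
}
```

Because every token is its own std::string, nothing has to be freed by hand and the original message is left untouched.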

It's not feasible to store every possible permutation of each censored word, especially if you're going to loop over the whole set of words for every input. If you want to censor "fubar" you'd have to store "Fubar" and "FUbar" and "FuBaR" and "fub4r" and "F.U.B.A.R" and "f.u.b.a.r" etc. etc.

Instead you could store each censored word only once, in a normalised form, e.g. "fubar", then convert each word of input to the normalised form. So if the user enters "F-u-B-a-R" you normalise it to "fubar", and then you can do a simple lookup in the set of censored words (which can use an associative container, so the lookup is O(log n) or even O(1)).
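A minimal sketch of that idea, assuming C++11 and ASCII input. The names normalise and isCensored are invented for this example, and the rule used here (keep letters only, lower-cased) is just one possible normalisation; it does not map digit substitutions such as '4' back to 'a':

```cpp
#include <cctype>
#include <string>
#include <unordered_set>

// Hypothetical helper: reduce a word to its normalised form by dropping
// everything that is not a letter and lower-casing what remains.
std::string normalise(const std::string& word)
{
    std::string out;
    for (unsigned char c : word)  // unsigned char keeps the <cctype> calls well-defined
    {
        if (std::isalpha(c))
            out += static_cast<char>(std::tolower(c));
    }
    return out;
}

// Hypothetical helper: average O(1) lookup of the normalised word.
// `censored` is assumed to hold the bad words already in normalised form.
bool isCensored(const std::string& word,
                const std::unordered_set<std::string>& censored)
{
    return censored.count(normalise(word)) != 0;
}
```

With a set containing "fubar", both "F-u-B-a-R" and "f.u.b.a.r" are caught. Note that this only works if the input is split on whitespace alone: splitting on '.' or '-' first, as the strtok call in the question does, would break "f.u.b.a.r" into single letters before it could be normalised.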