Google RE2 doesn't recognize Unicode escape in regex


I am developing a C++ application that validates configuration files with regular expressions, using the Google RE2 library. The contents of each configuration file are read into a std::string.

So far, I have declared this string, which contains the regular expression:

const string EXPR_FAILED_FILE(R"([^\u0020-\u007E\n]|(\b.*(Mensagem|Antes|Loop|Movimentar|\|).*)|\\[0-9]{3,4})");

However, with the implementation below I am having trouble detecting invalid characters in my test string (strInput):

bool checkStringConsistency(const string& strInput){
    RE2 re(EXPR_FAILED_FILE);
    bool b_matches = RE2::FullMatch(strInput, re);
    return b_matches;
}

When I run the code, I get these messages on stderr:

re2/re2.cc:205: Error parsing '[^\u0020-\u007E\n]|(\b.*(Mensagem|Antes|Loop|Movimentar|\|).*)|\\[0-9]{3,4}': invalid escape sequence: \u
re2/re2.cc:890: Invalid RE2: invalid escape sequence: \u

It seems that RE2 does not recognize the \u sequence, which I use to specify a range of Unicode characters. I tested this expression at regexr.com, and the invalid characters were detected correctly there.

What could be wrong here?

Best answer

Each regex engine has its own syntax. In RE2, you need to write [^\x{0020}-\x{007E}\n] instead of [^\u0020-\u007E\n]. See the RE2 syntax document:

Escape sequences:
\a  bell (== \007)
\f  form feed (== \014)
\t  horizontal tab (== \011)
\n  newline (== \012)
\r  carriage return (== \015)
\v  vertical tab character (== \013)
\*  literal «*», for any punctuation character «*»
\123    octal character code (up to three digits)
\x7F    hex character code (exactly two digits)
\x{10FFFF}  hex character code
\C  match a single byte even in UTF-8 mode
\Q...\E literal text «...» even if «...» has punctuation

The same document lists \u as an escape that matches an uppercase character and marks it NOT SUPPORTED, which is why RE2 rejects it with "invalid escape sequence: \u".
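
If a pattern was written for a JavaScript-flavored tester such as regexr.com, the \uXXXX escapes can be rewritten mechanically before the string is passed to RE2. The following is only a minimal sketch: convertUnicodeEscapes is a hypothetical helper, not part of RE2, and it assumes every Unicode escape has exactly four hex digits. It uses only the standard library.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of RE2): rewrites JavaScript-style \uXXXX
// escapes into the \x{XXXX} form listed in the RE2 syntax document.
std::string convertUnicodeEscapes(const std::string& pattern) {
    std::string out;
    out.reserve(pattern.size() + 8);
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        // Ordinary characters (and a trailing lone backslash) are copied as-is.
        if (pattern[i] != '\\' || i + 1 >= pattern.size()) {
            out += pattern[i];
            continue;
        }
        // Check for a backslash, then 'u', then exactly four hex digits.
        bool isUnicodeEscape = pattern[i + 1] == 'u' && i + 5 < pattern.size();
        for (std::size_t k = i + 2; isUnicodeEscape && k < i + 6; ++k) {
            isUnicodeEscape =
                std::isxdigit(static_cast<unsigned char>(pattern[k])) != 0;
        }
        if (isUnicodeEscape) {
            out += "\\x{" + pattern.substr(i + 2, 4) + "}";
            i += 5;  // skip 'u' and the four digits; the loop skips one more
        } else {
            // Copy any other escape pair verbatim, so an escaped backslash
            // followed by "u0020" is not mistaken for a Unicode escape.
            out += pattern[i];
            out += pattern[i + 1];
            i += 1;
        }
    }
    return out;
}
```

With this sketch, the constant from the question could stay as written and be converted once, for example RE2 re(convertUnicodeEscapes(EXPR_FAILED_FILE)); writing \x{0020}-\x{007E} directly in the pattern is simpler when the pattern is under your control.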