Regex character class subtraction in C++


I'm writing a C++ program that needs to take regular expressions defined in an XML Schema file and use them to validate XML data. The problem is that the flavor of regular expressions used by XML Schema does not seem to be directly supported in C++.

For example, there are a couple of special character classes, \i and \c, that are not defined by default. The XML Schema regex language also supports something called "character class subtraction", which does not seem to be supported in C++ either.

Allowing the \i and \c special character classes is pretty simple: I can just look for "\i" or "\c" in the regular expression and replace them with their expanded versions. Getting character class subtraction to work is a much more daunting problem...

For example, this regular expression, which is valid in an XML Schema definition, throws an exception in C++ saying it has unbalanced square brackets.

#include <iostream>
#include <regex>

int main()
{
    try
    {
        // Match any lowercase letter that is not a vowel
        std::regex rx("[a-z-[aeiuo]]");
    }
    catch (const std::regex_error& ex)
    {
        std::cout << ex.what() << std::endl;
    }
}

How can I get C++ to recognize character class subtraction within a regex? Or even better, is there a way to just use the XML Schema flavor of regular expressions directly within C++?


There are 4 best solutions below

tjwrona On BEST ANSWER

Okay, after going through the other answers, I tried out a few different things and ended up using the xmlRegexp functionality from libxml2.

The xmlRegexp functions are very poorly documented, so I figured I would post an example here because others may find it useful:

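#include <iostream>
#include <libxml/xmlmemory.h>
#include <libxml/xmlregexp.h>
#include <libxml/xmlstring.h>

int main()
{
    LIBXML_TEST_VERSION;

    xmlChar* str = xmlCharStrdup("bcdfg");
    xmlChar* pattern = xmlCharStrdup("[a-z-[aeiou]]+");
    xmlRegexp* regex = xmlRegexpCompile(pattern);

    // xmlRegexpCompile returns NULL if the pattern cannot be compiled;
    // xmlRegexpExec returns 1 when the string matches
    if (regex != NULL && xmlRegexpExec(regex, str) == 1)
    {
        std::cout << "Match!" << std::endl;
    }

    // Release libxml2 objects with libxml2's own deallocators, not free()
    xmlRegFreeRegexp(regex);
    xmlFree(pattern);
    xmlFree(str);
}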

Output:

Match!

I also attempted to use XMLString::patternMatch from the Xerces-C++ library, but it didn't seem to use an XML Schema compliant regex engine underneath. (Honestly, I have no idea what regex engine it uses. The documentation was pretty abysmal and I couldn't find any examples online, so I gave up on it.)

Acorn On

Character class subtraction and intersection are not available in any of the grammars supported by std::regex, so you will have to rewrite the expression into a form that one of them supports.

The easiest way is to perform the subtraction yourself and pass the resulting set to std::regex, for instance [bcdfghjklmnpqrstvwxyz] for your example.

Another solution is to find either a more featureful regular expression engine or a dedicated XML library that supports XML Schema and its regular expression language.

Bob On

Starting from the cppreference examples:

#include <iostream>
#include <regex>
 
void show_matches(const std::string& in, const std::string& re)
{
    std::smatch m;
    std::regex_search(in, m, std::regex(re));
    if(m.empty()) {
        std::cout << "input=[" << in << "], regex=[" << re << "]: NO MATCH\n";
    } else {
        std::cout << "input=[" << in << "], regex=[" << re << "]: ";
        std::cout << "prefix=[" << m.prefix() << "] ";
        for(std::size_t n = 0; n < m.size(); ++n)
            std::cout << " m[" << n << "]=[" << m[n] << "] ";
        std::cout << "suffix=[" << m.suffix() << "]\n";
    }
}
 
int main()
{
    // greedy match of 2 to 4 consecutive lowercase consonants
    show_matches("abcdefghi", "(?:(?![aeiou])[a-z]){2,4}");
}

You can test the regular expression and check its details here.

The non-capturing group (?: ...) is used so that it does not change your group numbering if you use it inside a bigger regular expression.

(?![aeiou]) is a negative lookahead: it succeeds, without consuming any input, only if the next character is not in [aeiou]. The [a-z] then matches a lowercase letter. Combining these two conditions is equivalent to your character class subtraction.

The {2,4} is a quantifier meaning from 2 to 4 repetitions; it could also be + for one or more, or * for zero or more.
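As a quick sanity check of that equivalence, the sketch below lists every lowercase ASCII letter that a pattern matches on its own. The helper name matching_letters is hypothetical; it only uses std::regex_match with the default ECMAScript grammar:

```cpp
#include <regex>
#include <string>

// Returns every lowercase ASCII letter that the pattern matches as a whole.
std::string matching_letters(const std::string& pattern)
{
    const std::regex rx(pattern); // ECMAScript grammar by default
    std::string result;
    for (char c = 'a'; c <= 'z'; ++c) {
        if (std::regex_match(std::string(1, c), rx)) {
            result += c;
        }
    }
    return result;
}
```

Calling matching_letters("(?:(?![aeiou])[a-z])") and matching_letters("[bcdfghjklmnpqrstvwxyz]") should give the same set of letters, namely the 21 consonants.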

Edit

Reading the comments on the other answer, I understand that you want to support XML Schema.

The next program shows how to use an ECMAScript regular expression to translate "character class subtraction" into an ECMAScript-compatible format.

#include <iostream>
#include <regex>
#include <string>
#include <vector>

std::string translated_regex(const std::string &pattern){
    // pattern to identify character class subtraction
    std::regex class_subtraction_re(
       "\\[((?:\\\\[\\[\\]]|[^[\\]])*)-\\[((?:\\\\[\\[\\]]|[^[\\]])*)\\]\\]"
    );
    // translate the regular expression to ECMA compatible
    std::string translated = std::regex_replace(pattern, 
       class_subtraction_re, "(?:(?![$2])[$1])");
    return translated;
}
void show_matches(const std::string& in, const std::string& re)
{
    std::smatch m;
    std::regex_search(in, m, std::regex(re));
    if(m.empty()) {
        std::cout << "input=[" << in << "], regex=[" << re << "]: NO MATCH\n";
    } else {
        std::cout << "input=[" << in << "], regex=[" << re << "]: ";
        std::cout << "prefix=[" << m.prefix() << "] ";
        for(std::size_t n = 0; n < m.size(); ++n)
            std::cout << " m[" << n << "]=[" << m[n] << "] ";
        std::cout << "suffix=[" << m.suffix() << "]\n";
    }
}



int main()
{
    std::vector<std::string> tests = {
        "Some text [0-9-[4]] suffix", 
        "([abcde-[ae]])",
        "[a-z-[aei]]|[A-Z-[OU]] "
    };
    std::string re = translated_regex("[a-z-[aeiou]]{2,4}");
    show_matches("abcdefghi", re);
    
    for(std::string test : tests){
       std::cout << " " << test << '\n' 
        << "   -- " << translated_regex(test) << '\n'; 
    }
    
    return 0;
}

Edit: Recursive and Named character classes

The above approach does not work with nested (recursive) character class subtraction, and there is no way to handle recursive substitutions using only regular expressions. This makes the solution far less straightforward.
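To illustrate the limitation, the sketch below repeats the single regex_replace pass of translated_regex from the previous program under the hypothetical name translate_once, so that it can be tried on a nested class in isolation:

```cpp
#include <regex>
#include <string>

// Same single-pass rewrite as translated_regex above, under a different name.
std::string translate_once(const std::string& pattern)
{
    static const std::regex class_subtraction_re(
        "\\[((?:\\\\[\\[\\]]|[^[\\]])*)-\\[((?:\\\\[\\[\\]]|[^[\\]])*)\\]\\]");
    // $1 is the base class body, $2 is the subtracted class body
    return std::regex_replace(pattern, class_subtraction_re,
                              "(?:(?![$2])[$1])");
}
```

For the flat class [a-z-[aeiou]] this yields (?:(?![aeiou])[a-z]) as intended. For the nested class [a-z-[aeiou-[e]]] only the innermost pair [aeiou-[e]] fits the pattern, so the result is [a-z-(?:(?![e])[aeiou])], which no longer expresses the original subtraction.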

The solution has the following parts:

  • One function scans the regular expression for a [.
  • When a [ is found, another function handles the character class, calling itself recursively whenever -[ is found.
  • The pattern \p{xxxxx} is handled separately to identify named character classes. The named classes are defined in the specialCharClass map; I filled in two examples.
#include <iostream>
#include <regex>
#include <string>
#include <vector>
#include <map>

std::map<std::string, std::string> specialCharClass = {
    {"IsDigit", "0-9"},
    {"IsBasicLatin", "a-zA-Z"}
    // Feel free to add the character classes you want
};

const std::string getCharClassByName(const std::string &pattern, size_t &pos){
    std::string key;
    while(++pos < pattern.size() && pattern[pos] != '}'){
        key += pattern[pos];
    }
    ++pos;
    return specialCharClass[key];
}

std::string translate_char_class(const std::string &pattern, size_t &pos){
    
    std::string positive;
    std::string negative;
    if(pattern[pos] != '['){
        return "";
    }
    ++pos;
    
    while(pos < pattern.size()){
        if(pattern[pos] == ']'){
            ++pos;
            if(negative.size() != 0){
                return "(?:(?!" + negative + ")[" + positive + "])";
            }else{
                return "[" + positive + "]";
            }
        }else if(pattern[pos] == '\\'){
            if(pos + 3 < pattern.size() && pattern[pos+1] == 'p'){
                positive += getCharClassByName(pattern, pos += 2);
            }else{
                positive += pattern[pos++];
                positive += pattern[pos++];
            }
        }else if(pattern[pos] == '-' && pos + 1 < pattern.size() && pattern[pos+1] == '['){
            if(negative.size() == 0){
                negative = translate_char_class(pattern, ++pos);
            }else{
                negative += '|';
                // append the new alternative instead of overwriting the earlier one
                negative += translate_char_class(pattern, ++pos);
            }
        }else{
            positive += pattern[pos++];
        }
    }
    return '[' + positive; // unterminated class: pass it through so std::regex reports the error
}

std::string translate_regex(const std::string &pattern, size_t pos = 0){
    std::string r;
    while(pos < pattern.size()){
        if(pattern[pos] == '\\'){
            r += pattern[pos++];
            r += pattern[pos++];
        }else if(pattern[pos] == '['){
            r += translate_char_class(pattern, pos);
        }else{
            r += pattern[pos++];
        }
    }
    return r;
}

void show_matches(const std::string& in, const std::string& re)
{
    std::smatch m;
    std::regex_search(in, m, std::regex(re));
    if(m.empty()) {
        std::cout << "input=[" << in << "], regex=[" << re << "]: NO MATCH\n";
    } else {
        std::cout << "input=[" << in << "], regex=[" << re << "]: ";
        std::cout << "prefix=[" << m.prefix() << "] ";
        for(std::size_t n = 0; n < m.size(); ++n)
            std::cout << " m[" << n << "]=[" << m[n] << "] ";
        std::cout << "suffix=[" << m.suffix() << "]\n";
    }
}



int main()
{
    std::vector<std::string> tests = {
        "[a]",
        "[a-z]d",
        "[\\p{IsBasicLatin}-[\\p{IsDigit}-[89]]]",
        "[a-z-[aeiou]]{2,4}",
        "[a-z-[aeiou-[e]]]",
        "Some text [0-9-[4]] suffix", 
        "([abcde-[ae]])",
        "[a-z-[aei]]|[A-Z-[OU]] "
    };
    
    for(std::string test : tests){
       std::cout << " " << test << '\n' 
        << "   -- " << translate_regex(test) << '\n'; 
        // Construct a regex (validates the syntax; throws std::regex_error on failure)
        std::regex(translate_regex(test)); 
    }
    std::string re = translate_regex("[a-z-[aeiou-[e]]]{2,10}");
    show_matches("abcdefghi", re);
    
    return 0;
}

ralf htp On

Try using a function from a library with XPath support, such as xmlregexp in libxml (a C library). It can handle the XML regexes and apply them to the XML directly.

http://www.xmlsoft.org/html/libxml-xmlregexp.html#xmlRegexp

http://web.mit.edu/outland/share/doc/libxml2-2.4.30/html/libxml-xmlregexp.html

An alternative could have been PugiXML (a C++ library; see "What XML parser should I use in C++?"), but I think it does not implement the XML regex functionality.