C++11: Safe practice with a regex that has two possible numbers of matches


With this regex, I would like to match a time with or without a milliseconds (ms) field. For completeness, here is the regex (I removed the anchors in regex101 to enable multi-line):

^(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9]):([0-5][0-9])(?:|(?:\.)([0-9]{1,6}))$

I don't quite understand how C++ behaves here. As you can see in regex101, the number of capture groups depends on the string. If there's no ms, it's 3+1 (since C++ uses match[0] for the matched pattern), and if there's ms, then it's 4+1. But then in this example:

#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::regex timeRegex(R"(^(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9]):([0-5][0-9])(?:|(?:\.)([0-9]{1,6}))$)");
    std::smatch m;
    std::string strT("12:00:09");
    bool timeMatch = std::regex_match(strT, m, timeRegex);
    std::cout << m.size() << std::endl; // number of sub-matches, including m[0]
    if (timeMatch)
    {
        std::cout << m[0] << std::endl; // whole match
        std::cout << m[1] << std::endl; // hours
        std::cout << m[2] << std::endl; // minutes
        std::cout << m[3] << std::endl; // seconds
        std::cout << m[4] << std::endl; // optional ms field
    }
}

We see that m.size() is always 5, whether or not there is an ms field! m[4] is an empty string if there's no ms field. Is this the default behavior of regex in C++? Or should I use try/catch (or some other safety measure) when in doubt about the size? I mean... even the size is a little misleading here!


There are 3 best solutions below

0
On BEST ANSWER

m.size() will always be the number of marked subexpressions in your expression plus 1 (for the whole expression).

In your code you have 4 marked subexpressions, whether these are matched or not has no effect on the size of m.

If you want to know whether there are milliseconds, you can check:

m[4].matched
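
A minimal sketch of how that check might be used in practice. The ParsedTime struct and the parseTime helper are hypothetical names introduced for illustration; only the regex comes from the question. The idea is to test the return value of regex_match first and then use m[4].matched, rather than m.size(), to decide whether the ms field was present.

```cpp
#include <regex>
#include <string>

// Hypothetical result type, not part of the original question.
struct ParsedTime {
    int hours;
    int minutes;
    int seconds;
    bool hasFraction;      // true only if capture group 4 participated
    std::string fraction;  // digits after the dot, empty if absent
};

// Hypothetical helper: returns false if the text is not a valid time.
bool parseTime(const std::string& text, ParsedTime& out) {
    static const std::regex timeRegex(
        R"(^(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9]):([0-5][0-9])(?:|(?:\.)([0-9]{1,6}))$)");
    std::smatch m;
    if (!std::regex_match(text, m, timeRegex)) {
        return false;  // no match at all: do not read any groups
    }
    out.hours = std::stoi(m[1].str());
    out.minutes = std::stoi(m[2].str());
    out.seconds = std::stoi(m[3].str());
    out.hasFraction = m[4].matched;  // the check suggested above
    out.fraction = m[4].str();       // empty string when unmatched
    return true;
}
```

Calling parseTime on "12:00:09" should leave hasFraction false, while "12:00:09.123" should set it to true with fraction holding "123".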
0
On

std::smatch (a.k.a. std::match_results<std::string::const_iterator>) is basically a container that holds elements of type std::sub_match. The first element is the match result for your full regular expression, and the subsequent ones hold the matches for each sub-expression. Since you have 4 sub-expressions in your pattern, you are getting 5 results (4 + full match).
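
Because std::smatch behaves like a container, it can be walked with a range-based for loop. The following is only a sketch: timeRegex() and describeGroups() are hypothetical helpers, and the "<unmatched>" placeholder is an arbitrary choice made here to show which elements did not participate in the match.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical accessor wrapping the regex from the question.
const std::regex& timeRegex() {
    static const std::regex re(
        R"(^(0[0-9]|1[0-9]|2[0-3]):([0-5][0-9]):([0-5][0-9])(?:|(?:\.)([0-9]{1,6}))$)");
    return re;
}

// Hypothetical helper: one string per element of the match results.
// Returns an empty vector if the text does not match at all.
std::vector<std::string> describeGroups(const std::string& text) {
    std::vector<std::string> out;
    std::smatch m;
    if (!std::regex_match(text, m, timeRegex())) {
        return out;
    }
    // Each element is a std::ssub_match, i.e. sub_match<string::const_iterator>.
    for (const std::ssub_match& sub : m) {
        out.push_back(sub.matched ? sub.str() : std::string("<unmatched>"));
    }
    return out;
}
```

For "12:00:09" this should yield five entries with the last one marked as unmatched; for "12:00:09.5" the last entry should be "5".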

1
On
m.size(); // Returns the number of sub_match elements: one for the whole
          // match plus one for each 'Capture Group'.
          // Each element holds a pair of iterators into the searched string.

Since smatch is a match_results (see http://www.cplusplus.com/reference/regex/match_results/), size() returns the number of sub_match elements it holds. After a successful match, that number is based on the number of capture groups your regex contains, not on how many of them actually participated in the match.

Capture Groups:

Parentheses group the regex between them. They capture the text matched by the regex inside them into a numbered group that can be reused with a numbered backreference. They allow you to apply regex operators to the entire grouped regex.

http://www.regular-expressions.info/refcapture.html

So that is why your size is going to be 5 no matter which groups regex_match() ends up filling. As others have noted, one of the five is the full match (m[0]).
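
The relationship between size() and the number of capture groups can be checked with std::regex::mark_count(). This is a rough sketch under the assumption that you only want to observe the sizes; MatchShape and inspect() are hypothetical names. Note that size() is 0 after a failed match, so testing the return value of regex_match (or m.empty()) should be enough, and no try/catch is needed. The standard also specifies that operator[] with an out-of-range index returns an unmatched sub_match rather than throwing.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical summary of what a match attempt produced.
struct MatchShape {
    bool matched;           // return value of regex_match
    std::size_t size;       // m.size() after the attempt
    bool lastGroupMatched;  // whether the last capture group participated
};

// Hypothetical helper: runs regex_match and records the shape of the result.
MatchShape inspect(const std::string& text, const std::regex& re) {
    std::smatch m;
    MatchShape s;
    s.matched = std::regex_match(text, m, re);
    s.size = m.size();  // mark_count() + 1 on success, 0 on failure
    // Only look at groups once we know the match succeeded.
    s.lastGroupMatched = s.matched && m[re.mark_count()].matched;
    return s;
}
```

With the regex from the question, mark_count() should be 4, so a successful match reports a size of 5 whether or not the ms field is present, and a failed match reports 0.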

See: What does std::match_results::size return?