Which Xpressive approach is best to reduce stack usage

116 Views Asked by At

I'm using Xpressive extensively in my current embedded C++ project.

As I know, Xpressive is a great user of the stack. But are there Xpressive regex approaches that are more stack efficient?

E.g. a regex to match a string representing a 32-bit integer may need to test if digit number six is less than or equal to 6.

Xpressive (and other regex engines too, I know) allows numerous approaches, e.g.:

range('0','6')

or

('0'|'1'|'2'|'3'|'4'|'5'|'6')

or

set['0'|'1'|'2'|'3'|'4'|'5'|'6']

and then the regex may allow for 3 following digits, e.g.:

repeat<3>(_d) >> _b

or

_d >> _d >> _d >> _b

But, given the choice, and not caring too much about source code layout, which approach is optimal for:

a) stack;

b) speed;

c) ?

1

There are 1 best solutions below

0
On

Sequencing and repetition use up stack because they require backtracking. Let's look at your examples.

range('0','6')

This is implemented as a quick check to see if a character falls within a character range. It doesn't use sequencing or repetition, so it doesn't gobble stack space.

('0'|'1'|'2'|'3'|'4'|'5'|'6')

In general, don't put in an alternate what can be put in a set or a range. They are far more efficient for this kind of thing. I'll note that obviously this isn't a valid xpressive regex. You would write it as:

(as_xpr('0')|'1'|'2'|'3'|'4'|'5'|'6')

I'm sure you know that. I'll make a similar edit for this one:

set[as_xpr('0')|'1'|'2'|'3'|'4'|'5'|'6']

Sets are fast. They are implemented as a look-up table for the ascii range and (for Unicode) a sparse vector for character codes above 256. Because of the sparseness of some sets, this scheme can take up more heap space than a simple range. But that doesn't effect the amount of stack space used when matching one. I'll note you can also write it as:

(set= '1','2','3','4','5','6')

Now for repetition:

repeat<3>(_d) >> _b

As I said before, repetition is usually implemented with recursion to get the backtracking right. But in this case, xpressive recognizes that _d can only ever eat one character, and that it's being repeated exactly 3 times. So in this limited case, it's implemented with a loop to conserve stack space. This is in contrast to the next case:

_d >> _d >> _d >> _b

Xpressive isn't smart enough to treat _d >> _d >> _d the same way as repeat<3>(_d). As a result, this will consume a separate stack frame for each _d.

Short version: prefer specialized constructs where you can over their more general equivalents. That gives xpressive more context for optimization.

In addition, get to know independent sub-expressions. If you can put something in a keep(), go for it. Why? keep() matches a sub-expression, then reclaims the stack space used by that sub-match. That makes it impossible to try different things by backtracking into that sub-expression, but sometimes you don't need that.

Hope that helps.