I'm writing a simple expression parser. It is built on a Boost.Spirit.Qi grammar that consumes Boost.Spirit.Lex tokens (Boost version 1.56).
The tokens are defined as follows:
    using namespace boost::spirit;

    template<
        typename lexer_t
    >
    struct tokens
        : lex::lexer<lexer_t>
    {
        tokens()
            : /* ... */,
              variable("%(\\w+)")
        {
            this->self =
                /* ... */ |
                variable;
        }
        /* ... */
        lex::token_def<std::string> variable;
    };
Now I would like the variable token value to be just the name (the matching group (\\w+)) without the % prefix symbol. How do I do that?
Using a matching group by itself doesn't help. The value is still the full string, including the % prefix.
Is there any way to force the use of a matching group?
Or at least to refer to it somehow within the token's action?
I also tried using an action like this:
    variable[lex::_val = std::string(lex::_start + 1, lex::_end)]
but it failed to compile. The error said that none of the std::string constructor overloads could match the arguments:
    (const boost::phoenix::actor<Expr>, const boost::spirit::lex::_end_type)
Even the simpler
    variable[lex::_val = std::string(lex::_start, lex::_end)]
failed to compile, for a similar reason; only the first argument type was now boost::spirit::lex::_start_type.
Finally I tried this (even though it looks wasteful):
    lex::_val = std::string(lex::_val).erase(0, 1)
but that also failed to compile. This time the compiler was unable to convert from const boost::spirit::lex::_val_type to std::string.
Is there any way to deal with this problem?
Simple Solution
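The std::string attribute value has to be constructed lazily, with boost::phoenix::construct<std::string>(lex::_start + 1, lex::_end) in place of the direct std::string constructor call, exactly as suggested by jv_ in his (or her) comment.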
boost::phoenix::construct is provided by the <boost/phoenix/object/construct.hpp> header. Alternatively, include <boost/phoenix.hpp>.
Regular Expression Solution
The above solution, however, works well only in simple cases. It also rules out having the pattern provided from outside (from configuration data in particular), since changing the pattern to, for example, %(\\w+)% would require changing the value construction code.
That is why it would be much better to be able to refer to capture groups from the regular expression defining the token.
Note that this still isn't perfect, since odd cases like %(\\w+)%(\\w+)% would still require a code change to be handled correctly. That could be worked around by configuring not only the token's regex but also the means of forming the value from the matched range. Yet this goes beyond the scope of the question. Using capture groups directly seems flexible enough for many cases.
sehe stated in a comment elsewhere that there is no way to use capture groups from a token's regular expression. Not to mention that tokens actually support only a subset of regular expressions. (Notable differences include the lack of support for naming capture groups or ignoring them!)
My own experiments in this area confirm that. Sadly, there is no way to use capture groups. There is a workaround, however: you just have to re-apply the regex in your action.
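The core step of that workaround can be sketched without Spirit at all. The helper below is hypothetical (capture_range is not a Boost name), and it uses std::regex and std::pair as stand-ins for Boost.Regex and boost::iterator_range so that the sketch stays self-contained. Keep in mind that the lexer's regex dialect and the one used here are not identical, so this assumes a pattern valid in both.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: re-applies `pattern` to the already matched token text
// [first, last) and returns the iterator pair delimiting capture group `index`.
// Returns the empty range (last, last) when the pattern does not match the
// whole text or the requested group did not participate in the match.
template <typename Iterator>
std::pair<Iterator, Iterator>
capture_range(Iterator first, Iterator last,
              std::string const& pattern, std::size_t index = 1)
{
    std::regex const re(pattern);
    std::match_results<Iterator> m;
    if (std::regex_match(first, last, m, re)
        && index < m.size() && m[index].matched)
    {
        return std::make_pair(m[index].first, m[index].second);
    }
    return std::make_pair(last, last);
}
```

Because the result is only a pair of iterators into the original text, nothing is copied until a string is actually built from it.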
Action Obtaining Capture Range
To keep things a little modular, let's start with the simplest task: an action which returns the boost::iterator_range part of the token's match corresponding to the specified capture.
The action uses Boost.Regex (include <boost/regex.hpp>).
Action Obtaining Capture as String
The capture range is nice to have because it doesn't allocate any new memory for the string, but it is the string that we want in the end. So here is another action, built upon the previous one.
No magic here. We just make a std::basic_string from the range returned by the simpler action.
Action Assigning Value From the Capture
Actions that return a value are of little use to us. The ultimate goal is to set the token value from the capture, and this is done by the last action.
Discussion
The actions are attached to the token definition in the lexer, with the token itself passed as the first argument.
Optionally, you can provide a second argument: the index of the capture to use. It defaults to 1, which seems suitable in most cases.
Creating Functions
set_val_from_capture (or get_capture_as_string or get_capture, respectively) is an auxiliary function used for automatic deduction of template arguments from the token_def. In particular, what we need is the Char type to make the corresponding regular expression.
I'm not sure whether this could reasonably be avoided, and even if so, it would significantly complicate the call operator (especially if we strove to cache the regex object instead of building it anew each time). My doubts come mostly from not being sure whether the Char type of token_def is required to be the same as the character type of the tokenized sequence. I assumed that they don't have to be the same.
Repeating the Token
A definitely unpleasant part of the action is the need to provide the token itself as an argument, which is a repetition.
The token is, however, needed for the Char type as described above and to... get the regular expression!
It seems to me that, at least in theory, we could obtain the token somehow "at run-time" based on the id argument to the action (which we currently just ignore). However, I failed to find any way to obtain the token_def from the token's identifier, whether from the context argument or from the lexer itself (which could be passed to the action as this through the creating function).
Reusability
Since those are actions, they are not really reusable (out of the box) in more complex scenarios. For example, if you wanted not only to get the capture but also to convert it to a numeric value, you would have to write yet another action this way instead of composing a more complex action at the token.
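To make the point concrete, the core of such an extra action could look roughly like the sketch below. capture_as_int is a hypothetical helper, shown outside of Spirit and with std::regex standing in for Boost.Regex; a real action would wrap the same logic and store the result as the token value.

```cpp
#include <cstddef>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical helper: extracts capture group `index` of `pattern` from the
// token text and converts it to int. Throws std::invalid_argument when the
// pattern does not match the whole text or the group index is out of range;
// std::stoi throws on its own when the capture is not a valid number.
inline int capture_as_int(std::string const& text,
                          std::string const& pattern,
                          std::size_t index = 1)
{
    std::regex const re(pattern);
    std::smatch m;
    if (!std::regex_match(text, m, re) || index >= m.size())
        throw std::invalid_argument("pattern or capture group did not match");
    return std::stoi(m[index].str());
}
```

The extraction step is the same as for strings; only the final conversion differs, which is why having to write a separate action for it feels redundant.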
At first I tried to obtain the capture as a value that could be used inside a larger action expression.
That seems more flexible, as you could easily add more code around it, for example wrap it in a conversion function.
But I failed to achieve it, although I feel I didn't try hard enough. Learning more about Boost.Phoenix would surely help a lot here.
Double Work
This whole workaround doesn't save us from doing the work twice: the regex is parsed a second time and then matched a second time. But as mentioned at the beginning, there seems to be no better way (without altering Boost.Spirit itself).