I'm debugging an existing Boost QI grammar in our software for parsing "endpoints" (host:port, where host can be a hostname, IPv4 address, or IPv6 address). I'm specifically having a problem with the hostname part (the port, IPv4, and IPv6 parts all work perfectly). The full code is extensive, so I've simplified it down to the following example, which can be easily run on Wandbox (C++17, Boost 1.79.0)
#include <iostream>
#include <string>
#include <boost/fusion/include/std_pair.hpp>
#include <boost/fusion/include/boost_array.hpp>
#include <boost/phoenix.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/algorithm/string.hpp>
class Endpoint
{
public:
Endpoint(): _host(""), _port(0) {};
std::string _host;
unsigned short _port;
};
template < typename ITERATOR >
struct HostnameGrammar :
boost::spirit::qi::grammar< ITERATOR, std::string() >
{
HostnameGrammar()
:
HostnameGrammar::base_type( start )
{
using boost::spirit::qi::alnum;
using boost::spirit::qi::alpha;
using boost::spirit::qi::char_;
using boost::spirit::_1;
start %=
dottedName | singleName
;
dottedName %=
+( singleName >> char_( '.' ) ) >>
tld
;
singleName %=
+alnum >> *( char_('-') >> +alnum )
;
tld %=
+alpha >> *( -char_('-') >> +alnum )
;
}
boost::spirit::qi::rule< ITERATOR, std::string() > start;
boost::spirit::qi::rule< ITERATOR, std::string() > dottedName;
boost::spirit::qi::rule< ITERATOR, std::string() > singleName;
boost::spirit::qi::rule< ITERATOR, std::string() > tld;
};
template < typename ITERATOR >
struct EndpointGrammar :
boost::spirit::qi::grammar< ITERATOR, std::string() >
{
EndpointGrammar(
Endpoint & endpoint )
:
EndpointGrammar::base_type( start ),
endpoint_( endpoint )
{
using boost::spirit::qi::ushort_;
using boost::spirit::_1;
start =
-address[ boost::phoenix::ref( endpoint._host ) = _1 ] >>
-( ':' >> ushort_[ boost::phoenix::ref( endpoint._port ) = _1 ] )
;
address %=
hostname;
}
Endpoint & endpoint_;
boost::spirit::qi::rule< ITERATOR, std::string() > start;
boost::spirit::qi::rule< ITERATOR, std::string() > address;
HostnameGrammar< ITERATOR > hostname;
};
int main()
{
std::vector< std::string > endpointStrings {
// these should parse successfully (and do)
"0foo", "foo", "foo.net", "foo:1234", "foo.net:5678", "foo.example.net", "foo.example.net:9012",
"foo-bar", "foo-bar.com", "foo-bar:1234", "foo-bar.net-0:5678", "foo-bar.example.com-1:9012",
// these should fail to parse (and do)
"foo.0bar", "foo.0bar:1234", "foo.bar-", "foo.bar-:1234", "foo-", "-foo"
};
for ( auto const & endpointString : endpointStrings )
{
Endpoint tempEndpoint;
std::string::const_iterator beginIt( endpointString.begin() );
std::string::const_iterator endIt( endpointString.end() );
EndpointGrammar< std::string::const_iterator > grammar( tempEndpoint );
if ( !boost::spirit::qi::parse( beginIt, endIt, grammar ) || beginIt != endIt )
{
std::cout << "Failed: " << endpointString << std::endl;
}
else
{
std::cout << "Succeeded: " << endpointString << " = " << tempEndpoint._host << " / " << tempEndpoint._port << std::endl;
}
}
return 0;
}
The grammar successfully parses or fails to parse all the examples that it should parse or fail to parse per RFC rules, so that great. The problem is that whatever part is the "last" part of the hostname gets "doubled." Here's the output of running that:
Succeeded: 0foo = 0foo0foo / 0
Succeeded: foo = foofoo / 0
Succeeded: foo.net = foo.netnet / 0
Succeeded: foo:1234 = foofoo / 1234
Succeeded: foo.net:5678 = foo.netnet / 5678
Succeeded: foo.example.net = foo.example.netnet / 0
Succeeded: foo.example.net:9012 = foo.example.netnet / 9012
Succeeded: foo-bar = foo-barfoo-bar / 0
Succeeded: foo-bar.com = foo-bar.comcom / 0
Succeeded: foo-bar:1234 = foo-barfoo-bar / 1234
Succeeded: foo-bar.net-0:5678 = foo-bar.net-0net-0 / 5678
Succeeded: foo-bar.example.com-1:9012 = foo-bar.example.com-1com-1 / 9012
Failed: foo.0bar
Failed: foo.0bar:1234
Failed: foo.bar-
Failed: foo.bar-:1234
Failed: foo-
Failed: -foo
Note "foofoo", "foo.netnet", "foo.example.netnet", etc.
I've tried about 300 different variations of these rules—far too many to include them all here. Suffice it to say that many variations successfully identify valid vs invalid hostnames, but all those that do so correctly also suffer from the same duplicated last-part problem. I'm all out of ideas over here. Anyone know why this doesn't work and how to fix it?
Note that I've even tried the simplest possible alphabetic-only rule to achieve what I'm trying to achieve in order to expand on that rule with my more-complicated character sets, but even this duplicates "foo.net" into "foo.netnet":
start %=
+( +alpha >> char_( '.' ) ) >> +alpha
;
Without even looking, I know you've run into non-atomic attribute propagation into container attributes, see e.g.
The classical fix would be to apply
qi::hold[]- keeping in mind the performance cost under backtracking.Proof & "How To Debug"?
Slapping on some BOOST_SPIRIT_DEBUG* macros, let's zoom in on the
foo.netnetcase:Prints (excerpt, Live On Coliru)
As you can see, the first
singleNameincludes.net, only to backtrack because no more'.'follows. Thentldmatches.The "tedious" ways to fix it would be to use
hold[](Live):Or using a lookahead assertion (less efficient!, Live)
Simplify!
Both work. But since all you do at current is propagate the input 1:1 to the resulting string, you can do everything a lot simpler, using
raw[],operator%and literals instead ofchar_:In this setup,
tld,singleNameanddottedNameneed not build an attribute at all, so backtracking is irrelevant. See it Live With Full Test CasesPrinting
Code Review: The Problem Of Semantic Actions
However looking at the code I notice "gratuitous" semantic actions. They, too, are not atomic. See this staple on SA: Boost Spirit: "Semantic actions are evil"?
I'd strongly suggest avoiding them unless absolutely necessary. This also makes the parser stateless, meaning you don't need to construct and compile the rules/expressions for every parse.
Live On Coliru
With the same output as before, but much smaller code, much shorter compile times.
Grammar/AST Adaptation
To make it easier to pass your application AST type instead of binding separate host/port string attributes, you might want to use
std::pairor adapt your custom type:Live On Coliru
Still with the same output as ever. Note how the grammar is stateless.