I’m experimenting with boost::spirit to write a URL parser. My objective is to parse the input URL (valid or invalid) and break it down into prefix, host and suffix as below:
Input ipv6 URL: https://[::ffff:192.168.1.1]:8080/path/to/resource
Break this into below parts:
Prefix: https://
Host: ::ffff:192.168.1.1
Suffix: :8080/path/to/resource
Input ipv6 URL: https://::ffff:192.168.1.1/path/to/resource
Break this into below parts:
Prefix: https://
Host: ::ffff:192.168.1.1
Suffix: /path/to/resource
Input ipv4 URL: https://192.168.1.1:8080/path/to/resource
Break this into below parts:
Prefix: https://
Host: 192.168.1.1
Suffix: :8080/path/to/resource
The colon character ‘:’ is used as a delimiter inside an IPv6 address and also as the delimiter before the port in an IPv4 URL. Because of this ambiguity, I’m having a hard time defining a boost::spirit grammar that works for both IPv4 and IPv6 URLs. Please refer to the code below:
#include <boost/fusion/include/adapt_struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <string>

struct UrlParts
{
    std::string scheme;
    std::string host;
    std::string port;
    std::string path;
};

BOOST_FUSION_ADAPT_STRUCT(
    UrlParts,
    (std::string, scheme)
    (std::string, host)
    (std::string, port)
    (std::string, path)
)

void parseUrl_BoostSpirit(const std::string &input, std::string &prefix, std::string &suffix, std::string &host)
{
    namespace qi = boost::spirit::qi;

    // Define the grammar: optional scheme, optionally bracketed host, optional port, rest
    qi::rule<std::string::const_iterator, UrlParts()> url =
        -(+qi::char_("a-zA-Z0-9+-.") >> "://")
        >> -qi::lit('[') >> +qi::char_("a-fA-F0-9:.") >> -qi::lit(']')
        >> -(qi::lit(':') >> +qi::digit)
        >> *qi::char_;

    // Parse the input
    UrlParts parts;
    auto iter = input.begin();
    if (qi::parse(iter, input.end(), url, parts))
    {
        prefix = parts.scheme.empty() ? "" : parts.scheme + "://";
        host = parts.host;
        suffix = (parts.port.empty() ? "" : ":" + parts.port) + parts.path;
    }
    else
    {
        // Fall back to treating the whole input as the host
        host = input;
    }
}
The above code produces incorrect output for the IPv4 URL, as shown below:
Input URL ipv4: https://192.168.1.1:8080/path/to/resource
Broken parts:
Prefix: https://
Host: 192.168.1.1:8080
Suffix: /path/to/resource
That is, Host contains :8080 instead of it being part of Suffix.
If I change the URL grammar, I can fix the IPv4 case, but then IPv6 breaks.
Of course this can be done using trivial if-else parsing logic, but I'm trying to do it more elegantly using boost::spirit. Any suggestions on how to update the grammar to support both IPv4 and IPv6 URLs?
PS: I'm aware that URLs with an IPv6 address without [ ] are invalid per the RFC, but the application I'm working on needs to process these invalid URLs as well.
Thanks in advance!
First off, your expression `char_("+-.")` accidentally allows for ',' inside the scheme: https://coliru.stacked-crooked.com/a/14c00775d9f3d99e

To inoculate against that, always put `-` first or last in character sets so it can't be misinterpreted as a range: `char_("+.-")`. Yeah, that's subtle.

`-'[' >> p >> -']'` allows for unmatched brackets. Instead say `('[' >> p >> ']' | p)`.

With those applied, let's rewrite the parser expression so we see what's happening:
So I went on to create a test-bed to demonstrate your question examples:
Live On Coliru
Prints:
The Problem
You already assessed the problem: `:8080` matches the production for `host_`. I'd reason that the port specification is the odd one out because it must be the last before `'/'` or the end of input. In other words:

Now you can do a negative look-ahead assertion in your `host_` production to avoid eating port specifications:

Now the output becomes
Live On Coliru
Note that there are some inefficiencies and probably RFC violations in this implementation. Consider a static instance of the grammar. Also consider using X3.
Using X3 and Asio
I have a related answer here: What is the nicest way to parse this in C++?. It shows an X3 approach with validation using Asio's networking primitives.
Boost URL
Why roll your own?
To be really pedantic and get the `://` as well:

This parses what you have and much more (fragment from the Reference Help Card):
The notable value is
[]
)Live On Coliru