URL parsing using boost::spirit

I’m experimenting with boost::spirit to write a URL parser. My objective is to parse the input URL (valid or invalid) and break it down into prefix, host and suffix as below:

Input ipv6 URL: https://[::ffff:192.168.1.1]:8080/path/to/resource
Break this into below parts:
Prefix: https://
Host: ::ffff:192.168.1.1
Suffix: :8080/path/to/resource

Input ipv6 URL: https://::ffff:192.168.1.1/path/to/resource
Break this into below parts:
Prefix: https://
Host: ::ffff:192.168.1.1
Suffix: /path/to/resource

Input ipv4 URL: https://192.168.1.1:8080/path/to/resource
Break this into below parts:
Prefix: https://
Host: 192.168.1.1
Suffix: :8080/path/to/resource

The colon character ‘:’ is used as a delimiter within an ipv6 address and also as the delimiter before the port in an ipv4 address. Due to this ambiguity, I’m having a hard time defining a boost::spirit grammar that works for both ipv4 and ipv6 URLs. Please refer to the code below:

struct UrlParts
{
    std::string scheme;
    std::string host;
    std::string port;
    std::string path;
};

BOOST_FUSION_ADAPT_STRUCT(
    UrlParts,
    (std::string, scheme)
    (std::string, host)
    (std::string, port)
    (std::string, path)
)

void parseUrl_BoostSpirit(const std::string &input, std::string &prefix, std::string &suffix, std::string &host)
{
    namespace qi = boost::spirit::qi;

    // Define the grammar
    qi::rule<std::string::const_iterator, UrlParts()> url = -(+qi::char_("a-zA-Z0-9+-.") >> "://") >> -qi::lit('[') >> +qi::char_("a-fA-F0-9:.") >> -qi::lit(']') >> -(qi::lit(':') >> +qi::digit) >> *qi::char_;


    // Parse the input
    UrlParts parts;
    auto iter = input.begin();
    if (qi::parse(iter, input.end(), url, parts))
    {
        prefix = parts.scheme.empty() ? "" : parts.scheme + "://";
        host = parts.host;
        suffix = (parts.port.empty() ? "" : ":" + parts.port) + parts.path;
    }
    else
    {
        host = input;
    }
}

The above code produces incorrect output for the ipv4 URL, as shown below:

Input URL ipv4: https://192.168.1.1:8080/path/to/resource
Broken parts:
Prefix: https://
Host: 192.168.1.1:8080
Suffix: /path/to/resource
That is, Host contains :8080 when it should be part of Suffix.

If I change the URL grammar, I can fix the ipv4 but then ipv6 breaks.

Of course this can be done using trivial if-else parsing logic, but I'm trying to do it more elegantly using boost::spirit. Any suggestions on how to update the grammar to support both ipv4 and ipv6 URLs?

PS: I'm aware that URLs with ipv6 address w/o [ ] are invalid as per RFC, but the application I'm working on requires processing these invalid URLs as well.

Thanks in advance!

Best Answer

First off, the "+-." inside your scheme character set, char_("a-zA-Z0-9+-."), is read as the range from '+' through '.', so it accidentally allows ',' inside the scheme: https://coliru.stacked-crooked.com/a/14c00775d9f3d99e

To inoculate against that, always put - first or last in character sets so it can't be misinterpreted as a range: char_("+.-"). Yeah, that's subtle.
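
To make the range pitfall concrete, here is a minimal stdlib-only sketch. The helper expandCharSet is hypothetical (it is not part of Boost.Spirit); it only mimics the usual convention that "x-y" between two characters denotes an inclusive range, which is the interpretation that bites here.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper, NOT a Boost.Spirit API: expands a character-set spec
// in which "x-y" (a '-' with a character on both sides) is an inclusive range.
std::string expandCharSet(std::string_view spec) {
    std::string out;
    for (std::size_t i = 0; i < spec.size(); ++i) {
        if (i + 2 < spec.size() && spec[i + 1] == '-') {
            // Range: append every character from spec[i] through spec[i + 2].
            int const lo = static_cast<unsigned char>(spec[i]);
            int const hi = static_cast<unsigned char>(spec[i + 2]);
            for (int c = lo; c <= hi; ++c)
                out += static_cast<char>(c);
            i += 2; // skip the '-' and the upper bound
        } else {
            // Literal character (this includes a leading or trailing '-').
            out += spec[i];
        }
    }
    return out;
}
```

Under that assumption, expandCharSet("+-.") yields "+,-." (the comma sits between '+' and '.' in ASCII), while expandCharSet("+.-") yields just the three intended characters.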

Also, -'[' >> p >> -']' (with p standing for the host parser) allows for unmatched brackets, such as an opening '[' without a closing ']'. Instead say ('[' >> p >> ']' | p).

With those applied, let's rewrite the parser expression so we see what's happening:

// Define the grammar
auto scheme_ = qi::copy(+qi::char_("a-zA-Z0-9+.-") >> "://");
auto host_   = qi::copy(+qi::char_("a-fA-F0-9:."));
auto port_   = qi::copy(':' >> +qi::digit);

qi::rule<std::string::const_iterator, UrlParts()> const url =
    -scheme_ >> ('[' >> host_ >> ']' | host_) >> -port_ >> *qi::char_;

So I went on to create a test-bed to demonstrate your question examples:

Note: I simplified the handling by adding raw[] so that the scheme includes the ://, and by just returning and printing UrlParts, because it is more insightful to see what the parser does.

// #define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <boost/pfr/io.hpp>
#include <iomanip>
#include <iostream>
#include <string>
#include <string_view>

struct UrlParts { std::string scheme, host, port, path; };
BOOST_FUSION_ADAPT_STRUCT(UrlParts, scheme, host, port, path)

UrlParts parseUrl_BoostSpirit(std::string_view input) {
    namespace qi = boost::spirit::qi;

    using It = std::string_view::const_iterator;
    qi::rule<It, UrlParts()> url;
    //using R = qi::rule<It, std::string()>;
    //R scheme_, host_, port_;
    auto scheme_ = qi::copy(qi::raw[+qi::char_("a-zA-Z0-9+.-") >> "://"]);
    auto host_   = qi::copy(+qi::char_("a-fA-F0-9:."));
    auto port_   = qi::copy(':' >> +qi::digit);
    url          = -scheme_ >> ('[' >> host_ >> ']' | host_) >> -port_ >> *qi::char_;

    // BOOST_SPIRIT_DEBUG_NODES((scheme_)(host_)(port_)(url));
    BOOST_SPIRIT_DEBUG_NODES((url));

    // Parse the input
    UrlParts parts;
    parse(input.begin(), input.end(), qi::eps > url > qi::eoi, parts);
    return parts;
}

int main() {
    using It        = std::string_view::const_iterator;
    using Exception = boost::spirit::qi::expectation_failure<It>;

    for (std::string_view input : {
             "https://[::ffff:192.168.1.1]:8080/path/to/resource",
             "https://::ffff:192.168.1.1/path/to/resource",
             "https://192.168.1.1:8080/path/to/resource",
         }) {
        try {
            auto parsed = parseUrl_BoostSpirit(input);
            // using boost::fusion::operator<<; // less clear output, without PFR
            // std::cout << std::quoted(input) << " -> " << parsed << std::endl;
            std::cout << std::quoted(input) << " -> " << boost::pfr::io(parsed) << std::endl;
        } catch (Exception const& e) {
            std::cout << std::quoted(input) << " EXPECTED " << e.what_ << " at "
                      << std::quoted(std::string_view(e.first, e.last)) << std::endl;
        }
    }
}

Prints:

"https://[::ffff:192.168.1.1]:8080/path/to/resource" -> {"https://", "::ffff:192.168.1.1", "8080", "/path/to/resource"}
"https://::ffff:192.168.1.1/path/to/resource" -> {"https://", "::ffff:192.168.1.1", "", "/path/to/resource"}
"https://192.168.1.1:8080/path/to/resource" -> {"https://", "192.168.1.1:8080", "", "/path/to/resource"}

The Problem

You already assessed the problem: :8080 matches the production for host_. I'd reason that the port specification is the odd one out because it must be the last before '/' or the end of input. In other words:

auto port_   = qi::copy(':' >> +qi::digit >> &('/' | qi::eoi));

Now you can use that as a negative look-ahead in your host_ production to avoid eating port specifications. The difference operator a - b matches a only where b does not match:

auto host_   = qi::copy(+(qi::char_("a-fA-F0-9:.") - port_));

Now the output becomes

"https://[::ffff:192.168.1.1]:8080/path/to/resource" -> {"https://", "::ffff:192.168.1.1", "8080", "/path/to/resource"}
"https://::ffff:192.168.1.1/path/to/resource" -> {"https://", "::ffff:192.168.1.1", "", "/path/to/resource"}
"https://192.168.1.1:8080/path/to/resource" -> {"https://", "192.168.1.1", "8080", "/path/to/resource"}

Note that there are some inefficiencies and probably RFC violations in this implementation. For example, the rule is reconstructed on every call, so consider a static instance of the grammar. Also consider using X3.
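
For readers who want to see the host/port look-ahead rule spelled out without Spirit, here is a rough hand-written sketch of the same idea. It assumes the scheme and any brackets have already been stripped; the names Parts, looksLikePort, isHostChar and splitHostPort are made up for illustration and are not from the code above.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

struct Parts { std::string host, port, rest; };

// Mirrors port_: ':' then one or more digits, followed by '/' or end of input.
// On success, len receives the length of the ":digits" part.
bool looksLikePort(std::string_view s, std::size_t& len) {
    if (s.empty() || s[0] != ':') return false;
    std::size_t i = 1;
    while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i]))) ++i;
    if (i == 1) return false;                        // no digits after ':'
    if (i != s.size() && s[i] != '/') return false;  // look-ahead failed
    len = i;
    return true;
}

// Mirrors the host character set "a-fA-F0-9:.".
bool isHostChar(char c) {
    return std::isxdigit(static_cast<unsigned char>(c)) || c == ':' || c == '.';
}

// Mirrors host_ >> -port_ >> *char_ (unlike +, an empty host is not rejected).
Parts splitHostPort(std::string_view s) {
    Parts p;
    std::size_t i = 0, portLen = 0;
    // Consume host characters, but stop where a port specification begins.
    while (i < s.size() && isHostChar(s[i]) && !looksLikePort(s.substr(i), portLen))
        p.host += s[i++];
    if (i < s.size() && looksLikePort(s.substr(i), portLen)) {
        p.port = std::string(s.substr(i + 1, portLen - 1)); // drop the ':'
        i += portLen;
    }
    p.rest = std::string(s.substr(i));
    return p;
}
```

With this sketch, "192.168.1.1:8080/path/to/resource" splits into host 192.168.1.1, port 8080 and rest /path/to/resource, while "::ffff:192.168.1.1/path/to/resource" keeps the whole address as host with an empty port. The same caveat applies to the sketch and to the Spirit rule: an unbracketed ipv6 address whose last group is all digits and is directly followed by '/' or the end of input will have that group taken as a port.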

Using X3 and Asio

I have a related answer here: "What is the nicest way to parse this in C++?" It shows an X3 approach with validation using Asio's networking primitives.

Boost URL

Why roll your own?

UrlParts parseUrl(std::string_view input) {
    auto parsed = boost::urls::parse_uri(input).value();
    return {parsed.scheme(), parsed.host(), parsed.port(), std::string(parsed.encoded_resource())};
}

To be really pedantic and get the :// as well:

UrlParts parseUrl(std::string_view input) {
    auto parsed = boost::urls::parse_uri(input).value();
    assert(parsed.has_authority());
    return {
        parsed.buffer().substr(0, parsed.authority().data() - input.data()),
        parsed.host(),
        parsed.port(),
        std::string(parsed.encoded_resource()),
    };
}

This parses what you have and much more (fragment from the Reference Help Card):

[Image: fragment of the Boost URL Reference Help Card]

The notable benefits are:

  • conformance (yes, this means that IPv6 requires [])
  • proper encoding and decoding
  • low allocation (many operations work exclusively on the source string view)
  • maintenance (you don't need to debug/audit it yourself)

==== "https://[::ffff:192.168.1.1]:8080/path/to/resource" ====
 Spirit -> {"https://", "::ffff:192.168.1.1", "8080", "/path/to/resource"}
 URL    -> {"https://", "[::ffff:192.168.1.1]", "8080", "/path/to/resource"}
==== "https://::ffff:192.168.1.1/path/to/resource" ====
 Spirit -> {"https://", "::ffff:192.168.1.1", "", "/path/to/resource"}
 URL    -> leftover [boost.url.grammar:4]
==== "https://192.168.1.1:8080/path/to/resource" ====
 Spirit -> {"https://", "192.168.1.1", "8080", "/path/to/resource"}
 URL    -> {"https://", "192.168.1.1", "8080", "/path/to/resource"}
==== "https://192.168.1.1:8080/s?quey=param&other=more%3Dcomplicated#bookmark" ====
 Spirit -> {"https://", "192.168.1.1", "8080", "/s?quey=param&other=more%3Dcomplicated#bookmark"}
 URL    -> {"https://", "192.168.1.1", "8080", "/s?quey=param&other=more%3Dcomplicated#bookmark"}