Tokenize a string excluding delimiters inside quotes


First let me say, I have gone thoroughly through all other solutions to this problem on SO, and although they are very similar, none fully solve my problem.

I need to extract all tokens, stripping the quotes from the quoted ones, using boost regex.

The regex I think I need to use is:

sregex pattern = sregex::compile("\"(?P<token>[^\"]*)\"|(?P<token>\\S+)");

But I get an error of:

named mark already exists

The solution posted for C# appears to work with a duplicate named mark, because both uses of the name sit in alternatives of the same OR expression:

Regular Expression to split on spaces unless in quotes

There are 3 best solutions below

0
On BEST ANSWER

I answered a very similar question here:

How to make my split work only on one real line and be capable to skip quoted parts of string?

The example code

  • uses Boost Spirit
  • supports quoted strings, partially quoted fields, user defined delimiters, escaped quotes
  • supports many (diverse) output containers generically
  • supports models of the Range concept as input (includes char[], e.g.)

Tested with a relatively wide range of compiler versions and Boost versions.

https://gist.github.com/bcfbe2b5f071c7d153a0
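
For readers who only need the core idea rather than a full Spirit grammar, the features listed above (quoted strings, partially quoted fields, a user-defined delimiter, escaped quotes) can be approximated with a small hand-written scanner. This is a minimal sketch and not the code from the gist: the name split_quoted and the backslash escape convention are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not taken from the linked gist.
// Splits `input` on `delim`, keeps delimiters that appear inside double
// quotes, and strips the quotes themselves. Inside quotes, a backslash
// escapes the next character (assumed convention), so \" is a literal quote.
std::vector<std::string> split_quoted(const std::string& input, char delim = ' ')
{
    std::vector<std::string> tokens;
    std::string current;
    bool in_quotes = false;   // currently inside a "..." run
    bool has_token = false;   // distinguishes an empty "" token from no token

    for (std::size_t i = 0; i < input.size(); ++i)
    {
        const char c = input[i];
        if (in_quotes)
        {
            if (c == '\\' && i + 1 < input.size())
                current += input[++i];      // escaped character, taken literally
            else if (c == '"')
                in_quotes = false;          // closing quote
            else
                current += c;
        }
        else if (c == '"')
        {
            in_quotes = true;               // opening quote; may start mid-field
            has_token = true;
        }
        else if (c == delim)
        {
            if (has_token)                  // consecutive delimiters yield nothing
                tokens.push_back(current);
            current.clear();
            has_token = false;
        }
        else
        {
            current += c;
            has_token = true;
        }
    }
    if (has_token)
        tokens.push_back(current);          // flush the final field
    return tokens;
}
```

Calling split_quoted("a,\"b,c\",d", ',') should give the three tokens a, b,c and d. An unterminated quote is treated as running to the end of the input, which is a design choice of this sketch rather than anything the answer specifies.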

2
On

Most regex flavors don't allow group names to be reused. Some flavors permit it if all the uses are within the same alternation, but apparently yours isn't one of them. However, if you're running a recent enough version of Boost, you should be able to use a branch-reset group. It looks like this: (?|...|...|...). Within each alternative, the group numbering resets to wherever it was before the branch-reset group was reached. It should work with named groups too, but that's not guaranteed. I'm not in a position to test it myself, so try this:

"(?|\"(?P<token>[^\"]*)\"|(?P<token>\\S+))"

...and if that doesn't work, try it with plain old numbered groups.
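
The numbered-group fallback might look like the sketch below. It uses std::regex rather than Boost so that it has no external dependency (an assumption for illustration, not what the question uses), and the function name tokenize_numbered is hypothetical. The same pattern, read through what[1] and what[2], should carry over to Boost, though that is untested here.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `input` on whitespace, treats "double-quoted"
// runs as single tokens, and drops the surrounding quotes.
std::vector<std::string> tokenize_numbered(const std::string& input)
{
    // Group 1: contents of a quoted token. Group 2: a bare, non-space token.
    static const std::regex pattern("\"([^\"]*)\"|(\\S+)");

    std::vector<std::string> tokens;
    for (std::sregex_iterator cur(input.begin(), input.end(), pattern), end;
         cur != end; ++cur)
    {
        const std::smatch& what = *cur;
        // `matched` tells the groups apart even when the quoted token is empty.
        tokens.push_back(what[1].matched ? what[1].str() : what[2].str());
    }
    return tokens;
}
```

Because each alternative has its own group number, no name is reused and the "named mark already exists" problem does not arise; the cost is that the caller has to remember which number means which kind of token.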

0
On

While looking through the answers here I tested another method: give the two groups different mark names and check which one is blank for each match. While it is probably not the fastest code, it is the most readable solution so far, which matters more for my problem.

Here is the code that worked for me:

    #include <string>
    #include <vector>
    #include <boost/xpressive/xpressive.hpp>
    using namespace boost::xpressive;
...
    std::vector<std::string> tokens;
    std::string input = "here is a \"test string\"";
    // Two differently named groups: one for quoted text, one for bare words.
    sregex pattern = sregex::compile("\"(?P<quoted>[^\"]*)\"|(?P<unquoted>\\S+)");
    sregex_iterator cur( input.begin(), input.end(), pattern );
    sregex_iterator end;

    while(cur != end)
    {
      smatch const &what = *cur;
      // At most one of the two groups is non-empty for any given match.
      if(what["quoted"].length() > 0)
      {
        tokens.push_back(what["quoted"]);
      }
      else
      {
        tokens.push_back(what["unquoted"]);
      }
      ++cur;
    }