Greedy expressions in Pyparsing

I'm trying to split a string like aaa:bbb(123) into tokens using Pyparsing.

I can do this with a regular expression, but I need to do it with Pyparsing.

With re, the solution looks like this:

>>> import re
>>> string = 'aaa:bbb(123)'
>>> regex = r'(\S+):(\S+)\((\d+)\)'
>>> re.match(regex, string).groups()
('aaa', 'bbb', '123')

This is clear and simple enough. The key point here is \S+, which means "one or more non-whitespace characters".

Now I'll try to do this with Pyparsing:

>>> from pyparsing import Word, Suppress, nums, printables
>>> expr = (
...     Word(printables, excludeChars=':')
...     + Suppress(':')
...     + Word(printables, excludeChars='(')
...     + Suppress('(')
...     + Word(nums)
...     + Suppress(')')
... )
>>> expr.parseString(string).asList()
['aaa', 'bbb', '123']

Okay, we get the same result, but this does not look good. We set excludeChars to make the Pyparsing expressions stop where we need them to, but this isn't robust. If the source string contains the "excluded" characters, the same regex still works fine:

>>> string = 'a:aa:b(bb(123)'
>>> re.match(regex, string).groups()
('a:aa', 'b(bb', '123')

while the Pyparsing expression obviously breaks:

>>> expr.parseString(string).asList()
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/long/path/to/pyparsing.py", line 1111, in parseString
    raise exc
ParseException: Expected W:(0123...) (at char 7), (line:1, col:8)

So the question is: how can we implement the same logic with Pyparsing?

There are 2 best solutions below

0
On

Unlike a regex, pyparsing matches purely left to right, with no implicit lookahead or backtracking.
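To make the difference concrete, here is a minimal sketch (the variable names are only illustrative) that runs the same "anything, colon, anything" pattern through re and through pyparsing. It assumes a pyparsing version that still accepts the camelCase parseString used in the question.

```python
import re
from pyparsing import ParseException, Suppress, Word, printables

text = 'aaa:bbb'

# The regex engine lets \S+ grab everything, then backs off until ':' can match.
regex_groups = re.match(r'(\S+):(\S+)', text).groups()

# Word(printables) also grabs everything, including ':', but never gives any
# of it back, so the Suppress(':') that follows has nothing left to match.
greedy = Word(printables) + Suppress(':') + Word(printables)
try:
    greedy.parseString(text)
    pyparsing_matched = True
except ParseException:
    pyparsing_matched = False

print(regex_groups)       # ('aaa', 'bbb')
print(pyparsing_matched)  # False
```

That is why the question's version needed excludeChars: each Word has to be told up front where to stop.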

If you want regex's lookahead and backtracking, you could just use a Regex containing your original re:

from pyparsing import Regex
string = 'aaa:b(bb(123)'
expr = Regex(r"(\S+):(\S+)\((\d+)\)")
print(expr.parseString(string).dump())

['aaa:b(bb(123)']

However, I see that this returns just the whole match as a single string. If you want to be able to access the individual groups, you'll have to define them as named groups:

expr = Regex(r"(?P<field1>\S+):(?P<field2>\S+)\((?P<field3>\d+)\)")
print(expr.parseString(string).dump())

['aaa:b(bb(123)']
- field1: aaa
- field2: b(bb
- field3: 123    

This suggests to me that a good enhancement would be to add a constructor argument to Regex that returns the results as a list of all the re groups rather than as the single matched string.
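In the meantime, one possible workaround is a parse action that swaps the single matched string for the named groups. This is only a sketch: the field names are the placeholder ones from above, and it relies on Regex exposing named groups as results names, as the dump() output shows.

```python
from pyparsing import Regex

expr = Regex(r"(?P<field1>\S+):(?P<field2>\S+)\((?P<field3>\d+)\)")

# Replace the one-string token list with the three named groups, in order.
# Returning a list from a parse action substitutes it for the original tokens.
expr.setParseAction(
    lambda tokens: [tokens["field1"], tokens["field2"], tokens["field3"]]
)

result = expr.parseString("a:aa:b(bb(123)").asList()
print(result)  # ['a:aa', 'b(bb', '123']
```

The named results are dropped once the parse action returns a plain list, so this suits cases where only the positional tokens matter.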

0
On

Use a regex with a look-ahead assertion:

from pyparsing import Word, Suppress, Regex, nums, printables

expr = (
    Word(printables, excludeChars=':')
    + Suppress(':')
    + Regex(r'\S+[^\(](?=\()')
    + Suppress('(')
    + Word(nums)
    + Suppress(')')
)
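
A quick check of this expression against both strings from the question (a sketch; the variable names simple and tricky are just for illustration). The first field is still a Word that stops at the first colon, so on the harder input the split should land at the first ':' rather than the last, which is not quite what the original regex returns.

```python
from pyparsing import Regex, Suppress, Word, nums, printables

expr = (
    Word(printables, excludeChars=':')
    + Suppress(':')
    + Regex(r'\S+[^\(](?=\()')  # greedy, but must end just before a '('
    + Suppress('(')
    + Word(nums)
    + Suppress(')')
)

simple = expr.parseString('aaa:bbb(123)').asList()
tricky = expr.parseString('a:aa:b(bb(123)').asList()

print(simple)  # ['aaa', 'bbb', '123']
print(tricky)  # ['a', 'aa:b(bb', '123']
```

Note also that \S+[^\(] needs at least two characters, so a one-character second field such as in 'aaa:b(123)' would not match as written.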