Pyparsing DOT record label (recursion)

132 Views Asked by At

I am trying to use pyparsing to parse the label used for DOT records. The syntax is:

rlabel → '{' field ( '|' field )* '}'
field → boxLabel | rlabel
boxLabel → [ ’<’ string ’>’ ] [ string ]

I have two issues. One, I am not sure how to handle boxLabel because it has to optionals. Two, I think I have a left recursion but since there is no "operator" not sure how to deal with it. What I've done so far:

langle = pyparsing.Literal('<')
rangle = pyparsing.Literal('>')
string = pyparsing.Word('alphanums')
port_id = pyparsing.Group(langle + string + rangle)
port_name = pyparsing.Group(string)
box_label = pyparsing.Group(pyparsing.Optional(port_id) + pyparsing.Optional(port_name))
rlabel = pyparsing.Forward()
field = box_label | rlabel
rlabel_content = pyparsing.ZeroOrMore(field + "|") + field
rlabel << pyparsing.nestedExpr(opener='{', content=rlabel_content, closer='}')

Parsing goes into infinite recursion.

Further,

>> print(box_label.parseString("<h> root"))
[[['<', 'h', '>']]]

and I have no idea why the port_name parsing is not present in the results.

1

There are 1 best solutions below

2
On BEST ANSWER

Congrats on making this much progress on your pyparsing project, recursive grammars are tricky to get your head around the first time you do them.

First, a minor typo, but one which is probably puzzling you in that it creates a valid parser, but a strangely limited one.

string = pyparsing.Word('alphanums')

should be

string = pyparsing.Word(pyparsing.alphanums)

When you pass a string to Word, that defines the list of valid characters to use when parsing such a character group. So Word("0123456789") will parse an integer, Word("AB") will parse any word made up of the letters "A" and "B". Word('alphanums') will parse "alpha", "nums", "plums", "haphalump" and many other words, as long as they are made up of the letters in 'alphanums'. I'm pretty sure you want the pyparsing-defined string pyparsing.alphanums, which includes the characters 'A'-'Z', 'a'-'z', and '0'-'9'.

Let's tackle the box_label question first. In general, you can implement "A and B or just A or just B" as:

A + Optional(B) | B

Or if you want to be more explicit, you can do:

A + B | A | B

But this starts to get hairy when you have to work with more than 2 expressions. Fortunately, you only have 2, and I chose the first format for my proposed solution below.

Part of the problem you are having with recursion is that you are using both Forward and nestedExpr - usually just one or the other is sufficient. But in your case, there is an advantage to using Forward, so I'll use that method.

(I've also imported pyparsing as pp, to cut down on all the extra chars.)

import pyparsing as pp

# suppress these punctuation marks - useful during parsing, but just in the way afterward
langle = pp.Literal('<').suppress()
rangle = pp.Literal('>').suppress()
lbrace = pp.Literal('{').suppress()
rbrace = pp.Literal('}').suppress()

string = pp.Word(pp.alphanums)

port_id = langle + string("port_id") + rangle
port_name = string("port_name")

rlabel = pp.Forward()
box_label = pp.Group(port_id + pp.Optional(port_name) | port_name)
field = box_label | rlabel
# use delimitedList(expr) in place of expr + ZeroOrMore(delim + expr)
rlabel <<= pp.Group(lbrace + pp.delimitedList(field, delim='|') + rbrace)

Create yourself a set of test strings, and call runTests with them to see how your parser behaves:

field.runTests("""\
    <h> root
    <h>
    root
    { <h> root | { <h2> sub | <h2>sub2 }}
    { <h> root | { <h2> sub | <h2>sub2 } | sub3 }
""")

runTests will try to parse each given line, and then either dump the results, or show the parse exception:

<h> root
[['h', 'root']]
[0]:
  ['h', 'root']
  - port_id: 'h'
  - port_name: 'root'


<h>
[['h']]
[0]:
  ['h']
  - port_id: 'h'


root
[['root']]
[0]:
  ['root']
  - port_name: 'root'


{ <h> root | { <h2> sub | <h2>sub2 }}
[[['h', 'root'], [['h2', 'sub'], ['h2', 'sub2']]]]
[0]:
  [['h', 'root'], [['h2', 'sub'], ['h2', 'sub2']]]
  [0]:
    ['h', 'root']
    - port_id: 'h'
    - port_name: 'root'
  [1]:
    [['h2', 'sub'], ['h2', 'sub2']]
    [0]:
      ['h2', 'sub']
      - port_id: 'h2'
      - port_name: 'sub'
    [1]:
      ['h2', 'sub2']
      - port_id: 'h2'
      - port_name: 'sub2'


{ <h> root | { <h2> sub | <h2>sub2 } | sub3 }
[[['h', 'root'], [['h2', 'sub'], ['h2', 'sub2']], ['sub3']]]
[0]:
  [['h', 'root'], [['h2', 'sub'], ['h2', 'sub2']], ['sub3']]
  [0]:
    ['h', 'root']
    - port_id: 'h'
    - port_name: 'root'
  [1]:
    [['h2', 'sub'], ['h2', 'sub2']]
    [0]:
      ['h2', 'sub']
      - port_id: 'h2'
      - port_name: 'sub'
    [1]:
      ['h2', 'sub2']
      - port_id: 'h2'
      - port_name: 'sub2'
  [2]:
    ['sub3']
    - port_name: 'sub3'