Lark how to describe a series of optional tokens

1.7k Views Asked by At

I am parsing a file with a format that can include:

INT32  price   min 10  max 100   alertIfSold ; 

The min, max and alertIfSold tokens are all optional and can appear in any order. That is

INT32  price    max 100   alertIfSold ; 
INT32  price  max 100   min 10    alertIfSold ;
INT32  price  alertIfSold ;
INT32  price; 

are all valid examples.

Below is a simple version of the grammar I'm testing. running python test.py generate this error:

lark.common.ParseError: Infinite recursion detected! (rule <__anon_star_1 : __anon_star_1>)

I've tried expressing the same optional tokens using other grammar rules, with similar results (Infinite recursion).

What is the proper grammar to express the optional parameters?

#test.py
from lark import lark

simplified_grammar = """
    start: line+
    line:  TYPE  CNAME [MIN MAX ALERT]* ";"    -> foo

     TYPE: "INT32" | "INT64"

     MIN: "min" /[0-9]+/
     MAX: "max" /[0-9]+/
     ALERT: "alertIfSold"

     %import common.CNAME
     %import common.WS
     %ignore WS
  """

sample = """
    INT32  price    max 100   alertIfSold ; 
    INT32  price  max 100   min 10    alertIfSold ;
    INT32  price  alertIfSold ;
    INT32  price; 

"""

parser = lark.Lark(simplified_grammar)


def main():
    parse_tree = parser.parse(sample)

if __name__ == '__main__':
    main()
1

There are 1 best solutions below

1
On BEST ANSWER

You want:

line:  TYPE  CNAME (MIN | MAX | ALERT)* ";"    -> foo

(Note: () instead of [].)

In lark's EBNF syntax, [item] means "an optional item" and item* means "any number (possibly zero) of item". So [item]* means "any number (possibly zero) of either item or nothing". But "any number of nothing" is infinitely ambiguous; you cannot tell how many nothings there are in an empty string.

Since you actually don't intend to require that the clauses appear in strict succession, you might have been thinking of

line:  TYPE  CNAME ([MIN] [MAX] [ALERT])* ";"    -> foo

That would have been more accurate but it would also have produced the same error message. In general, you cannot use the Kleene star on a nullable subpattern. Some EBNF generators will correct that by removing ε from the repeated set (and then making the repetition as a whole optional), but lark is not one of those. In this case, the fix is trivial but there are other cases in which it is more annoying.

As regular expressions, (a* b*)*, (a? b?)* and (a|b)* are equivalent in the sense that they all recognise the same language. But regular expressions are notoriously ambiguous, and parsers usually prefer unambiguous grammars, or at worst finitely ambiguous grammars. Only the last regular expression falls into that category, and it is the form you should normally prefer.