I am parsing a file with a format that can include:
INT32 price min 10 max 100 alertIfSold ;
The min, max and alertIfSold tokens are all optional and can appear in any order. That is
INT32 price max 100 alertIfSold ;
INT32 price max 100 min 10 alertIfSold ;
INT32 price alertIfSold ;
INT32 price;
are all valid examples.
Below is a simple version of the grammar I'm testing. running python test.py generate this error:
lark.common.ParseError: Infinite recursion detected! (rule <__anon_star_1 : __anon_star_1>)
I've tried expressing the same optional tokens using other grammar rules, with similar results (Infinite recursion).
What is the proper grammar to express the optional parameters?
#test.py
from lark import lark
simplified_grammar = """
start: line+
line: TYPE CNAME [MIN MAX ALERT]* ";" -> foo
TYPE: "INT32" | "INT64"
MIN: "min" /[0-9]+/
MAX: "max" /[0-9]+/
ALERT: "alertIfSold"
%import common.CNAME
%import common.WS
%ignore WS
"""
sample = """
INT32 price max 100 alertIfSold ;
INT32 price max 100 min 10 alertIfSold ;
INT32 price alertIfSold ;
INT32 price;
"""
parser = lark.Lark(simplified_grammar)
def main():
parse_tree = parser.parse(sample)
if __name__ == '__main__':
main()
You want:
(Note:
()
instead of[]
.)In lark's EBNF syntax,
[item]
means "an optionalitem
" anditem*
means "any number (possibly zero) ofitem
". So[item]*
means "any number (possibly zero) of eitheritem
or nothing". But "any number of nothing" is infinitely ambiguous; you cannot tell how many nothings there are in an empty string.Since you actually don't intend to require that the clauses appear in strict succession, you might have been thinking of
That would have been more accurate but it would also have produced the same error message. In general, you cannot use the Kleene star on a nullable subpattern. Some EBNF generators will correct that by removing ε from the repeated set (and then making the repetition as a whole optional), but lark is not one of those. In this case, the fix is trivial but there are other cases in which it is more annoying.
As regular expressions,
(a* b*)*
,(a? b?)*
and(a|b)*
are equivalent in the sense that they all recognise the same language. But regular expressions are notoriously ambiguous, and parsers usually prefer unambiguous grammars, or at worst finitely ambiguous grammars. Only the last regular expression falls into that category, and it is the form you should normally prefer.