I have strings with words like this:
- "ABC Some other stuff" (normaly there is a short letter combination at the beginning)
- "Some other stuff" (sometimes there is nothing interesting)
- "HFG 54 Some other stuff and even more" (sometimes there is an interesting number)
- "HFG 54 ZZ Some other stuff and even more" (sometimes there is ZZ after the number)
- "HFG-54 ZZ Some other stuff and even more" (soemtimes there is a dash)
- "ZT SOME OTHER STUFF" (The rest can be capitalized too)
- "(ZT) Some other stuff" (The part can be in brackets)
- "68 Some other stuff " (There can only be a number)
- "Some other stuff DFG" (can be at the end)
I made some rules to parse it and with larks Early-parser it works fine. Now i want to try it out with the Lalr-Parser but then the special words are not recognized. I am interested in the capitalized letter combinations and the numbers. I have a list of the possible letter combinations. The numbers are always two digits. The rest of the string can be anything.
I am using lark. This is my parser:
from lark import Lark
tryoutabc = Lark(r"""
start: mat_all_and_restl
?mat_all_and_restl: mat_all restl | restl | restl mat_br
restl: REST*
?mat_all: mat_br| mat_num| num
mat_num: mat_br ("-")? NUM
?num: NUM ("ZZ")?
NUM: /\d{2}/
?mat_br: "("? MAT ")"?
MAT: "ABC"
| "HFG"
| "ZT"
| "DFG"
REST: /([\-])?[A-Za-zÜ]+([\/.][A-Za-zÜ]+)?/
%import common.WS
%ignore WS
""", start='start', parser='lalr') #remove "parser='lalr'" and the results are right
How do i have to change the rules to make it possible with the lalr-parser to parse this?
To try it out:
textl = ["ABC SOME OTHER STUFF", "Some other stuff", "HFG 54 Some other stuff and even more",
"HFG 54 ZZ Some other stuff and even more", "HFG-54 Some other stuff and even more", "ZT Some other stuff",
"(ZT) Some other stuff", "Some other stuff DFG", "Some other stuff (DFG)", "68 Some other stuff "]
for text in textl:
print(tryoutabc.parse(text).pretty())
I get
start
restl
ABC
SOME
OTHER
STUFF
but i want
start
ABC
restl
SOME
OTHER
STUFF
For "HFG 54 Some other stuff and even more"
i get an error:
HFG 54 Some other stuff and even more
^
Expecting: {'REST'}`
but i want:
start
mat_num
HFG
54
restl
Some
other
stuff
and
even
more
In reality the strings are longer and look like this "Interesting stuff i already parsed ABC Some other stuff" and i already parsed the stuff at the beginning of the strings and it works pefect.
From the comments it appears that it is not possible to do that because i dont have a context free language here and apparently larl can only do cfg-languages. If somebody adds and answer where this both is explained quick i would be happy.
The problem is the
REST
terminal. It can parse almost anything by design, meaning thatand
Are both valid interpretations of the first example sentences (
ABC SOME OTHER STUFF
). This is something you can take a look at if you passambiguity='explicit'
to an earley parser.This example works with
lalr
, since there isn't a problem iflalr
just interprets everything asREST
, but it does give you the wrong result.The problematic example
"HFG 54 Some other stuff and even more"
on the other hand doesn't work. It takesHFG
asREST
, and then tries to parse54
which it can since it isn'tREST
and it doesn't expect aNUM
right now.If this is the full extended of you grammar, use
earley
, or for even more speed than lalr could provided, a (or multiple) regex might also do the job quite well.(Note, I misspoke: this is potentially a CFG, just ambiguous. You could probably continue to hack on the grammar/the
REST
regex to make it work, but it won't be pretty.)