I'm trying to parse a simple text like this:
test abc
Lark grammar is here:
start: test
test: "test" _WSI name _NL
name: (LETTER | DIGIT | "_")+
%import common.WS_INLINE -> _WSI
%import common.NEWLINE -> _NL
%import common.LETTER
%import common.DIGIT
Now if I print and pretty_print it, 'name' is split into separate tokens:
Tree(Token('RULE', 'start'), [Tree(Token('RULE', 'test'), [Tree(Token('RULE', 'name'), [Token('LETTER', 'a'), Token('LETTER', 'b'), Token('LETTER', 'c')])])])
start
  test
    name
      a
      b
      c
Why? I want to have that name as a string, not separate characters...
What is happening
Lark uses different naming conventions for nonterminals (rules) and terminals. A nonterminal is defined with a lower-case name, while a terminal is defined with an UPPER-CASE name.
Because of this distinction, Lark reads name as a nonterminal with the production rule name: (LETTER | DIGIT | "_")+. Every LETTER or DIGIT it matches becomes a separate token under a name subtree, which is the tree structure you see in your output.
How to solve
Since you are not interested in the tree structure and want a single token, define a terminal instead. This means renaming name to NAME, both in its definition and where the test rule refers to it. Doing so tells Lark to match the input as a single token, a terminal. With this change, the test rule becomes test: "test" _WSI NAME _NL and the definition becomes NAME: (LETTER | DIGIT | "_")+.
Parsing the same input then yields a single Token('NAME', 'abc') under test instead of three LETTER tokens, which solves the issue you encountered.