Lark Parser: No terminal defined for ':' (Seeming bias against colon character ":")


I have the following rule (taken from SMTP - RFC5321):

!path : "<" [ a_d_l ":" ] mailbox ">"

When I try to parse this line:

<test.com:[email protected]>

I get the following error:

No terminal defined for ':'

Oddly, if I simply replace the ":" with "_", it works:

!path : "<" [ a_d_l "_" ] mailbox ">"
<[email protected]>

Input that omits the [ a_d_l ":" ] part (optional, as indicated by the []) also parses:

!path : "<" [ a_d_l ":" ] mailbox ">"
<[email protected]>

I already tried to define a terminal rule for the colon but this did not work either:

!path : "<" [ a_d_l COLON ] mailbox ">"
COLON : ":"
<[email protected]>

Minimal reproducible example:

As requested in the comments.

import traceback

from lark import Lark

grammar = r'''
!path               : "<" [ a_d_l ":" ] mailbox ">"
a_d_l               : at_domain ( "," at_domain )*    
at_domain           : "@" domain

domain                  : sub_domain ("." sub_domain)*
sub_domain              : let_dig [ldh_str]
let_dig                 : ALPHA | DIGIT
!ldh_str                 : ( ALPHA | DIGIT | "-" )* let_dig
address_literal         : "[" ( ipv4_address_literal | ipv6_address_literal | general_address_literal ) "]"
ipv4_address_literal    : snum ("."  snum)~3
snum                    : DIGIT~1..3
ipv6_address_literal    : "ipv6:" ipv6_addr
ipv6_addr               : ipv6_full | ipv6_comp | ipv6v4_full | ipv6v4_comp
ipv6_full               : ipv6_hex (":" ipv6_hex)~7
ipv6_hex                : HEXDIG~1..4
!ipv6_comp               : [ipv6_hex (":" ipv6_hex)~0..5] "::" [ipv6_hex (":" ipv6_hex)~0..5]
!ipv6v4_full             : ipv6_hex (":" ipv6_hex)~5 ":" ipv4_address_literal
!ipv6v4_comp             : [ipv6_hex (":" ipv6_hex)~0..3] "::" [ipv6_hex (":" ipv6_hex)~0..3 ":"] ipv4_address_literal
!general_address_literal : standardized_tag ":" dcontent+
standardized_tag        : ldh_str
dcontent                : /[\x21-\x5A|\x5E-\x7E]/

mailbox        : local_part /[\x40]/ ( domain | address_literal ) 
local_part     : dot_string | quoted_string 

dot_string     : atom ("."  atom)*
atom           : atext+
quoted_string  : /[\x22]/ qcontentsmtp* /[\x22]/
qcontentsmtp   : qtextsmtp | quoted_pairsmtp
quoted_pairsmtp  : /[\x5C\x5C]/ /[\x20-\x7E]/
qtextsmtp      : /[\x20-\x21|\x23-\[\]-\x7E]/
atext          : /[\x21|\x23-\x27|\x2A|\x2B|\x2D|\x2F-\x39|\x3D|\x3F|\x41-\x5A|\x5E-\x7E]/

command : [ path ]
%import common.WS       -> SP
%import common.NEWLINE  -> CRLF
%import common.DIGIT
%import common.LETTER   -> ALPHA
%import common.HEXDIGIT -> HEXDIG'''

text = "<test.com:[email protected]>"

try:
    result = Lark(grammar, start="command").parse(text)
except Exception as ex:
    # Report the failure and keep the full traceback for debugging.
    print('####### Parsing Failed')
    print(ex)
    traceback.print_exc()
    result = None
print(result)

Best answer:

!path               : "<" [ a_d_l ":" ] mailbox ">"
a_d_l               : at_domain ( "," at_domain )*    
at_domain           : "@" domain

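These rules will only match "<@test.com:[email protected]>". They cannot match "<test.com:[email protected]>" because it doesn't start with "<" at_domain or "<" mailbox: at_domain requires a leading "@", so the parser reads "test.com" as the local part of a mailbox and then expects "@", not ":". Replacing ":" with "_" only appears to work because "_" (\x5F) falls in the \x5E-\x7E range of atext, so the whole prefix is consumed as part of local_part.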
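
To see the shape of the three rules without running the full grammar, here is a rough regex mirror of them. It is only a sketch, not Lark code: DOMAIN and the local part are simplified stand-ins (letters, digits, dots and hyphens), matches_path is a hypothetical helper name, and user@example.org is a made-up address.

```python
import re

# Simplified stand-ins for the grammar's rules (assumption: domains and
# local parts are reduced to letters, digits, dots and hyphens).
DOMAIN = r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*"
AT_DOMAIN = rf"@{DOMAIN}"                    # at_domain : "@" domain
A_D_L = rf"{AT_DOMAIN}(?:,{AT_DOMAIN})*"     # a_d_l : at_domain ("," at_domain)*
MAILBOX = rf"[A-Za-z0-9.-]+@{DOMAIN}"        # mailbox : local_part "@" domain
# path : "<" [ a_d_l ":" ] mailbox ">"
PATH = re.compile(rf"<(?:{A_D_L}:)?{MAILBOX}>")


def matches_path(text: str) -> bool:
    """Return True if the whole string has the shape of the path rule."""
    return PATH.fullmatch(text) is not None


# Source route with the required "@" in front of the domain.
print(matches_path("<@test.com:user@example.org>"))
# Same input without the "@": "test.com" can only start a mailbox,
# and a mailbox must continue with "@", not ":".
print(matches_path("<test.com:user@example.org>"))
# No source route at all, which the optional [ ... ] allows.
print(matches_path("<user@example.org>"))
```

Only the second call should come back False, which is the same situation Lark reports when it hits the ":" in the original input.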