I have read the RFC on the ABNF specification and am having difficulty understanding how a set of ABNF rules could be used to reliably extract tokens from some input string that matches the grammar. It seems that the specification doesn't ever mention tokens or ASTs, so it may not concern itself with that, but I believe that would be the ultimate goal of applying any BNF grammar, unless I am mistaken.
In the specification, they list example rules for parsing a postal-address:
postal-address = name-part street zip-part
name-part = *(personal-part SP) last-name [SP suffix] CRLF
name-part =/ personal-part CRLF
personal-part = first-name / (initial ".")
first-name = *ALPHA
initial = ALPHA
last-name = *ALPHA
suffix = ("Jr." / "Sr." / 1*("I" / "V" / "X"))
street = [apt SP] house-num SP street-name CRLF
apt = 1*4DIGIT
house-num = 1*8(DIGIT / ALPHA)
street-name = 1*VCHAR
zip-part = town-name "," SP state 1*2SP zip-code CRLF
town-name = 1*(ALPHA / SP)
state = 2ALPHA
zip-code = 5DIGIT ["-" 4DIGIT]
There is also a list of core rules that I won't post here describing expected common-usage rules.
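For reference, the few core rules that the example actually relies on can be written as regular-expression fragments. This is only my own rough translation into Python: the `CORE` dictionary and the `matches_core` helper are names I made up for illustration, and only the character ranges in the comments come from the spec.

```python
import re

# Core rules used by the postal-address example, as regex fragments.
# The dictionary is an illustration; only the ranges are taken from the spec.
CORE = {
    "ALPHA": r"[A-Za-z]",     # %x41-5A / %x61-7A
    "DIGIT": r"[0-9]",        # %x30-39
    "SP":    r" ",            # %x20
    "CRLF":  r"\r\n",         # CR LF
    "VCHAR": r"[\x21-\x7E]",  # visible (printing) characters
}

def matches_core(rule, text):
    """Return True if the whole of `text` is one match of the named core rule."""
    return re.fullmatch(CORE[rule], text) is not None
```

Note that each core rule here matches a single character (or, for CRLF, a fixed pair), so on their own they say nothing about where a longer token should start or stop.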
Ultimately, what I would like to do is figure out the rules necessary for taking the input
John H. Doe
12345 Fakestreet
Springfield, IL 55555
and generating what I believe would be the correct token sequence which is:
["John", " ", "H", ".", " ", "Doe", "\r\n",
"12345", " ", "Fakestreet", "\r\n",
"Springfield", ",", " ", "IL", " ", "55555", "\r\n"]
(I believe the spaces and CRLFs need to be returned as "tokens" because they are specified as requirements in certain rules)
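To make the target concrete, this is roughly the behaviour I am after, written as a hand-rolled Python scanner. It is not derived from the ABNF at all: the `TOKEN` pattern and the `tokenize` function are my own guess at sensible token classes (a run of letters, a run of digits, a CRLF, or any other single character), and choosing those classes is exactly the step I do not know how to obtain from the grammar.

```python
import re

# Hand-written token classes (an assumption, not generated from the ABNF):
# a run of letters, a run of digits, a CRLF pair, or any other single character.
TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|\r\n|.", re.DOTALL)

def tokenize(text):
    """Split text into letter runs, digit runs, CRLFs and single characters."""
    return TOKEN.findall(text)

address = "John H. Doe\r\n12345 Fakestreet\r\nSpringfield, IL 55555\r\n"
tokens = tokenize(address)
```

For the example input this yields one token per word, number, punctuation mark, space and CRLF. It ignores the grammar entirely, though: a street name such as "Fake-Street" would come out as three tokens even though 1*VCHAR covers it as a whole.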
Some problems I am considering:
- It makes sense that "Fakestreet" should be its own token, but according to the definition it is a variable repetition of the visible-character core rule. Ideally I would not like to read out each letter as its own token ("F", "a", "k", and so on), so (assuming core rules can be treated as terminals?) any potential token string would need to be checked against the entire, theoretically infinite, rule definition 1*VCHAR to see if it is a match. And some rules are more complicated than that, like zip-code's 5DIGIT ["-" 4DIGIT], but any potential token needs to be checked against this rule as well ("12345" and "12345-6789" are both valid tokens). So it seems like entire rule element concatenations need to be checked completely as well, unless "12345-6789" should rather be tokenized as ["12345", "-", "6789"], which... may be correct?
- I'd assume we would not want to completely check rules that reference other rules, otherwise we may end up tokenizing the entire postal-address as a single token of type "postal-address". Maybe rules that reference other rules shouldn't be checked? Maybe there is such a thing as a "terminal-rule" that includes no rule refs (excluding core rules)?
- Occasionally in the rules, terminal values are combined with rule references, for instance in the definition of "personal-part", the literal "." is defined. So, while we may not want to match any potential token string against the entire "personal-part" rule definition, it seems we do want to try to match it against the literal "." because it is a required token for parsing a personal-part. Maybe in non-terminal rules, terminal values listed there should be considered?
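The "terminal-rule" idea and the inline-literal idea from the list above can be sketched in Python as follows. Everything here is my own assumption rather than anything the spec defines: "terminal rule" is my term, the names `GRAMMAR`, `CORE_RULES`, `rule_refs`, `is_terminal_rule` and `inline_literals` are hypothetical, and the rule bodies are scanned with simple regular expressions instead of a real ABNF parser.

```python
import re

# Core rules referenced by the example (a subset of the spec's core rules).
CORE_RULES = {"ALPHA", "DIGIT", "SP", "CRLF", "VCHAR"}

# The postal-address rules as plain strings. The incremental alternative
# (name-part =/ ...) is folded into name-part with "/" for simplicity.
GRAMMAR = {
    "postal-address": 'name-part street zip-part',
    "name-part": '*(personal-part SP) last-name [SP suffix] CRLF / personal-part CRLF',
    "personal-part": 'first-name / (initial ".")',
    "first-name": '*ALPHA',
    "initial": 'ALPHA',
    "last-name": '*ALPHA',
    "suffix": '("Jr." / "Sr." / 1*("I" / "V" / "X"))',
    "street": '[apt SP] house-num SP street-name CRLF',
    "apt": '1*4DIGIT',
    "house-num": '1*8(DIGIT / ALPHA)',
    "street-name": '1*VCHAR',
    "zip-part": 'town-name "," SP state 1*2SP zip-code CRLF',
    "town-name": '1*(ALPHA / SP)',
    "state": '2ALPHA',
    "zip-code": '5DIGIT ["-" 4DIGIT]',
}

def rule_refs(definition):
    """Names referenced by a rule body, ignoring quoted literals."""
    without_literals = re.sub(r'"[^"]*"', " ", definition)
    return set(re.findall(r"[A-Za-z][A-Za-z0-9-]*", without_literals))

def is_terminal_rule(name):
    """A rule is 'terminal' here if it references nothing but core rules."""
    return rule_refs(GRAMMAR[name]) <= CORE_RULES

def inline_literals(name):
    """Quoted literals that appear directly in a rule body."""
    return re.findall(r'"([^"]*)"', GRAMMAR[name])

terminal_rules = sorted(n for n in GRAMMAR if is_terminal_rule(n))
punctuation = {n: inline_literals(n) for n in GRAMMAR if not is_terminal_rule(n)}
```

Under these assumptions, zip-code and street-name count as terminal rules and would each be matched as a whole, while personal-part and zip-part are non-terminal and contribute only their literals "." and ",". Whether that split is what a tokenizer for an ABNF grammar is supposed to do is the part I am unsure about.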
I realize this is a lengthy question. BNF supersets like EBNF and ABNF seem to be used for exactly this kind of thing, yet I cannot find a standard specification for how to tokenize input using an ABNF grammar.