Nearley Tokenizers vs Rules


I'm pretty new to nearley.js, and I would like to know what tokenizers/lexers do compared to rules. According to the website:

By default, nearley splits the input into a stream of characters. This is called scannerless parsing. A tokenizer splits the input into a stream of larger units called tokens. This happens in a separate stage before parsing. For example, a tokenizer might convert 512 + 10 into ["512", "+", "10"]: notice how it removed the whitespace, and combined multi-digit numbers into a single number.
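If I'm reading that right, a tokenizer such as moo (which nearley supports) would do something like the following; the rule names here are just my own sketch:

const moo = require("moo");

const lexer = moo.compile({
  ws:     /[ \t]+/,
  number: /[0-9]+/,
  plus:   "+",
});

lexer.reset("512 + 10");

// Collect the token values, dropping the whitespace tokens:
const values = [];
let tok;
while ((tok = lexer.next())) {
  if (tok.type !== "ws") values.push(tok.value);
}
console.log(values); // [ "512", "+", "10" ]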

Wouldn't that be the same as:

Math -> Number _ "+" _ Number
Number -> [0-9]:+
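where _ is optional whitespace. A complete version of that grammar (just a sketch of what I mean, with postprocessors added to compute the result) might be:

# A scannerless grammar: every rule works directly on characters.
Math   -> Number _ "+" _ Number  {% d => d[0] + d[4] %}
Number -> [0-9]:+                {% d => parseInt(d[0].join("")) %}
_      -> [\s]:*                 {% () => null %}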

I don't see what the purpose of lexers is. Rules seem perfectly usable in this case, so there appears to be no need for a lexer.

1 Answer


After fiddling around with them, I found out what tokenizers are for. Say we had the following:

Keyword -> "if" | "else"
Identifier -> [a-zA-Z_]:+

This won't work: the grammar is ambiguous, because "if" can be matched as both a Keyword and an Identifier. A tokenizer, however:

{
  "keyword": /if|else/,
  "identifier": /[a-zA-Z_]+/
}

does not have this problem: the lexer decides up front whether "if" is a keyword or an identifier, so the parser never sees both possibilities (at least with the tokenizer shown in this example, which is moo).
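For reference, wiring a moo lexer like that into a nearley grammar looks roughly like this (just a sketch; I added a ws rule so spaces don't throw, and the %token names have to match the lexer's rule names):

@{%
const moo = require("moo");

const lexer = moo.compile({
  ws:         /[ \t]+/,   // added so whitespace in the input doesn't throw
  keyword:    /if|else/,
  identifier: /[a-zA-Z_]+/,
});
%}

@lexer lexer

# Rules now match whole tokens from the lexer instead of single characters.
Keyword    -> %keyword    {% id %}
Identifier -> %identifier {% id %}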