I've just discovered the root cause of some very confusing behavior I was observing. Here is a test:
@Test
public void test2() {
  Terminals terminals = Terminals.caseInsensitive(new String[] {}, new String[] { "true", "false" });
  Object result = terminals.tokenizer().parse("d");
  System.out.println("Result: " + result);
}
This outputs:
Result: d
I was expecting the parser returned by terminals.tokenizer() to fail, because "d" is not a valid keyword or operator.
The reason I care is that I wanted my own parser at a lower priority than the one returned by terminals.tokenizer():
public static final Parser&lt;?&gt; INSTANCE =
    Parsers.or(
        STRING_TOKENIZER,
        NUMBER_TOKENIZER,
        WHITESPACE_TOKENIZER,
        (Parser&lt;Token&gt;) TERMINALS.tokenizer(),
        IDENTIFIER_TOKENIZER);
The IDENTIFIER_TOKENIZER above is never used, because TERMINALS.tokenizer() always matches.
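The shadowing described here is just ordered choice: Parsers.or keeps the first alternative that succeeds, so an earlier alternative that matches everything hides all later ones. A self-contained sketch of that behavior (plain Java, not jParsec; the OrderedChoice helper is hypothetical):

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

public class OrderedChoice {
    // Ordered choice, as in Parsers.or: try alternatives in order and
    // return the first success. An earlier alternative that accepts the
    // input means later alternatives are never consulted.
    static Optional<String> or(String input, List<Function<String, Optional<String>>> alts) {
        for (Function<String, Optional<String>> alt : alts) {
            Optional<String> r = alt.apply(input);
            if (r.isPresent()) {
                return r;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // A "word" tokenizer accepting any lowercase run, standing in for the
        // word scanner inside Terminals, plus a lower-priority identifier tokenizer.
        Function<String, Optional<String>> word =
            s -> s.matches("[a-z]+") ? Optional.of("WORD:" + s) : Optional.empty();
        Function<String, Optional<String>> identifier =
            s -> Optional.of("IDENT:" + s);
        // "word" already matches "d", so "identifier" never runs.
        System.out.println(or("d", List.of(word, identifier)).get()); // WORD:d
    }
}
```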
Why does Terminals.tokenizer() tokenize unregistered operators/keywords? And how might I get around this?
In the upcoming jParsec 2.2 release, the API makes it clearer what Terminals does: http://jparsec.github.io/jparsec/apidocs/org/codehaus/jparsec/Terminals.Builder.html
You cannot even define your keywords without first providing a scanner that defines "words".
The implementation first uses the provided word scanner to find all words, and then identifies the special keywords on the scanned words.
So, why does it do it this way?
1. Words and keywords are tokenized uniformly; you use Terminals.token(op) or Terminals.token(keyword) to get the token-level parser for them. What distinguishes operators from keywords is just that keywords are "special" words; whether they happen to be alphabetic characters or other printable characters is just convention.
2. If you only want the keywords themselves tokenized, define them directly with Parsers.or(Scanners.string("keyword1"), Scanners.string("keyword2"), ...). Then Terminals won't try to tokenize anything else.
3. You may not need two-phase parsing (Parser.from(tokenizer, delim)) at all. If two-phase parsing isn't needed, it can be as simple as: or(stringCaseInsensitive("true"), stringCaseInsensitive("false"))
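If only the literal keywords should be tokenized, so that an unregistered word like "d" fails and control falls through to a lower-priority tokenizer, a sketch using Scanners.stringCaseInsensitive might look like this (the class and field names are illustrative, not part of jParsec):

```java
import org.codehaus.jparsec.Parser;
import org.codehaus.jparsec.Parsers;
import org.codehaus.jparsec.Scanners;

public class KeywordOnlyTokenizer {
    // Matches only the registered keywords; source() captures the matched text.
    // An unregistered word such as "d" fails here instead of being tokenized.
    static final Parser<String> KEYWORDS = Parsers.or(
        Scanners.stringCaseInsensitive("true").source(),
        Scanners.stringCaseInsensitive("false").source());

    public static void main(String[] args) {
        System.out.println(KEYWORDS.parse("true")); // prints: true
        // KEYWORDS.parse("d") would throw a ParserException.
    }
}
```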
More on point 3: the two-phase parsing creates a few extra caveats in jParsec that you don't find in other parser combinator libraries like Haskell's Parsec. In Haskell, a string is no different from a list of characters, so there really isn't anything to gain by special-casing them: many (char 'x') parses a string just fine. In Java, String isn't a List or array of char, and it would be very inefficient to take the same approach and box each character into a Character object so that the character-level and token-level parsers could be unified seamlessly.
Now that explains why we have character-level parsers at all. But it's completely optional to use the token-level parsers (by that, I mean Terminals, Parser.from(), Parser.lexer(), etc.). You could create a fully functional parser with only character-level parsers, a.k.a. scanners.
For example:
Scanners.string("true").or(Scanners.string("false")).sepEndBy1(delim)
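Fleshing that example out into a complete single-phase sketch, with Scanners.WHITESPACES standing in for the unspecified delim (an assumption, as is the class name):

```java
import java.util.List;
import org.codehaus.jparsec.Parser;
import org.codehaus.jparsec.Scanners;

public class SinglePhase {
    // Character-level parsing only: each boolean literal is captured as its
    // source text, separated (and optionally terminated) by whitespace.
    static final Parser<List<String>> BOOLEANS =
        Scanners.string("true").or(Scanners.string("false")).source()
            .sepEndBy1(Scanners.WHITESPACES);

    public static void main(String[] args) {
        System.out.println(BOOLEANS.parse("true false true")); // [true, false, true]
    }
}
```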