Why does Terminals.tokenizer() tokenize unregistered operators/keywords?

I've just discovered the root cause of some very confusing behavior I was observing. Here is a test:

@Test
public void test2() {
    Terminals terminals = Terminals.caseInsensitive(new String[] {}, new String[] { "true", "false" });
    Object result = terminals.tokenizer().parse("d");
    System.out.println("Result: " +  result);
}

This outputs:

Result: d

I was expecting the parser returned by terminals.tokenizer() not to return anything because "d" is not a valid keyword or operator.

The reason I care is that I wanted my own parser to run at a lower priority than the one returned by terminals.tokenizer():

public static final Parser<?> INSTANCE =
        Parsers.or(
                STRING_TOKENIZER,
                NUMBER_TOKENIZER,
                WHITESPACE_TOKENIZER,
                (Parser<Token>)TERMINALS.tokenizer(),
                IDENTIFIER_TOKENIZER);

The IDENTIFIER_TOKENIZER above is never used because TERMINALS.tokenizer() always matches.

Why does Terminals.tokenizer() tokenize unregistered operators/keywords? And how might I get around this?

There are 3 answers below.

Answer 1:

In the upcoming jParsec 2.2 release, the API makes it clearer what Terminals does: http://jparsec.github.io/jparsec/apidocs/org/codehaus/jparsec/Terminals.Builder.html

You cannot even define your keywords without first providing a scanner that defines "words".

The implementation first uses the provided word scanner to find all words, and then identifies the special keywords among the scanned words.

So, why does it do it this way?

  1. If you didn't need case insensitivity, you could have passed the keywords as "operators". Yes, you read that right. One can equally use Terminals.token(op) or Terminals.token(keyword) to get the token-level parser for them. What distinguishes operators from keywords is just that keywords are "special" words. Whether they happen to be alphabetic characters or other printable characters is just a convention.
  2. Another way to do it is to define your word scanner precisely as Parsers.or(Scanners.string("keyword1"), Scanners.string("keyword2"), ...). Then Terminals won't try to tokenize anything else (see the sketch after this list).
  3. The above assumes that you want to do the two-phase parsing. But that's optional. Your test shows that you weren't feeding the tokenizer to a token-level parser using Parser.from(tokenizer, delim). If two-phase parsing isn't needed, it can be as simple as: or(stringCaseInsensitive("true"), stringCaseInsensitive("false"))
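
Here is a minimal sketch of point 2. It assumes the jParsec 2.x overload Terminals.caseInsensitive(Parser<String> wordScanner, String[] ops, String[] keywords), which takes an explicit word scanner instead of the default identifier scanner; check the javadoc of your version for the exact signature.

Parser<String> wordScanner =
        Scanners.stringCaseInsensitive("true")
                .or(Scanners.stringCaseInsensitive("false"))
                .source();   // the word scanner must yield the matched text

Terminals terminals = Terminals.caseInsensitive(
        wordScanner, new String[] {}, new String[] { "true", "false" });

// "d" is no longer a word at all, so tokenizing it fails instead of
// producing an IDENTIFIER fragment:
terminals.tokenizer().parse("true");   // succeeds
terminals.tokenizer().parse("d");      // throws ParserException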

More on point 3. The two-phase parsing creates a few extra caveats in jParsec that you don't find in other parser combinator libraries like Haskell's Parsec. In Haskell, a string is no different from a list of characters, so there really isn't anything to gain by special-casing them. many (char 'x') parses a string just fine.

In Java, String isn't a List or array of char. It would be very inefficient to take the same approach and box each character into a Character object just so that the character-level and token-level parsers could be unified seamlessly.

Now that explains why we have character-level parsers at all. But it's completely optional to use the token-level parsers (by that, I mean Terminals, Parser.from(), Parser.lexer(), etc.).
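
To make that concrete, here is a hedged sketch of that optional token-level wiring, using only the calls mentioned above (terminals.tokenizer(), terminals.token(...), Parser.from(tokenizer, delim)); the whitespace delimiter and variable names are just illustrative:

// Types involved: org.codehaus.jparsec.Token, java.util.List
Terminals terminals = Terminals.caseInsensitive(
        new String[] {}, new String[] { "true", "false" });

Parser<?> tokenizer = terminals.tokenizer();                   // phase 1: lexing
Parser<Token> booleanToken = terminals.token("true", "false"); // phase 2: grammar over tokens

// Wire the two phases together; whitespace between tokens is skipped.
Parser<List<Token>> booleans =
        booleanToken.many1().from(tokenizer, Scanners.WHITESPACES.skipMany());

booleans.parse("true FALSE true");   // a list of three RESERVED tokens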

You could create a fully functional parser with only character-level parsers, a.k.a. scanners.

For example: Scanners.string("true").or(Scanners.string("false")).sepEndBy1(delim)
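
A slightly fuller (still hedged) version of that one-liner, assuming whitespace as the delimiter and using source() to keep the matched text:

Parser<Void> bool = Scanners.stringCaseInsensitive("true")
        .or(Scanners.stringCaseInsensitive("false"));

Parser<List<String>> bools = bool.source().sepEndBy1(Scanners.WHITESPACES);

System.out.println(bools.parse("true false TRUE"));   // [true, false, TRUE]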

Answer 2:

From the documentation of Terminals#caseInsensitive:

org.codehaus.jparsec.Terminals

public static Terminals caseInsensitive(String[] ops, String[] keywords)

Returns a Terminals object for lexing and parsing the operators with names specified in ops, and for lexing and parsing the keywords case insensitively. Keywords and operators are lexed as Tokens.Fragment with Tokens.Tag.RESERVED tag. Words that are not among keywords are lexed as Fragment with Tokens.Tag.IDENTIFIER tag. A word is defined as an alphanumeric string that starts with [_a-zA-Z], with 0 or more [0-9_a-zA-Z] following.

Actually, the result returned by your parser is a Fragment object that is tagged according to its type. In your case, d is tagged as IDENTIFIER, which is expected.
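
A quick way to see those tags (a sketch, assuming Tokens.Fragment exposes text() and tag() as in the jParsec 2.x API):

Terminals terminals = Terminals.caseInsensitive(
        new String[] {}, new String[] { "true", "false" });

Fragment d = (Fragment) terminals.tokenizer().parse("d");
System.out.println(d.text() + " -> " + d.tag());       // d -> IDENTIFIER

Fragment t = (Fragment) terminals.tokenizer().parse("true");
System.out.println(t.text() + " -> " + t.tag());       // true -> RESERVED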

It is not clear to me what you want to achieve, though. Could you please provide a test case?

Answer 3:

http://blog.csdn.net/efijki/article/details/46975979

The above blog post explains how to define your own tag. I know it's in Chinese, but you just need to look at the code, especially the withTag() and patchTag() parts.