Why does this simple jparsec lexer fail?


I want to write a simple lexer that recognises words (letters only, no digits) and numbers, ignoring whitespace.

I wrote the following code using jparsec v3.0:

final Parser<String> words = Patterns.isChar(CharPredicates.IS_ALPHA).many1().toScanner("word").source();
final Parser<String> nums = Patterns.isChar(CharPredicates.IS_DIGIT).many1().toScanner("num").source();
final Parser<Tokens.Fragment> tokenizer = Parsers.or(
        words.map(it -> Tokens.fragment(it, "WORD")),
        nums.map(it -> Tokens.fragment(it, "NUM")));
final Parser<List<Token>> lexer = tokenizer.lexer(Scanners.WHITESPACES);

But the following test fails with the exception org.jparsec.error.ParserException: line 1, column 7: EOF expected, 1 encountered. By contrast, parsing the string "abc cd 123" succeeds.

final List<Token> got = lexer.parse("abc cd123");
final List<Token> expected = Arrays.asList(
        new Token(0, 3, Tokens.fragment("abc", "WORD")),
        new Token(4, 2, Tokens.fragment("cd", "WORD")),
        new Token(6, 3, Tokens.fragment("123", "NUM")));
assertEquals(expected, got);

In your opinion, what is wrong?

2 Answers

Accepted answer

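Scanners.WHITESPACES needs at least one whitespace character, so the lexer stops after "cd" when "123" follows it directly. The problem was solved simply by making the delimiter optional: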

tokenizer.lexer(Scanners.WHITESPACES.optional(null))

Another answer

The following test passes:

public class SOTest {
  final Parser<String> words = Patterns.isChar(CharPredicates.IS_ALPHA).many1().toScanner("word").source();
  final Parser<String> nums = Patterns.isChar(CharPredicates.IS_DIGIT).many1().toScanner("num").source();
  final Parser<Tokens.Fragment> tokenizer = Parsers.or(
    words.map(it -> Tokens.fragment(it, "WORD")),
    nums.map(it -> Tokens.fragment(it, "NUM")));
  final Parser<List<Token>> lexer = tokenizer.lexer(Scanners.WHITESPACES);


  @Test public void test(){
    final List<Token> got = lexer.parse("abc cd 123");
    Asserts.assertArrayEquals(got.toArray(new Token[0]),
      new Token(0, 3, Tokens.fragment("abc", "WORD")),
      new Token(4, 2, Tokens.fragment("cd", "WORD")),
      new Token(7, 3, Tokens.fragment("123", "NUM")));
  }
}

Your tokens consist either only of ALPHA characters or only of DIGITS, so it is expected that "abc cd123" cannot be parsed.

The documentation's statement that "delimiters are ignored before or after each occurrence" should be read as meaning that delimiters appearing before or after the list of parsed tokens are ignored. Between tokens, however, the delimiter is still required as a separator, except in the case of operators (see the Terminals class for more info).
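
To make the delimiter behaviour concrete, here is a minimal plain-Java sketch. It is not jparsec's implementation: the class DelimiterSketch, the helper run and the method lex are hypothetical names that only imitate "token, then delimiter, then token" lexing. With the delimiter required, "abc cd123" stops after "cd" and reports the same position as the exception in the question; with the delimiter optional, "123" becomes a third token at offset 6.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical illustration only; this is NOT how jparsec is implemented.
final class DelimiterSketch {

    // Returns the index just past the run of characters matching p, starting at pos.
    static int run(String s, int pos, IntPredicate p) {
        int end = pos;
        while (end < s.length() && p.test(s.charAt(end))) {
            end++;
        }
        return end;
    }

    // Lexes words and numbers. When delimiterRequired is true, at least one
    // whitespace character must follow a token before the next token is tried.
    static List<String> lex(String input, boolean delimiterRequired) {
        List<String> tokens = new ArrayList<>();
        // Leading whitespace is skipped in both modes.
        int pos = run(input, 0, Character::isWhitespace);
        while (pos < input.length()) {
            String kind = "WORD";
            int end = run(input, pos, Character::isLetter);
            if (end == pos) {
                kind = "NUM";
                end = run(input, pos, Character::isDigit);
            }
            if (end == pos) {
                break; // neither a word nor a number starts here
            }
            tokens.add(kind + ":" + input.substring(pos, end) + "@" + pos);
            int afterDelim = run(input, end, Character::isWhitespace);
            if (delimiterRequired && afterDelim == end) {
                pos = end; // no separator found, so no further token is attempted
                break;
            }
            pos = afterDelim;
        }
        if (pos < input.length()) {
            // Message format modelled on the error quoted in the question.
            throw new IllegalArgumentException("line 1, column " + (pos + 1)
                    + ": EOF expected, " + input.charAt(pos) + " encountered.");
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Optional delimiter: three tokens, the last one starting at offset 6.
        System.out.println(lex("abc cd123", false));
        // Required delimiter: fails at column 7, where '1' follows "cd" directly.
        try {
            lex("abc cd123", true);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the only difference between the two modes is whether a missing separator ends the token loop, which mirrors the effect of passing Scanners.WHITESPACES versus Scanners.WHITESPACES.optional(null) to lexer(...).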