What is the magic behind WS in ANTLR?


I'm trying to make a tool like ANTLR from scratch in Swift (just for fun), but I don't understand how the grammar knows that there must be no whitespace inside a token (identifier example: "_myIdentifier123"):

Identifier
 : Identifier_head Identifier_characters?
 ;

And how it knows that whitespace is required between tokens (example: "is String"):

type_casting_operator
  : 'is' type
  | 'as' type
  | 'as' '?' type
  | 'as' '!' type
  ;

I've searched for WS in ANTLR's source code, but found nothing. There is no "WS" string in the Java code: https://github.com/antlr/antlr4

Can anyone explain the algorithm behind this? How does it decide whether tokens are separated by whitespace or not?


Answered by Henry:

The first rule is a lexer rule (note the capital first letter), while the second rule is a parser rule.

Whitespace tokens are typically not passed on to the parser (so there must be a rule in the lexer that skips whitespace), and the second rule therefore never sees them. Whitespace can then appear anywhere between other tokens.
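Such a skip rule commonly looks like this (the rule name and the exact character set vary from grammar to grammar; this is a typical sketch):

```antlr
// Match one or more spaces, tabs, or line breaks and discard the
// token entirely, so the parser never sees whitespace.
WS : [ \t\r\n]+ -> skip ;
```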

Lexer rules, in contrast, see all characters from the input, so any whitespace must be matched explicitly.
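As an illustration (a hypothetical rule, not taken from the Swift grammar): a string-literal lexer rule happily consumes spaces, because at the lexer level whitespace is just another input character:

```antlr
// Matches "hello world" including the space: inside a lexer rule,
// a space is matched like any other character in the negated set.
STRING : '"' (~["\r\n])* '"' ;
```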

Answered by Mike Lischke:

Good luck with that project. Without knowing even the most basic algorithms, the non-trivial task of creating a parser generator becomes even more ambitious. You should at least read a book or two on the matter (a classic is the Dragon Book by Aho, Sethi, and Ullman).

But back to your question. The principle is this: whitespace needs to be handled like any other input, but usually you will find a WS or Whitespace lexer rule in the grammar which matches the various kinds of whitespace (spaces, line breaks, tabs, etc.) and puts them on a hidden channel. The parser only sees tokens from the default channel and hence never receives the whitespace tokens. This is the most common approach, because the existence of whitespace usually doesn't matter (except to separate two lexical entries that must be recognized as two different tokens).
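The hidden-channel variant described above typically looks like this in a grammar (a sketch; grammars that later need access to the whitespace, e.g. for source rewriting or formatting tools, prefer this over `skip`):

```antlr
// Route whitespace to the hidden channel: the token stream still
// contains it, but the parser only reads the default channel.
WS : [ \t\r\n]+ -> channel(HIDDEN) ;
```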