ANTLR 4 lexer rules not recognized in combined grammar


So I'm working on a combined grammar in ANTLR4 using ANTLRWorks 2.1. I have the lexer rules Identifier and Block that are not being recognized as defined lexer rules, but only in the last parser rule defined. Adding a literal after these rules removes (or hides) these errors.

My grammar, with the error at the end (the tokens wrapped in underscores in the last rule are the ones throwing the error):

grammar GCombined;

options { language = Cpp; }

@lexer::namespace{AntlrTest01}
@parser::namespace{AntlrTest01}

    /* First Lexer Stage */

Bit: '0' | '1';
Digit : '0'..'9';
ODigit: '0'..'7';
XDigit: '0'..'f';
Letter: ('a'..'z') | ('A'..'Z');
Symbol: '|'
      | '-'
      | '!'
      | '#'
      | '$'
      | '%'
      | '&'
      | '('
      | ')'
      | '*'
      | '+'
      | ','
      | '-'
      | '.'
      | '/'
      | ':'
      | ';'
      | '<'
      | '='
      | '>'
      | '?'
      | '@'
      | '['
      | ']'
      | '^'
      | '_'
      | '`'
      | '{'
      | '|'
      | '}'
      | '~';
WSpace: ( ' '
        | '\t'
        | '\r'
        | '\n'
        | '\c'
        | '\0'
        | '\u000C'
        )+ -> skip;

DNumber: Digit+;
ONumber: '0o' Digit+;
XNumber: '0x' Digit;
Integer: DNumber
       | ONumber
       | XNumber;
Float: DNumber '.' DNumber;

Character: Letter
         | Digit
         | Symbol
         | WSpace;
String: Character+;
Literal: '"' String '"';

Boolean: 'true' | 'false';

    /* Second Lexer Stage */

Number: Integer | Float;
Identifier: Letter (Letter | Digit | '_')+;
Keyword: Letter+;
Operator: '+'
        | '-'
        | '*'
        | '/'
        | '%'
        | '=='
        | '!='
        | '>'
        | '<'
        | '>='
        | '<='
        | '&&'
        | '||'
        | '^'
        | '&'
        | '|'
        | '<<'
        | '>>'
        | '~' ;

Expression: (Operator | Identifier) 
        '(' (Identifier | Number)+ ')';
Parameter: Identifier
         | Expression
         | Number;
Statement: Keyword '(' Parameter+ ')';
Block: '{' Statement+ '}';

    /* Third Lexer Stage */

Add: '+';
Sub: '-';
Mlt: '*';
Div: '/';
Mod: '%';
Mathop: Add | Sub | Mlt | Div | Mod;

Deq: '==';
Neq: '!=';
Gtr: '>';
Lss: '<';
Geq: '>=';
Leq: '<=';
Condop: Deq | Neq | Gtr | Lss | Geq | Leq;

And: '&&';
Or: '||';
Xor: '^';
Bnd: '&';
Bor: '|';
Logop: And | Or | Xor | Bnd | Bor;

Neg: '!';
Boc: '~';
Negop: Neg | Boc;

Asl: '<<';
Asr: '>>';
Shftop: Asl | Asr;

Eql: '=';

Inc: '++';
Dec: '--';
Incop: Inc | Dec;

Peq: '+=';
Meq: '-=';
Teq: '*=';
Seq: '/=';
Req: '%=';
Casop: Peq | Meq | Teq | Seq | Req;

Lparen: '(';
Rparen: ')';
Lbrack: '[';
Rbrack: ']';
Lbrace: '{';
Rbrace: '}';
Point : '.';
Colon : ':';

Numvar: Number 
      | Identifier 
      | Mathop '(' Parameter+ ')';
Boolvar: Boolean
       | Identifier
       | Condop '(' Parameter+ ')'
       | Logop '(' Parameter+ ')';
Metaxpr: (Identifier | Operator ) '(' Parameter+ ')';

    /* First Parser Stage */

    //expressions

add: '+' '(' Numvar+ ')';
sub: '-' '(' Numvar+ ')';
mlt: '*' '(' Numvar+ ')';
div: '/' '(' Numvar+ ')';
mod: '%' '(' Integer+ ')';
mathexpr: add
        | sub
        | mlt
        | div
        | mod;

eql: '==' '(' Parameter+ ')';
neq: '!=' '(' Parameter+ ')';
gtr: '>' '(' Parameter+ ')';
les: '<' '(' Parameter+ ')';
geq: '>=' '(' Parameter+ ')';
leq: '<=' '(' Parameter+ ')';
condexpr: eql
        | neq
        | gtr
        | les
        | geq
        | leq;

and: '&&' '(' Parameter+ ')';
or : '||' '(' Parameter+ ')';
xor: '^' '(' Parameter+ ')';
bnd: '&' '(' Parameter+ ')';
bor: '|' '(' Parameter+ ')';
logexpr: and
       | or
       | xor
       | bnd
       | bor;

asl: '<<' '(' Parameter Numvar ')';
asr: '>>' '(' Parameter Numvar ')';
shiftexpr: asl | asr;

neg: '!' '(' Parameter ')';
boc: '~' '(' Parameter ')';
negexpr: neg
       | boc;

arrexpr: Identifier '[' Numvar ']';

    //instruction forms

vardec: 'def' '(' Identifier+ ')' ': ' Identifier ;
lindec: Identifier '(' Identifier ')';
assign: '=' '(' (Identifier | lindec) Parameter ')';

incstmt: (Inc | Dec) '(' Identifier ')'
       | Casop '(' Identifier Identifier ')';

cond: 'if' '(' Boolvar ')' Block
    ('else if' '(' Boolvar ')' Block)?
    ('else' Block)?;

loop: (
      ('while' '(' (condexpr | negexpr) ')')
    | ('for' '(' assign ',' (condexpr | negexpr) ',' incstmt')')
    )  Block;

fundef: 'func' '(' Identifier Parameter+ ')' ': ' Identifier Block;
prodef: 'proc' '(' Identifier Parameter* ')' Block;
call: Identifier '(' Parameter+ ')';

excHandler: 'try' Block
            'catch' '(' Identifier ')' Block
           ('finally' Block)?;

classdef: 'class' '(' Identifier ')' (': ' _Identifier_)? _Block_;

There is 1 best solution below


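ANTLR requires unambiguous grammar rules. In the provided grammar, the Symbol rule conflicts with the Operator rule and others, and the Identifier and Letter rules conflict as well. Lexer rules conflict when they can match the same input (same content and same length); in that case the lexer emits the token for the rule defined first, so the later rule never produces a token.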

Also, for example, the Symbol rule includes '{' as an alternative. Subsequent rules that use the literal '{' (which is an implicit token type) in any of their alternatives will not match, because the implicit token type is not the same as the Symbol token type. Best practice is to avoid redundant use of literals: define each literal in one rule, and then just reference that rule everywhere else.
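
As a rough sketch of that advice, and not a drop-in replacement for the grammar in the question, the fragment below defines each literal in exactly one lexer rule, turns the character-class helpers into fragments so they cannot compete with Identifier, and moves the structural rules (block, statement) into the parser. The grammar name GSketch and the token names CLASS, LPAREN, RPAREN, LBRACE, RBRACE and COLON are made up for illustration. It also assumes that a statement starts with an Identifier rather than a separate Keyword token, and that single-letter identifiers are allowed, which differs from the original Identifier rule.

```antlr
grammar GSketch;   // hypothetical name, for illustration only

// Parser rules (lowercase): nesting and structure belong here, not in the lexer.
classdef  : CLASS LPAREN Identifier RPAREN (COLON Identifier)? block ;
block     : LBRACE statement+ RBRACE ;
statement : Identifier LPAREN parameter+ RPAREN ;   // assumption: no separate Keyword token
parameter : Identifier | Number ;

// Lexer rules: each literal is defined exactly once and referenced by name.
// CLASS is listed before Identifier so that 'class' wins the equal-length tie.
CLASS  : 'class' ;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACE : '{' ;
RBRACE : '}' ;
COLON  : ':' ;

Number     : Digit+ ;
Identifier : Letter (Letter | Digit | '_')* ;   // assumption: '*' so one-letter names lex
WSpace     : [ \t\r\n\u000C]+ -> skip ;

// Fragments are helpers only. They never become tokens on their own,
// so they cannot shadow Identifier or Number.
fragment Letter : [a-zA-Z] ;
fragment Digit  : [0-9] ;
```

With this layout there is no catch-all Symbol token, so '{' can only ever be lexed as LBRACE, and the parser rules refer to that one token type.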

Best advice would be to buy a copy of The Definitive ANTLR 4 Reference (TDAR) to learn ANTLR 4.