ANTLR White Space Question (and not the typical one)

6.9k Views Asked by At

Consider this short SmallC program:

#include "lib"
main() {
    int bob;
}

My ANTLR grammar picks it up fine if I specify, in ANTLWorks and when using the Interpreter, line endings -> "Mac (CR)". If I set the line endings option to Unix (LF), the grammar throws a NoViableAltException and does not recognize anything after the end of the include statement. This error disappears if I add a newline at the end of include. The computer I'm using for this is a Mac, so I figured that it made sense to have to set the line endings to Mac format. So instead, I switch to a Linux box - and get the same thing. If I type anything in the ANTLRWorks Interpreter box, and if I don't select line endings Mac (CR), I get issues about insufficient blank lines as was the case above and, in addition, the last statement of each statement block requires an extra space following the semicolon (ie. after bob; above).

These bugs show up again when I run a Java version of my grammar on a code input file that I want to parse...

What could possibly be the issue? I'd understand if the issue was the presence of TOO many new lines, in a format that perhaps the parser didn't understand / weren't caught by my whitespace rule. But in this case, it's an issue of lacking new lines.

My white space declaration is as follows:

WS      :   ( '\t' | ' ' | '\r' | '\n' )+   { $channel = HIDDEN; } ;

Alternatively, could this be due to an ambiguity issue?

Here is the full grammar file (feel free to ignore the first few blocks, which override ANTLR's default error handling mechanisms:

grammar SmallC;

options {
    output = AST ;  // Set output mode to AST
}

tokens {
    DIV = '/' ;
    MINUS   = '-' ;
    MOD = '%' ;
    MULT    = '*' ;
    PLUS    = '+' ;
    RETURN  = 'return' ;
    WHILE   = 'while' ;

    // The following are empty tokens used in AST generation
    ARGS ;
    CHAR ;
    DECLS ;
    ELSE ;
    EXPR ;
    IF ;
    INT ;
    INCLUDES ;
    MAIN ;
    PROCEDURES ;
    PROGRAM ;
    RETURNTYPE ;
    STMTS ;
    TYPEIDENT ;
}

@members { 
// Force error throwing, and make sure we don't try to recover from invalid input.
// The exceptions are handled in the FrontEnd class, and gracefully end the
// compilation routine after displaying an error message.
protected void mismatch(IntStream input, int ttype, BitSet follow) throws RecognitionException {
    throw new MismatchedTokenException(ttype, input);
} 
public Object recoverFromMismatchedSet(IntStream input, RecognitionException e, BitSet follow)throws RecognitionException {
    throw e;
}
protected Object recoverFromMismatchedToken(IntStream input, int ttype, BitSet follow) throws RecognitionException {
     throw new MissingTokenException(ttype, input, null);
}

// We override getErrorMessage() to include information about the specific
// grammar rule in which the error happened, using a stack of nested rules.
Stack paraphrases = new Stack();
public String getErrorMessage(RecognitionException e, String[] tokenNames) {
    String msg = super.getErrorMessage(e, tokenNames);
    if ( paraphrases.size()>0 ) {
        String paraphrase = (String)paraphrases.peek();
        msg = msg+" "+paraphrase;
    }
    return msg;
}

// We override displayRecognitionError() to specify a clearer error message,
// and to include the error type (ie. class of the exception that was thrown)
// for the user's reference. The idea here is to come as close as possible
// to Java's exception output.
public void displayRecognitionError(String[] tokenNames, RecognitionException e)
{
    String exType;
    String hdr;
    if (e instanceof UnwantedTokenException) {
        exType = "UnwantedTokenException";
    } else if (e instanceof MissingTokenException) {
        exType = "MissingTokenException";
    } else if (e instanceof MismatchedTokenException) {
        exType = "MismatchedTokenException";
    } else if (e instanceof MismatchedTreeNodeException) {
        exType = "MismatchedTreeNodeException";
    } else if (e instanceof NoViableAltException) {
        exType = "NoViableAltException";
    } else if (e instanceof EarlyExitException) {
        exType = "EarlyExitException";
    } else if (e instanceof MismatchedSetException) {
        exType = "MismatchedSetException";
    } else if (e instanceof MismatchedNotSetException) {
        exType = "MismatchedNotSetException";
    } else if (e instanceof FailedPredicateException) {
        exType = "FailedPredicateException";
    } else {
        exType = "Unknown";
    }

    if ( getSourceName()!=null ) {
        hdr = "Exception of type " + exType + " encountered in " + getSourceName() + " at line " + e.line + ", char " + e.charPositionInLine + ": "; 
    } else {
        hdr = "Exception of type " + exType + " encountered at line " + e.line + ", char " + e.charPositionInLine + ": "; 
    }
    String msg = getErrorMessage(e, tokenNames);
    emitErrorMessage(hdr + msg + ".");
}
}

// Force the parser not to try to guess tokens and resume on faulty input,
// but rather display the error, and throw an exception for the program
// to quit gracefully.
@rulecatch {
catch (RecognitionException e) {
    reportError(e);
    throw e;
} 
}

/*------------------------------------------------------------------
 * PARSER RULES
 *
 * Many of these make use of ANTLR's rewrite rules to allow us to
 * specify the roots of AST sub-trees, and to allow us to do away
 * with certain insignificant literals (like parantheses and commas
 * in lists) and to add empty tokens to disambiguate the tree 
 * construction
 *
 * The @init and @after definitions populate the paraphrase
 * stack to allow us to specify which grammar rule we are in when
 * errors are found.
 *------------------------------------------------------------------*/

args
@init { paraphrases.push("in these procedure arguments"); }
@after { paraphrases.pop(); }
        :   ( typeident ( ',' typeident )* )?   ->  ^( ARGS ( typeident ( typeident )* )? )? ;

body
@init { paraphrases.push("in this procedure body"); }
@after { paraphrases.pop(); }
        :   '{'! decls stmtlist '}'! ;

decls
@init { paraphrases.push("in these declarations"); }
@after { paraphrases.pop(); }
        :   ( typeident ';' )*  ->  ^( DECLS ( typeident )* )? ;

exp
@init { paraphrases.push("in this expression"); }
@after { paraphrases.pop(); }
        :   lexp ( ( '>' | '<' | '>=' | '<=' | '!=' | '==' )^ lexp )? ;

factor      :   '(' lexp ')'
        |   ( MINUS )? ( IDENT | NUMBER ) 
        |   CHARACTER
        |   IDENT '(' ( IDENT ( ',' IDENT )* )? ')' ;

lexp        :   term ( ( PLUS | MINUS )^ term )* ;

includes
@init { paraphrases.push("in the include statements"); }
@after { paraphrases.pop(); }
        :   ( '#include' STRING )*  ->  ^( INCLUDES ( STRING )* )? ;

main    
@init { paraphrases.push("in the main method"); }
@after { paraphrases.pop(); }
        :   'main' '(' ')' body ->  ^( MAIN body ) ;

procedure
@init { paraphrases.push("in this procedure"); }
@after { paraphrases.pop(); }
        :   ( proc_return_char | proc_return_int )? IDENT^ '('! args ')'! body ;

procedures  :   ( procedure )*  ->  ^( PROCEDURES ( procedure)* )? ;

proc_return_char
        :   'char'  ->  ^( RETURNTYPE CHAR ) ;

proc_return_int :   'int'   ->  ^( RETURNTYPE INT ) ;

// We hard-code the regex (\n)* to fix a bug whereby a program would be accepted
// if it had 0 or more than 1 new lines before EOF but not if it had exactly 1,
// and not if it had 0 new lines between components of the following rule.
program     :   includes decls procedures main EOF ;

stmt
@init { paraphrases.push("in this statement"); }
@after { paraphrases.pop(); }
        :   '{'! stmtlist '}'!
        |   WHILE '(' exp ')' s=stmt    ->  ^( WHILE ^( EXPR exp ) $s )
        |   'if' '(' exp ')' s=stmt ( options {greedy=true;} : 'else' s2=stmt )?    ->  ^( IF ^( EXPR exp ) $s ^( ELSE $s2 )? )
        |   IDENT '='^ lexp ';'! 
        |   ( 'read' | 'output' | 'readc' | 'outputc' )^ '('! IDENT ')'! ';'!
        |   'print'^ '('! STRING ( options {greedy=true;} : ')'! ';'! )
        |   RETURN ( lexp )? ';'    ->  ^( RETURN ( lexp )? ) 
        |   IDENT^ '('! ( IDENT ( ','! IDENT )* )? ')'! ';'!;

stmtlist    :   ( stmt )*   ->  ^( STMTS ( stmt )* )? ;

term        :   factor ( ( MULT | DIV | MOD )^ factor )* ;

// We divide typeident into two grammar rules depending on whether the
// ident is of type 'char' or 'int', to allow us to implement different
// rewrite rules in each case.
typeident   :   typeident_char | typeident_int ;

typeident_char  :   'char' s2=IDENT ->  ^( CHAR $s2 ) ;

typeident_int   :   'int' s2=IDENT  ->  ^( INT $s2 ) ;

/*------------------------------------------------------------------
 * LEXER RULES
 *------------------------------------------------------------------*/

// Must come before CHARACTER to avoid ambiguity ('i' matches both IDENT and CHARACTER)
IDENT       :   ( LCASE_ALPHA | UCASE_ALPHA | '_' ) ( LCASE_ALPHA | UCASE_ALPHA | DIGIT | '_' )* ;

CHARACTER   :   PRINTABLE_CHAR
        |   '\n' | '\t' | EOF ;

NUMBER      :   ( DIGIT )+ ;

STRING      :   '\"' ( ~( '"' | '\n' | '\r' | 't' ) )* '\"' ;

WS      :   ( '\t' | ' ' | '\r' | '\n' | '\u000C' )+    { $channel = HIDDEN; } ;

fragment 
DIGIT       :   '0'..'9' ;

fragment
LCASE_ALPHA :   'a'..'z' ;

fragment
NONALPHA_CHAR   :   '`' | '~' | '!' | '@' | '#' | '$' | '%' | '^' | '&' | '*' | '(' | ')' | '-'
        |   '_' | '+' | '=' | '{' | '[' | '}' | ']' | '|' | '\\' | ';' | ':' | '\''
        |   '\\"' | '<' | ',' | '>' | '.' | '?' | '/' ; 

fragment
PRINTABLE_CHAR  :   LCASE_ALPHA | UCASE_ALPHA | DIGIT | NONALPHA_CHAR ;
fragment
UCASE_ALPHA :   'A'..'Z' ;
1

There are 1 best solutions below

1
On BEST ANSWER

From the command line, I do get a warning:

java -cp antlr-3.2.jar org.antlr.Tool SmallC.g 
warning(200): SmallC.g:182:37: Decision can match input such as "'else'" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input

but that won't stop the lexer/parser from being generated.

Anyway, the problem: ANTLR's lexer tries to match the first lexer rule it encounters in the file, and if it can't match said token, it trickles down to the next lexer rule. Now you have defined the CHARACTER rule before the WS rule, which both match the character \n. That is why it didn't work under Linux since the \n was tokenized as a CHARACTER. If you define the WS rule before the CHARACTER rule, it all works properly:

// other rules ...

WS
  :  ('\t' | ' ' | '\r' | '\n' | '\u000C')+ { $channel = HIDDEN; } 
  ;

CHARACTER   
  :  PRINTABLE_CHAR | '\n' | '\t' | EOF 
  ;

// other rules ...

Running the test class:

import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;

public class Main {
    public static void main(String[] args) throws Exception {
        String source = 
                "#include \"lib\"\n" + 
                "main() {\n" + 
                "   int bob;\n" + 
                "}\n";
        ANTLRStringStream in = new ANTLRStringStream(source);
        SmallCLexer lexer = new SmallCLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        SmallCParser parser = new SmallCParser(tokens);
        SmallCParser.program_return returnValue = parser.program();
        CommonTree tree = (CommonTree)returnValue.getTree();
        DOTTreeGenerator gen = new DOTTreeGenerator();
        StringTemplate st = gen.toDOT(tree);
        System.out.println(st);
    }
}

produces the following AST:

enter image description here

without any error messages.

But you should fix the grammar warning, and remove \n from the CHARACTER rule since it can never be matched in the CHARACTER rule.

One other thing: you've mixed quite a few keywords inside your parser rules without defining them in your lexer rules explicitly. That is tricky because of the first-come-first-serve lexer rules: you don't want 'if' to be accidentally being tokenized as an IDENT. Better do it like this:

IF : 'if';
IDENT : 'a'..'z' ... ; // After the `IF` rule!