JAVACC - Switching Lexical States Based on Context


I'm working on a parser that, based on a specific context, may support different tokens. Here's a simplified example:

<DEFAULT> TOKEN [IGNORE_CASE] : {
    < OPERATION: "op" > : OperationType |
    < OBJ0: "obj0" > : ExtendedContext |
    < OBJ1: "obj1" > : BaseContext 
}

<BaseContext, ExtendedContext> TOKEN [IGNORE_CASE] : {
     < ARG0:  "arg0"  > 
}

<ExtendedContext> TOKEN [IGNORE_CASE] : {
     < ARG1:  "arg1"  > 
}

The problem is that I can reach those contexts from different lexical states. Let's say:

<OperationType> TOKEN :
{
    < MODIFY : "modify" > : BaseContext, ExtendedContext
}

Of course, I understand that I cannot specify both lexical states here, but I would need something similar.


I've attempted to implement a SwitchTo strategy based on the context by defining functions that determine whether the operation belongs to an ExtendedContext or a BaseContext. However, this approach seems to break some functionality, and I'm not sure whether it would work as expected or whether there is a better way to address the issue.

Example of a solution that I tried (it does not work in all scenarios):

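TOKEN_MGR_DECLS : {
    // Lexical state that the context-dependent tokens should switch to.
    int contextLexState = BaseContext;

    // Record the target context and switch to it immediately.
    void moveToContext(int contextLexState) {
        setLexStateContextNoSwitch(contextLexState);
        switchToContext();
    }

    // Switch to the most recently recorded context.
    void switchToContext() {
        SwitchTo(contextLexState);
    }

    // Record the target context without changing the current lexical state.
    void setLexStateContextNoSwitch(int contextLexState) {
        this.contextLexState = contextLexState;
    }
}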

<DEFAULT> TOKEN [IGNORE_CASE] : {
    < OPERATION: "op" > : OperationType |
    < OBJ0: "obj0" > { moveToContext(ExtendedContext); } |
    < OBJ1: "obj1" > { moveToContext(BaseContext); }
}

<OperationType> TOKEN :
{
    < MODIFY : "modify" > { switchToContext(); }
}

The parser should correctly parse something like:

op modify obj0 arg0

op modify obj1 arg1

obj1 arg0

obj0 arg0 ...

But not those:

op modify obj0 arg1

obj0 arg1

Since arg1 belongs only to the extended context.

Any help would be useful! Thanks.


There are 2 best solutions below

Jonathan Revusky On BEST ANSWER

Legacy JavaCC really does not deal with this problem well. The main reason it tends not to work is a longstanding problem: LOOKAHEAD does not work properly in conjunction with lexical states.

You really ought to do yourself a favor and consider using CongoCC, which is a much more advanced version of the JavaCC tool. In particular, there are some articles on this whole context-sensitive tokenization problem here, and CongoCC also has a key feature: the ability to turn tokens on and off in a given context. See here. If you have any further questions, you might consider asking them here.

Theodore Norvell On

I think Maurice asks the right question here. Lexical states are useful when you need the lexer to treat the same character sequences differently in different contexts. E.g. in C the sequence "a = b + c" should be treated one way by the lexer if it's inside a comment and another way if it's not. So it might make sense to use lexical states to deal with comments in C.
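As a rough illustration of that comment case, a commonly used JavaCC pattern looks something like the sketch below. The state name WithinComment is made up for this example; any identifier would do.

```
// "/*" is discarded and the lexer moves into a dedicated state.
SKIP : { "/*" : WithinComment }

// Inside the comment every single character is accepted and accumulated,
// so "a = b + c" is never split into identifier and operator tokens.
<WithinComment> MORE : { < ~[] > }

// "*/" discards the accumulated text and returns to the DEFAULT state.
<WithinComment> SKIP : { "*/" : DEFAULT }
```

Here the same characters really do need two different treatments, which is what a lexical state provides.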

In your example I don't see a need for lexical states. You could just have one state and handle everything in the grammar (BNF rules):

  S --> [ <OP> <MODIFY> ] (<OBJ0> Base | <OBJ1> Extended )
  Base --> <ARG0>
  Extended --> Base | <ARG1>
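Written out as a JavaCC grammar, the rules above might look roughly like the following sketch. It assumes a parser class named ContextParser (a made-up name), whitespace-separated input, one command per parse, and the token name OP used in the rules above rather than OPERATION from the question. It also follows the rules above in treating obj0 as the base object and obj1 as the extended one.

```
options {
    STATIC = false;
    IGNORE_CASE = true;   // replaces the per-block [IGNORE_CASE]
}

PARSER_BEGIN(ContextParser)
public class ContextParser {}
PARSER_END(ContextParser)

SKIP : { " " | "\t" | "\r" | "\n" }

// A single lexical state (DEFAULT): every keyword is always a token.
TOKEN : {
    < OP: "op" >
  | < MODIFY: "modify" >
  | < OBJ0: "obj0" >
  | < OBJ1: "obj1" >
  | < ARG0: "arg0" >
  | < ARG1: "arg1" >
}

// S --> [ <OP> <MODIFY> ] ( <OBJ0> Base | <OBJ1> Extended )
void S() : {}
{
    [ <OP> <MODIFY> ]
    ( <OBJ0> Base() | <OBJ1> Extended() )
    <EOF>
}

// Base --> <ARG0>
void Base() : {}
{
    <ARG0>
}

// Extended --> Base | <ARG1>
void Extended() : {}
{
    Base() | <ARG1>
}
```

With this sketch the lexer always produces ARG1 for "arg1", and it is the parser that rejects it after obj0: calling S() on "obj0 arg1" should end in a ParseException, while "obj1 arg1" and "op modify obj0 arg0" are accepted.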