How to define a syntax that uses multiple dashes for nested "if" instructions?

89 Views Asked by At

I'm trying to create a Syntax Analyzer in Java (using CUP) that could recognise this piece of code:

if ¿b? then
~ a = 2;
~ if ¿b && c? then
~ ~ a = 3;
else
~ a = 4;

My productions used by the "if" statement are these that follow:

Instr ::= ...
       | IF CONOP Exp:e CONCL THEN CondInstrList:l
       ...
       ;
...
CondInstrList ::= CondInstrList CondInstr
       | /*empty*/
       ;
...
CondInstr ::= CONTROLD Instr
       | CONTROLD CondInstr
       ;

where Instr stands for instruction/statement, CondInstrList stands for Conditional Instruction List and CONTROLD stands for Control Dash (~). (CONOP and CONCL mean Condition Open/Close)

The problem is that with that grammar, the generated AST is as follows:

if
|-condition b
|-condInstrListT
  |---asig a = 2
  |---if
      |---condition b and c
      |---condInstrListT 
      |   |---asig a = 2
      |---condInstrListF
          |---asig a = 4

and so, the "else" part is associated with the inner "if".

I just don't know how to write a grammar that respects the way I want my language to be.

Any help is appreciated.

I can give more detail if needed.

1

There are 1 best solutions below

3
On BEST ANSWER

I don't think you can do what you have in mind through the grammar alone. But it is possible with a slightly different grammar and some help from your lexical analyzer.

Here's what to do: rather than treat the ~ marks as individual grammar symbols, have the lexical analyzer turn sequences of ~ at the start of a line into INDENT and OUTDENT tokens that will work in your grammar the same way that { and } work in Java. You keep track of a "current indent level", which starts at zero. At the start of each line, count the ~ characters. For each ~ in excess of the current indent level, generate an INDENT token and increase the current indent level; for each ~ less than the current indent level, generate an OUTDENT token and decrease the current indent level.

So your example text of

if ¿b? then
~ a = 2;
~ if ¿b && c? then
~ ~ a = 3;
else
~ a = 4;

would be tokenized as:

// Indent level = 0 and no ~, so no INDENT here
[IF] [CONOP] [ID b] [CONCL] [THEN]
// Indent level = 0, one ~, so one INDENT
[INDENT]
    // Indent level = 1
    [ID a] [OP =] [CONST 2] [SEMICOLON]
    // Indent level = 1, one ~, so no INDENT here
    [IF] [CONOP] [ID b] [OP &&] [ID c] [CONCL] [THEN]
    // Indent level = 1, two ~, so one INDENT
    [INDENT]
        // Indent level = 2
        [ID a] [ASSIGN] [CONST 3] [SEMICOLON]
        // Indent level = 2, lines starts with no ~, two OUTDENTs
    [OUTDENT]
    // Indent level = 1
[OUTDENT]
//Indent level = 0
[ELSE] // No ~ at start of this line, so no INDENT
// Indent level = 0; one ~, so one INDENT
[INDENT] 
    // Indent level = 1
    [ID a] [ASSIGN] [CONST 4] [SEMICOLON]
// End-of-input.  Indent level = 1, so 1 OUTDENT
[OUTDENT]
// Done; indent level = 0;

The INDENT and OUTDENT tokens would act in your grammar like left- and right-braces do in Java, so your grammar could look something like:

Instr ::= ...
       | IF CONOP Exp:e CONCL THEN INDENT CondInstrList:l OUTDENT
       ...
       ;
...
CondInstrList ::= CondInstrList Instr
       | /*empty*/
       ;
...

The language Python does the same thing but with just white space instead of ~. You can download Python source here if you're interested. Look for the files Grammar\Grammar and Parser\tokenizer.c.