How to fix shift-reduce conflict in this grammar

459 Views Asked by At

The following is a subset of a GOLD parser grammar

<Start> ::= <Expression>

<Expression> ::= <RelationalExpression>

<RelationalExpression> ::= <RelationalExpression> <RelationalOp> <Factor>
                         | <Factor>

<Factor> ::= Id <FactorExtension>
           | <Literal>

<FactorExtension> ::= <FactorMember> <FactorExtension>
                    | <FactorIndexer> <FactorExtension> 
                    | <GenericIndexer> <FactorExtension> // Here is the problem! 
                    |

<FactorMember> ::= '.' Id

<FactorIndexer> ::= '[' <Expression> ']'

<GenericIndexer> ::= '<' <TypeName> '>'

<TypeName> ::= <Id>

<Literal> ::= DecLiteral | HexLiteral | FloatLiteral | RealLiteral | StringLiteral

<RelationalOp> ::= '=' | '<>' | '>' | '>=' | '<' | '<='

<Id> ::= <Id> '.' Id
       | Id

GOLD generates LR(1) parsers, so at this moment I'm about to be convinced that there is no solution for my problem, but I decided to ask here just to be sure.

The language generated by this grammar simply recognizes very simple expressions, like

Instance.Member < 50

However, the idea is that this language can recognize expressions like this

Intance.Member<Integer> < 50

Problem is that the part of the grammar that would recognize <Integer> (named <GenericFactor>) generates a shift-reduce conflict with the relational operator <. I can see why this conflict occurs, what I want is a way of reorganizing the grammar in order to remove it.

Is that possible?

Thanks in advance!

1

There are 1 best solutions below

1
On BEST ANSWER

Basically, that language is going to be difficult to express with an LR(1) grammar.

It's not impossible, but the resulting grammar will not generate the parse trees you would want, so you'll need to post-process the AST. Also, it's really painful to write all the productions, so changing the grammar will be difficult.

Unfortunately, GOLD parser does not provide you with another option, as far as I know. However, bison allows you to generate a GLR parser, which will have no trouble parsing the language since it is unambiguous, and as a bonus you can write your grammar in a convenient and readable format. That's the solution I recommend.

Here's an alternative, apparently used by some Java and/or C# parsers (but not the official ones), based on the following observation:

  • It's always possible to tell whether a > is the end of a generic bracket or a relational operator, by examining the next token. In your grammar, a > close bracket can only be followed by ., [, or the end of input; none of those tokens can follow a relational operator.

  • A < which is a generic bracket must follow an Id, and may only be followed by Ids and periods until the match >.

So when you see a possibly qualifying < (in the lexical scanner), you can start reading ahead. (You might want to cache the tokens in order to avoid rescanning them.) If you encounter any token which cannot be inside a <...> generic selector, then you know the < was a relational operator. Otherwise, you will eventually reach a > and then you can examine the following token; if that token qualifies as following a generic selector, then the < and > can be emitted as the OPEN_GENERIC and CLOSE_GENERIC tokens instead of relational operators.

Once you figure it out, you start playing the pre-read tokens back into the parser.

The precise details can be more or less complicated, depending on your grammar:

  • Most languages with generics allow nested generics: vector<list<int>>. In that case, you need to keep a push-down stack in the prescan, or even use an alternative parser. In fact, what a typename looks like can be quite complicated, but it always follows some sort of grammar.

  • With nested generics and right-shift operators, you might have to also decide whether >> is two close selector brackets or an operator. However, you can still decide based on the following token; if the outer > is a generic selector bracket, the inner one must be, too.

  • C++ (unlike Java or C#) allows non-type template parameters, so it's actually legitimate for a > b to be a (boolean) template parameter. C++ insists that the > be protected by parenthesizing the expression, which is probably a good idea if you're going to go down that route.

  • Generic functions create a particular problem, because they can be followed by a ( (argument list), which can also follow a relational operator. C++ solves this problem by actually knowing whether the potential template function name is a template name or not, which actually generally solves the generic bracket recognition problem because C++ does not allow you to take the address of a templated function without a full template argument list. Consequently, in C++ the < following a templated function name can only be a generic selector bracket. In Java, when you call a templated function, you need to put the template parameters first, in which case the < becomes unambiguous. (But that's pretty ugly, IMHO). In C#, the grammar mavens decided by fiat that if a > which might be a selector bracket is followed by a ( then it will always be interpreted as a selector bracket. I don't program in C#, but it seems to me like the C# equivalent of C++'s most vexing parse, except that it will probably almost always generate some error message.

On the whole, I've always been partial to the solution of not using <...> as selector brackets. Scala uses [...] and D uses !<...>; both of those seem like very reasonable alternatives.