How does LR parsing parse the word "a b" in this grammar: S -> a b | a T ; T -> a


Suppose I have a grammar

S -> a b | a T
T -> a

Clearly, the grammar accepts {aa, ab}. But I am confused how LR parsing parses the word "a b". Naively, it could work like this, by reducing the first "a" to T and then it gets stuck or has to backtrack.

a b => T b => stuck or backtrack

How would an LR parser know that it should NOT reduce the first "a" to T?


There are 2 answers below.

BEST ANSWER

In this case, the parser will never even consider reducing T → a: the reduction item T → a • does not appear in any state the parser passes through while reading "a b". The initial state has the items:

S → • a b
S → • a T

and the only action possible in this state is a shift action with the token a. Since a is, in fact, the next input character, we do a shift transition to a state whose itemset is

S → a • b
S → a • T
T → • a

This state has no reduce actions either, and it has two distinct shift actions: one on b and the other one on a. Since the next input is b, that shift action is taken, leading to a state whose itemset is

S → a b •

in which only a reduction is available.
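If it helps to see the state construction mechanically, here is a minimal Python sketch (my own illustration, not the output of any particular parser generator) that computes the LR(0) closure and goto for this grammar and prints exactly the three itemsets above:

    # Minimal LR(0) itemset construction for:  S -> a b | a T ; T -> a
    # An item is a tuple (lhs, rhs, dot); terminals are lowercase.
    GRAMMAR = {'S': [('a', 'b'), ('a', 'T')], 'T': [('a',)]}
    NONTERMINALS = set(GRAMMAR)

    def closure(items):
        items = set(items)
        changed = True
        while changed:
            changed = False
            for lhs, rhs, dot in list(items):
                # If the dot stands before a nonterminal, add its productions.
                if dot < len(rhs) and rhs[dot] in NONTERMINALS:
                    for prod in GRAMMAR[rhs[dot]]:
                        if (rhs[dot], prod, 0) not in items:
                            items.add((rhs[dot], prod, 0))
                            changed = True
        return frozenset(items)

    def goto(items, symbol):
        # Advance the dot over `symbol`, then take the closure.
        return closure({(lhs, rhs, dot + 1) for lhs, rhs, dot in items
                        if dot < len(rhs) and rhs[dot] == symbol})

    def show(label, items):
        print(label)
        for lhs, rhs, dot in sorted(items):
            print(' ', lhs, '->', ' '.join(rhs[:dot]), '.', ' '.join(rhs[dot:]))

    s0 = closure({('S', ('a', 'b'), 0), ('S', ('a', 'T'), 0)})
    show('initial state:', s0)                  # S -> . a b ; S -> . a T
    s1 = goto(s0, 'a')
    show('after shifting a:', s1)               # adds T -> . a
    show('after shifting b:', goto(s1, 'b'))    # S -> a b .  (only a reduction)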

A slightly more interesting case would be the rather similar grammar

S → a b
S → T a
T → a

Here, the itemset for the initial state does include the production for T:

S → • a b
S → • T a
T → • a

It's still the case that the only action available in the initial state is to shift a, but now after doing the shift we find ourselves in a state whose itemset is:

S → a • b
T → a •

and now we have two possible actions: a shift of b and a reduction of T → a. Here, the parser needs to use its ability to look one token into the future (assuming that it's an LR(1) parser).

But to let it do so, we need to make a small adjustment. Grammars are always "augmented" before the parsing automaton is constructed. Augmentation makes the end of input explicit by adding a unique end-of-input marker, which can then participate in lookahead checks. The augmented grammar is:

S'→ S $
S → a b
S → T a
T → a

In this grammar, we can see that the nonterminal T can only be followed by the symbol a, and this fact is encoded into the state transition tables, where each item in an itemset is actually annotated with a set of possible lookaheads. (During table construction, all items are annotated, but for the computation of legal actions, only the lookaheads for reductions are considered; a reduction is an item whose • is at the end.)

With this modification, the itemset reached after shifting a is:

S → a • b
T → a •    [ lookahead: { a } ]

and the lookahead makes it absolutely clear which action should be chosen. If the next input is b, it is shifted. If the next input is a, T is reduced.
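For this tiny grammar, that per-item lookahead coincides with the SLR-style FOLLOW set, so one quick way to check the annotation is to compute FOLLOW(T) for the augmented grammar. A minimal sketch, assuming nothing is nullable and that the symbol after a nonterminal is always a terminal (both true here):

    # FOLLOW sets for:  S' -> S $ ; S -> a b | T a ; T -> a
    GRAMMAR = {"S'": [('S', '$')], 'S': [('a', 'b'), ('T', 'a')], 'T': [('a',)]}
    NONTERMINALS = set(GRAMMAR)

    follow = {nt: set() for nt in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for lhs, prods in GRAMMAR.items():
            for rhs in prods:
                for i, sym in enumerate(rhs):
                    if sym not in NONTERMINALS:
                        continue
                    # Whatever can appear right after `sym` belongs in FOLLOW(sym);
                    # if `sym` is last, everything in FOLLOW(lhs) does.
                    new = {rhs[i + 1]} if i + 1 < len(rhs) else follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True

    print(follow['T'])   # {'a'}  -- the lookahead shown above
    print(follow['S'])   # {'$'}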

That precision is not available with LR(0) parsing, in which lookahead is not used. The modified grammar is not LR(0), precisely for the reason you mention: it cannot decide whether or not to reduce T. This issue comes up pretty frequently, which is why LR(0) is rarely if ever useful in practice.
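To see the lookahead decision in action, here is a hand-written sketch of the resulting SLR(1) tables for the augmented grammar, plus a tiny driver. The state numbering is my own, not taken from any particular tool:

    # States: 0 = initial, 1 = after S, 2 = after a, 3 = after T,
    #         4 = after a b, 5 = after T a.
    ACTION = {
        (0, 'a'): ('shift', 2),
        (1, '$'): ('accept',),
        (2, 'b'): ('shift', 4),
        (2, 'a'): ('reduce', 'T', 1),   # T -> a, taken only on lookahead 'a'
        (3, 'a'): ('shift', 5),
        (4, '$'): ('reduce', 'S', 2),   # S -> a b
        (5, '$'): ('reduce', 'S', 2),   # S -> T a
    }
    GOTO = {(0, 'S'): 1, (0, 'T'): 3}

    def parse(tokens):
        tokens = list(tokens) + ['$']
        stack, i = [0], 0
        while True:
            act = ACTION.get((stack[-1], tokens[i]))
            if act is None:
                return 'reject'
            if act[0] == 'accept':
                return 'accept'
            if act[0] == 'shift':
                stack.append(act[1]); i += 1
            else:                         # reduce: pop |rhs| states, then goto
                _, lhs, rhs_len = act
                del stack[-rhs_len:]
                stack.append(GOTO[(stack[-1], lhs)])

    print(parse('ab'))   # accept: lookahead 'b' in state 2 means shift
    print(parse('aa'))   # accept: lookahead 'a' in state 2 means reduce T -> a
    print(parse('a'))    # reject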

ANOTHER ANSWER

Given the example input of "a b"... After reading the 'a', the parser is in a state with the items:

    S -> a . b
    S -> a . T
    T -> . a

Since none of these items has the dot at the end of the production, the state does not call for any reduce action. The only available action is to shift, so the parser reads the next symbol. No need for lookahead.
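A quick way to check that claim is to look for items with the dot at the end. A small sketch with the items hard-coded (using lowercase as a stand-in test for "is a terminal", which works in this grammar):

    # State reached after reading 'a' in the grammar  S -> a b | a T ; T -> a
    state = [
        ('S', ('a', 'b'), 1),   # S -> a . b
        ('S', ('a', 'T'), 1),   # S -> a . T
        ('T', ('a',),     0),   # T -> . a
    ]
    reduces = [(lhs, rhs) for lhs, rhs, dot in state if dot == len(rhs)]
    shifts = {rhs[dot] for lhs, rhs, dot in state
              if dot < len(rhs) and rhs[dot].islower()}
    print('reduce by:', reduces)          # []  -- no reduction available
    print('shift on:', sorted(shifts))    # ['a', 'b']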