I sometimes find myself backing up in a scanner. Here is a typical example:
private static final JSON_Value readArray( StringBuffer sbContent, StringBuffer sbError ){
    JSON_Value value_array = JSON_Value.createNull();
    char c;
    BeginningOfArray:
    while( true ){
        pos++;
        c = sbContent.charAt( pos - 1 );
        switch( c ){
            case '\n': case '\r': case '\t': case ' ': case '\f': // blank space at beginning of array
                continue;
            case ']':
                return JSON_Value.createEmptyArray();
            case ',':
                value_array.size = 1;
                break BeginningOfArray; // process next value
            default:
                pos--; // <-- BACKING UP
                value_array.size = 0;
                break BeginningOfArray; // process next value
        }
    }
    while( true ){
        pos++;
        c = sbContent.charAt( pos - 1 );
        switch( c ){
            // ... etc
In the scanner above, which processes a JSON array, the first element in the array might be missing, in which case the scanner encounters a comma first; otherwise it will encounter some character that begins a value. Since the main loop always begins by advancing, if a non-comma character was encountered, I back up so that it becomes the first character.
Is there an alternative to this that keeps the scanner going forwards at all points, or is backing up in certain situations inevitable?
It depends on the precise nature of the lexical structure of the language you are analysing, but usually backing up can be eliminated. However, it is often more convenient to leave it in, thereby avoiding a certain amount of code duplication and/or awkward control structures.
It's useful to be precise about what you mean by "backing up", because I wouldn't normally consider the example in your question to be an instance. The vast majority of tokens cannot be recognised without looking at the character following the token (for example, an identifier extends until the next character is not alphanumeric). Although it can be avoided with some work, it is normal to consider the terminating character twice: once to determine that it cannot be part of the previous token, and then again to determine what kind of token it might initiate.
Whether that looks like backing up or not will depend to some extent on the paradigm you use to read the input. In general terms, there are two possible paradigms (and a number of mix-and-match implementations which don't cleanly fit either model):
- `peek` and `accept`. `peek` returns the next input character without modifying the position of the read cursor, so two consecutive peeks will see the same character. `accept` moves the read cursor to the next position without returning anything. Handling a character will therefore require one (or more) `peek` operations, and exactly one `accept`. This is probably the more efficient paradigm, but it's a bit of a nuisance to write by hand because of the need for an explicit `accept`. For a simple in-memory input buffer, `peek()` is basically `return inputBuffer.charAt(pos);` and `accept()` is `++pos;`; both of these will normally be in-lined, either by the compiler or by the programmer.

- `get` and `unaccept`. `get` returns the next input character and advances the read cursor by one position, so that the next `get` will read the next character. `unaccept` moves the read cursor back one character, so that the next `get` will reread the previous character. (This is not quite the same as the standard C library's `ungetc`, which allows a different character to be inserted into the read buffer, and which may allow more than one character to be ungotten. More on that later.) For a simple in-memory input buffer, `get()` is basically `c = peek(); accept(); return c;` and `unaccept()` is `--pos;`. Again, they will normally be inlined.

Since a `get(); unaccept();` sequence is precisely the same as `peek();`, the two paradigms are pretty well isomorphic. Despite the fact that `unaccept()` includes a decrement to `pos`, I have a hard time thinking of it as "backing up". It is simply rereading a character, just as with the `peek`/`accept` paradigm.

An excerpt from a highly-simplified and partial lexical analyser (non-contextual) might look something like this:
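(A minimal sketch; the class, the token names, and the `<` / `<<` / `<=` example are mine, for illustration only.)

class Lexer {
    private final String input;
    private int pos = 0;

    Lexer( String input ){ this.input = input; }

    private int peek(){ return pos < input.length() ? input.charAt( pos ) : -1; }
    private void accept(){ ++pos; }

    String nextToken(){
        while( peek() == ' ' || peek() == '\n' ) accept();   // skip whitespace
        int c = peek();
        if( c == '<' ){
            accept();
            if( peek() == '<' ){ accept(); return "SHIFT_LEFT"; }
            if( peek() == '=' ){ accept(); return "LESS_EQUAL"; }
            return "LESS";               // the lookahead was peeked but never accepted
        }
        if( c >= 'a' && c <= 'z' ){
            int start = pos;
            while( peek() >= 'a' && peek() <= 'z' ) accept();
            return "ID:" + input.substring( start, pos );    // terminator not accepted
        }
        if( c == -1 ) return "EOF";
        accept();                        // single-character token
        return "CHAR:" + (char) c;
    }
}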
The same snippet written in `get`/`unaccept` style:
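(Again a sketch; only the two primitives and `nextToken` change, assuming the same illustrative class.)

private int get(){ int c = peek(); ++pos; return c; }
private void unaccept(){ --pos; }

String nextToken(){
    int c = get();
    while( c == ' ' || c == '\n' ) c = get();
    if( c == '<' ){
        int c2 = get();
        if( c2 == '<' ) return "SHIFT_LEFT";   // token ended without a lookahead
        if( c2 == '=' ) return "LESS_EQUAL";
        unaccept();                            // the lookahead will be reread next time
        return "LESS";
    }
    if( c >= 'a' && c <= 'z' ){
        int start = pos - 1;
        do { c = get(); } while( c >= 'a' && c <= 'z' );
        unaccept();                            // reread the terminating character
        return "ID:" + input.substring( start, pos );
    }
    if( c == -1 ){ unaccept(); return "EOF"; } // stay put at end of input
    return "CHAR:" + (char) c;
}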
Since `unaccept()` is common (though not universal) when tokens are accepted, it's tempting to remove it from the actions and simply prefix it before we start trying to recognise the token. Of course, if we do that, we need to add a redundant `get()` (or `accept()`) in the cases where we found the end of the token without a lookahead:
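(Something along these lines, continuing the same sketch; `pos` now starts at 1 so that the very first `unaccept()` is harmless.)

private int pos = 1;    // pretend a lookahead was already read before the first call

String nextToken(){
    unaccept();                         // reread the previous call's lookahead
    int c = get();
    while( c == ' ' || c == '\n' ) c = get();
    if( c == '<' ){
        c = get();
        if( c == '<' ){ get(); return "SHIFT_LEFT"; }    // redundant get()
        if( c == '=' ){ get(); return "LESS_EQUAL"; }    // redundant get()
        return "LESS";                  // lookahead already read; no unaccept here
    }
    if( c >= 'a' && c <= 'z' ){
        int start = pos - 1;
        do { c = get(); } while( c >= 'a' && c <= 'z' );
        return "ID:" + input.substring( start, pos - 1 );  // terminator is the lookahead
    }
    if( c == -1 ){ get(); return "EOF"; }                // redundant get()
    get();                              // redundant get() for single-character tokens
    return "CHAR:" + (char) c;
}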
Now, the only `unaccept()` is safely hidden away in the prolog to the tokeniser. (As it happens, this is effectively the style used by Flex. The "redundant" `accept()`s are actually tucked away in the transition tables. There is no "token accept" action, so after seeing the second character of `<<`, the scanner continues by reading a lookahead character, and then discovers that there is no transition on that character. It then accepts the token, and the lookahead character is reread on the next call to `yylex()`.) That may seem a bit clunky, but it turns out that the simplification of the main loop is worth more than the occasional read of an unneeded lookahead character.
Can we get rid of that last remaining `unaccept()`? Yes, we can, if we're willing to invert control flow by using a push-parser. (I'm very fond of this solution, myself.) In this model, the lexical scanner drives the parsing process, calling the parser every time it has a token. That doesn't play well with recursive descent parsers, but it can work just fine with a parser which maintains its own stack. That's obviously easier for a generated parser than for a hand-written one, and several LALR(1) parser generators support this style. (Nothing stops a top-down parser from using its own stack rather than the call stack, but I don't know if any top-down generators cooperate.) In this model, the lexer has its own loop for reading tokens, and the lookahead character is retained over calls to the parser:
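(For instance, with a hypothetical `Parser` interface standing in for the generated push-parser.)

interface Parser { void push( String token ); }   // hypothetical push-parser API

void tokenize( Parser parser ){
    int c = get();                      // the lookahead survives across parser calls
    for(;;){
        while( c == ' ' || c == '\n' ) c = get();
        if( c == -1 ) break;
        if( c == '<' ){
            c = get();
            if( c == '<' ){ parser.push( "SHIFT_LEFT" ); c = get(); }
            else if( c == '=' ){ parser.push( "LESS_EQUAL" ); c = get(); }
            else parser.push( "LESS" );                  // c is already the next lookahead
        } else if( c >= 'a' && c <= 'z' ){
            int start = pos - 1;
            do { c = get(); } while( c >= 'a' && c <= 'z' );
            parser.push( "ID:" + input.substring( start, pos - 1 ) );  // c retained
        } else {
            parser.push( "CHAR:" + (char) c );
            c = get();
        }
    }
    parser.push( "EOF" );
}

Note that there is no `unaccept()` anywhere: at the end of each token, the character which terminated it is simply kept in `c` for the next iteration.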
There is one more interesting alternative if you are using push parsers, but I've only ever seen it implemented with generated scanners with transition tables; for open-coded scanners, it would be a lot of code duplication.
In a scanner action for a push-parser, nothing stops us from calling the parser twice (or even more often). And if you think about it, the most common case in JSON, for example, is that a value is immediately followed by a comma or a close bracket. If that's the case, we could save an action dispatch by using a collection of patterns which match a value together with the following operator character, and feed both of them into the parser. (If the value is followed by whitespace instead, the whitespace character can just be discarded; there is no need to rescan it. So unless the value is followed by a multi-character token -- and in JSON, that's not possible afaik -- this optimisation would be a win.)
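(A sketch of such an action, reusing the hypothetical `Parser` interface from above; the pattern is assumed to have matched a number plus the single character following it, e.g. the text "123,".)

void numberThenOperator( Parser parser, String text ){
    parser.push( "NUMBER:" + text.substring( 0, text.length() - 1 ) );
    char op = text.charAt( text.length() - 1 );
    parser.push( op == ',' ? "COMMA" : "CLOSE_BRACKET" );  // two tokens, one dispatch
}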
Actually implementing that optimisation is annoying, but a scanner generator could easily accomplish it. I don't know of any which do, though.
All of the above is pretty much stylistic, and I'm not sure that it really addresses the question, because there are cases in many, if not most, languages where backtracking -- real backtracking, not just rereading the transition character -- is much harder to get rid of.
Suppose the language has a token which is a prefix of another token, but the intermediate prefixes are not valid tokens. As a concrete example, C tokens include `.` and `...` but not `..`. So the sequence `a..b` must be tokenised as `a` `.` `.` `b`, while `a...b` is `a` `...` `b`. Before the token `.` can be emitted in the first case, both the second `.` and the following `b` must be read. And if it turns out to be a `b`, then the scanner must back up an extra character. That back-up is very hard to get rid of.

A somewhat similar example occurs in Rust with the range operator `..`. Here, the issue is that `1..5` is a range expression, `1` `..` `5`, while `1.0` is a floating point literal. Again, the `1` token cannot be emitted until two lookahead characters have been consumed, and if both of them are `.`, then either they need to be put back into the input stream or (if control-flow inversion is available) two tokens need to be emitted. So in this case it is easier to get rid of the backing up, but only with a particular architecture. That would be trickier in the C case, because `a..3` is two tokens (`a` and `.`) and the start of a third one (the number `.3…`).
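(To make the C case concrete, here is the dot-handling branch in the illustrative `get`/`unaccept` style from above; note the two consecutive `unaccept()` calls.)

if( c == '.' ){
    int c2 = get();
    if( c2 == '.' ){
        int c3 = get();
        if( c3 == '.' ) return "ELLIPSIS";   // matched "..."
        unaccept();                          // push back c3 ...
        unaccept();                          // ... and the second '.': a real back-up
        return "DOT";
    }
    unaccept();                              // ordinary one-character reread
    return "DOT";
}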
Flex can be asked to warn you if you write lexical specifications which require back-up (in the sense of the last couple of paragraphs; i.e., more than one character). That's because:

- handling the possibility of back-up makes the generated scanner slightly slower throughout (although the slowdown is nowhere near as relevant today as it was when Flex was first written); and, more importantly,
- such lexical specifications are usually mistakes.
As an example of such a mistake, consider the following two rules in a scanner:
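(Something like this pair of flex-style rules; the exact patterns are illustrative. Note that the string rule makes no provision for a missing closing quote.)

\"[^"\n]*\"      { return STRING_LITERAL; }   /* string literal, no escapes */
.                { return yytext[0]; }        /* fallback: any single character */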
An unterminated string literal will cause the fallback rule to be chosen, which involves backing up the entire length of the string. That backup is unnecessary, and will only lead to confusion if the parser attempts error recovery.
Languages which actually require long backups are pretty rare, but I'm certainly not going to go out on a limb and say they don't exist. I'm sure there are some. If possible backup is not limited to a fixed length, then the standard maximum munch scanning algorithm becomes quadratic. With some work, it is possible to do a linear-time scan even in the face of arbitrary backup, but the cost is increased space consumption.