flex: segfault when reading to EOF with `input()`

The program below works as expected when reading from stdin, but segfaults when it is instead lexing a buffer.

The key thing about this example is that one of the rules uses input() to gobble from "!" to EOF (yes, it looks as if I could use a "!".* pattern, but that doesn't produce the intended results in the real case).

$ flex -o eof.c eof.lex
$ cc -o eof eof.c
$ echo -n 'one two!three four' | ./eof
word:<one>
-> 1
-> 2
word:<two>
-> 1
buf=<three four>
-> 3

That's fine, but when I instead run ./eof 'one two!three four', which scans the contents of a buffer set up by yy_scan_string, I get a segfault inside yy_get_next_buffer.

I can't work out which part of the flex manual is telling me I should expect that to happen. Can anyone point out to me what I'm doing wrong?

What's happening is that the lexer finds its way to the end of the input, as expected (an <<EOF>> action confirms this), but it does not stop there, despite the presence of the noyywrap option, and it collapses when it can't find a ‘next’ buffer.
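
For concreteness, the diagnostic <<EOF>> rule I'm referring to is of roughly this shape (a sketch only; the message and the yyterminate() are illustrative rather than the exact code):

<<EOF>>     {   /* diagnostic: confirms the scanner does reach end of input */
    fprintf(stderr, "<<EOF>> rule fired\n");
    yyterminate();  /* yyterminate() is effectively return 0 */
}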

Points:

  • Flex's -d (debug) option doesn't illuminate anything.
  • It is, of course, a little hard to follow what the generated code is doing, but the segfault is indeed around the place where the code checks for yywrap, so the scanner should be getting the message that there is no more input coming.
  • The only real illustration of using input() in the flex manual is one where hitting EOF is reported as an error. Here I'm doing essentially the same as that example, but treating EOF as an acceptable end of the scan.
  • The same behaviour appears when using a reentrant scanner (a sketch of the reentrant variant follows the program listing below).
  • It's worth noting that input() returns 0, not EOF, at end of file, despite what Sect.8 illustrates (cf. the flex repo issue, and the links there), and despite the rather mysterious note about a ‘“real” end-of-file’ in Sect.20. I have a suspicion that this remark in Sect.20 is telling me something terribly important, but I can't work out what; a defensive loop that accepts either convention is sketched just after this list.
  • The -n option to echo means that it does not supply a trailing newline, so the sequence of input characters should be identical in the two cases.
  • This is with flex 2.6.4, on both macOS and Linux.
  • The flex-help list, pointed to by the flex repo, seems to be moribund.
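
Given the point about input() returning 0 rather than EOF, here is a sketch of a more defensive version of the "!" rule (same names and buffer size as in the program below). I wouldn't expect it to change the segfault, since the crash happens inside input() itself, but it makes the end-of-input test explicit and bounds the buffer:

"!"         {  /* defensive variant: accepts either EOF convention and bounds buf */
    char buf[80];
    int c, idx = 0;
    while (idx < (int)(sizeof buf) - 1 && (c = input()) != 0 && c != EOF)
        buf[idx++] = c;
    buf[idx] = '\0';
    printf("buf=<%s>\n", buf);
    return 3;
}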

Program:

ALPHABETIC  [a-zA-Z]
WS      [^a-zA-Z!]

%option noyywrap nounput

%%

{ALPHABETIC}+   {
    printf("word:<%s>\n", yytext);
    return 1;
}
{WS}+   {
    return 2;
}

"!"         {  // gobble to end of input
    char buf[80];
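    /* the loop below relies on input() returning 0 at EOF, which also
       null-terminates buf; there is no overflow check, but the test inputs
       here are well under 80 characters */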
    for (int idx=0; (buf[idx] = input()); idx++) /* empty */ ;
    printf("buf=<%s>\n", buf);
    // YY_FLUSH_BUFFER; /* makes no difference */
    return 3;
}

%%
int main(int argc, char** argv)
{
    switch (argc) {
      case 1: break;
      case 2:
        yy_scan_string(argv[1]);
        break;
      default:
        fprintf(stderr, "Usage: %s [string]\n", argv[0]);
        exit(1);
    }

    int token;
    while ((token = yylex()) != 0) {
        printf("-> %d\n", token);
    }
}
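
For reference, the reentrant variant mentioned in the points above is essentially the same scanner with reentrant added to the %option line and input(yyscanner) in place of input() in the "!" rule; a minimal sketch of the corresponding main(), using the reentrant entry points described in the flex manual:

int main(int argc, char** argv)
{
    if (argc > 2) {
        fprintf(stderr, "Usage: %s [string]\n", argv[0]);
        exit(1);
    }

    yyscan_t scanner;
    yylex_init(&scanner);

    /* with no argument the scanner reads stdin, as before */
    YY_BUFFER_STATE buf = NULL;
    if (argc == 2)
        buf = yy_scan_string(argv[1], scanner);

    int token;
    while ((token = yylex(scanner)) != 0) {
        printf("-> %d\n", token);
    }

    if (buf != NULL)
        yy_delete_buffer(buf, scanner);
    yylex_destroy(scanner);
}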

Further thoughts (edit):

There's another question where the answer discusses flex's behaviour at EOF (thanks to the commenter for the pointer). That answer, though, expands on the problem of input() returning zero rather than EOF, as noted above, and points to a similar set of GitHub issues, without coming to any very definite conclusion. What I'm seeing here still looks rather like a bug in buffer/EOF/yywrap handling, though I can't point to a line in the documentation which makes that definitive.

Further edit: to clarify, it is not a trailing newline at the end of the stdin version that makes that case work (echo -n suppresses the newline).

On reflection, I think this is, after all, a candidate for a bug in flex, and I've reported it as issue 636 there.
