Using lex to tokenize without failing

185 Views Asked by Robert Martin At 23 June 2015 at 21:03

I'm interested in using lex to tokenize my input string, but I do not want it to be possible to "fail". Instead, I want to have some type of DEFAULT or TEXT token, which would contain all the non-matching characters between recognized tokens.

Anyone have experience with something like this?


There are 2 best solutions below

Chris Dodd On 23 June 2015 at 22:33

Use the pattern . as the last of your lex rules to match any single character that no earlier rule matches. You may also need a \n rule to match newlines (a newline is the only character that . does not match).
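A minimal sketch of such a catch-all, assuming a flex- or lex-compatible tool. The token names NUMBER, IDENT and TEXT, their numeric codes, and the small driver in main are hypothetical and only there for illustration; a real scanner would normally take its token codes from the parser's generated header.

```lex
%{
#include <stdio.h>

/* Hypothetical token codes, chosen above the single-character range. */
enum { NUMBER = 258, IDENT, TEXT };
%}
%%
[0-9]+                  { return NUMBER; }
[A-Za-z_][A-Za-z0-9_]*  { return IDENT; }
\n                      { return TEXT; }  /* '.' never matches a newline */
.                       { return TEXT; }  /* any character no earlier rule matched */
%%
/* Report end of input when the current file is exhausted. */
int yywrap(void) { return 1; }

/* Illustrative driver: print each token code with its matched text. */
int main(void)
{
    int tok;
    while ((tok = yylex()) != 0)
        printf("%d \"%s\"\n", tok, yytext);
    return 0;
}
```

Because lex prefers the longest match and, among matches of equal length, the earliest rule, placing the two fallback rules last means they only fire when nothing else applies. Each unmatched character comes back as its own TEXT token here.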

If you want to combine adjacent non-matching characters into a single token, that is harder, and is more easily done in the parser.
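If the grouping has to happen in the lexer anyway, one possible approach is to buffer unmatched characters and hand them out as a single token just before the next recognized token. This is only a sketch under assumptions: the names pending, text_value, flush_text and EMIT are made up for illustration, the buffer size is arbitrary, and the caller is assumed to flush whatever is still buffered once yylex() reports end of input.

```lex
%{
#include <stdio.h>
#include <string.h>

enum { NUMBER = 258, IDENT, TEXT };  /* hypothetical token codes */

static char pending[1024];           /* unmatched characters collected so far */
static size_t pending_len = 0;
static char text_value[1024];        /* text of the most recent TEXT token */

/* Copy the collected characters into text_value and report TEXT. */
static int flush_text(void)
{
    memcpy(text_value, pending, pending_len);
    text_value[pending_len] = '\0';
    pending_len = 0;
    return TEXT;
}

/* If unmatched text is waiting, push the current match back with
   yyless(0) and return TEXT first; the match is rescanned next call. */
#define EMIT(tok) do { \
        if (pending_len > 0) { yyless(0); return flush_text(); } \
        return (tok); \
    } while (0)
%}
%%
[0-9]+                  { EMIT(NUMBER); }
[A-Za-z_][A-Za-z0-9_]*  { EMIT(IDENT); }
.|\n                    {
                          /* Buffer full: emit what we have, rescan this character. */
                          if (pending_len == sizeof pending - 1) {
                              yyless(0);
                              return flush_text();
                          }
                          pending[pending_len++] = yytext[0];
                        }
%%
int yywrap(void) { return 1; }

/* Illustrative driver: print tokens, then flush any trailing text. */
int main(void)
{
    int tok;
    while ((tok = yylex()) != 0)
        printf("%d \"%s\"\n", tok, tok == TEXT ? text_value : yytext);
    if (pending_len > 0) {
        flush_text();
        printf("%d \"%s\"\n", TEXT, text_value);
    }
    return 0;
}
```

The cost is that every token rule has to go through EMIT and end of input needs special handling, which illustrates why collecting the characters in the parser tends to be simpler.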

user207421 On 24 June 2015 at 00:35

To expand on @Chris Dodd's answer, the final rule in any lex script should be:

. return yytext[0];

and don't write any single-character rules like "+" return PLUS;. Just use the special characters you recognize directly in the grammar, e.g. term: term '+' factor;.

This practice:

- saves you a lot of lex rules
- makes your grammar much more readable
- returns illegal characters as tokens to the parser, where you can do anything you like with them, or nothing, in which case you get the benefit of yacc's error recovery.
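A small scanner written in this style might look like the following sketch. The NUMBER token, its code, the whitespace rule and the driver are assumptions for illustration; in a real project the token codes would come from the header that yacc generates.

```lex
%{
#include <stdio.h>

enum { NUMBER = 258 };  /* hypothetical; normally defined in y.tab.h */
%}
%%
[0-9]+      { return NUMBER; }
[ \t\n]+    { /* skip whitespace */ }
.           { return yytext[0]; }  /* '+', '*', '(' and so on, plus any illegal character */
%%
int yywrap(void) { return 1; }

/* Illustrative driver: show which code the scanner returns for each token. */
int main(void)
{
    int tok;
    while ((tok = yylex()) != 0)
        printf("%d \"%s\"\n", tok, yytext);
    return 0;
}
```

With this scanner the grammar can refer to operators as character literals such as '+', and a character the grammar never mentions simply reaches the parser as an unexpected token. One caveat: on platforms where char is signed, bytes above 127 would come back as negative values, so casting to unsigned char may be worth considering.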
