I am using F#'s FsLex to generate a lexer. I have difficulty understanding the following two lines from a textbook. Why is the newline (`'\n'`) treated differently from the other whitespace? In particular, what does `lexbuf.EndPos <- lexbuf.EndPos.NextLine` do that a plain `Tokenize lexbuf` does not?
rule Tokenize = parse
| [' ' '\t' '\r'] { Tokenize lexbuf }
| '\n' { lexbuf.EndPos <- lexbuf.EndPos.NextLine; Tokenize lexbuf }
A `rule` is essentially a function that takes a lexer buffer as an argument. Each case on the left side of your rule matches a given character (e.g., `'\n'`) or class of characters (`[' ' '\t' '\r']`) in your input. The expression on the right side of the rule, inside the curly braces `{ ... }`, defines an action. The definition you pasted in appears to be the start of a tokenizer.

The expression `Tokenize lexbuf` is a recursive call to the `Tokenize` rule. In essence, this rule ignores whitespace characters. Why? Because tokenizers aim to simplify the input. Whitespace typically has no meaning in a programming language, so this rule filters it out. Tokenized input generally makes writing your parser simpler later on. You'll eventually want to add other cases to your `Tokenize` rule (e.g., for keywords, assignment statements, and other expressions) to produce a complete lexer definition.

The second rule, the one that matches
`'\n'`, also ignores the whitespace, but as you correctly point out, it does something different: it advances the recorded position (`lexbuf.EndPos`) to the start of the next line (`lexbuf.EndPos.NextLine`, which increments the line count) before recursively calling `Tokenize` again. Why? So that the line number stays correct as the lexer consumes the input. Since you're only showing a lexer fragment here, I can only guess what `lexbuf.EndPos` is used for, but it's pretty common to keep that information around for diagnostic purposes, such as reporting the line of a syntax error.
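For context, here is a hypothetical sketch of how the `Tokenize` rule might grow once you add real token cases. This is not from your textbook: the `Token` constructors (`INT`, `PLUS`, `EOF`) are assumed to come from a parser definition, and `LexBuffer<_>.LexemeString` is the FsLexYacc runtime helper for reading the matched text:

```fsharp
// Sketch of an extended .fsl rule; assumes a parser-supplied token type like
//   type Token = INT of int | PLUS | EOF
rule Tokenize = parse
  | [' ' '\t' '\r']  { Tokenize lexbuf }   // skip intra-line whitespace
  | '\n'             { lexbuf.EndPos <- lexbuf.EndPos.NextLine
                       Tokenize lexbuf }   // skip newline, but count the line
  | ['0'-'9']+       { INT (int (LexBuffer<_>.LexemeString lexbuf)) }
  | '+'              { PLUS }
  | eof              { EOF }
```

Note that the whitespace cases recurse (they produce no token), while the real token cases return a value to the caller.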
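And as a sketch of why that position bookkeeping matters: with the FsLexYacc runtime (`FSharp.Text.Lexing`), `lexbuf.EndPos` is a `Position` whose `Line` and `Column` members a driver can read when reporting errors. This is illustrative, assuming that runtime; `reportError` is a made-up helper:

```fsharp
open FSharp.Text.Lexing

// Hypothetical helper: report where the lexer currently is.
// Position.Line is zero-based, so add 1 for human-readable output.
let reportError (lexbuf: LexBuffer<char>) (msg: string) =
    let pos = lexbuf.EndPos
    eprintfn "Error at line %d, column %d: %s" (pos.Line + 1) pos.Column msg
```

Without the `'\n'` case updating `EndPos` via `NextLine`, the line count would never advance, and every error would be reported as occurring on line 1.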