Language lexing: better performance to lex a string all at once or individually?

1.1k Views Asked by At

I'm attempting to build my first C-like programming language, likely an interpreter and I've just made the first step aka the lexer.

I've thought about taking the lazy route by simply lexing the entire source code stream all at one then then have the parser process the data.

I've noticed that many other compilers and interpreters only lex during parsing when the parser module asks for another token.

Is it quicker in terms of code performance for a program to lex source code all at once then parse the resulting tokens or lex and parse tokens individually?

1

There are 1 best solutions below

3
On BEST ANSWER

"faster" is a bit of a fuzzy word. There are different kinds of speed (latency, absolute start-to-finish duration, compile speed, execution speed), and depending on how you implement your language's front-end and backend, either approach can be faster.

Also, faster is not always better. If your parser is technically faster, but uses too much memory, it could crash or at the least end up swapping, which would slow it down again. If your parser is lightning-fast but produces inefficient code, your users will pay for your faster development speed. You'll have to write actual code and run it in a profiler to be able to tell what is really better, and come up with which criteria are important to you.

Tokenizing/Lexing everything at once at the start means you might be able to optimize memory allocation and thus take less time resizing your token list etc., but it also means the entire file has to be lexed before it can even be partially parsed.

OTOH if you parse as-needed, you may have to append to your arrays in small steps more often, so you'll pay a memory penalty, but in case of e.g. an interpreted language like JavaScript, you may only have to parse the parts that are actually used for this run-through.

So a lot of it depends on the details of your language, and the hardware you expect to be running on. In embedded systems with little memory and no swap, you may have no choice but to progressively lex, as the whole program source code might not fit in memory. If your language's syntax needs a lot of lookahead, you may not see any benefit of progressively lexing because you're reading it all anyway...