I am writing a tool based on Clang libtooling that checks and warns against functions that are too similar. After obtaining a clang::FunctionDecl, I want to perform some similarity check on the source code.
Currently, I can get the source text following this question, but a similarity check based on source text is neither precise enough nor fast enough. Is there a way of getting the source code in the form of a token sequence? It would be helpful if I could write something like this:
SomeContainer<Token> tokens = getTokenSequence(funcDecl);
for (const auto &t : tokens)
// ...
The main way that Clang LibTooling offers for getting the token sequence is to first get the text by calling clang::SourceManager::getBufferOrNone, then run clang::Lexer on it to get the tokens. This runs the lexer in "raw" mode, meaning it does not do any preprocessing and has no record of the preprocessing that took place during the real parse.

Here is a visitor function that prints the tokens of every function definition (an excerpt of the complete program further below):
When this visitor is run on the input:
it prints:
As an alternative to the "raw" lexer, if you want details about the preprocessor actions performed during the real parse, you could use clang::PPCallbacks to hook into it while it runs. But that seems like overkill for your intended purpose, so I won't elaborate in this answer.

Complete example program:
Makefile: