Adding a preprocessing layer to ANTLR4 without losing token offsets

I am currently trying to implement a preprocessor in ANTLR4 for C# and am having a lot of trouble finding any information about it. Digging through the GitHub sources of the C# project has sadly not been successful either.

My goal is to retain the token offsets, so that line and column numbers are not thrown off by a preprocessed stream. A rough example:

 #define foo(bar) foobar(bar + bar * bar / 0.2)
 int smthng = 2;
 smthng += foo(12); // the ; should be at the same position as if the macro were a function call

I hope you can point me to the right docs or to a simple example solution somewhere.

Kind regards, Marco

PS: I am not looking for a solution in which I pass an already preprocessed stream to ANTLR4, as that would shift the offsets in the code.

There are 2 best solutions below

It depends on how you preprocess the stream. If you replace all preprocessor lines (and the lines excluded by #ifdef etc.) with line breaks, your overall line numbers won't be thrown off. In fact, I even recommend doing the preprocessing outside of the normal parse run (e.g. by having the input stream do it).
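
A minimal sketch of that idea, written in Python only to keep it self-contained (it is not ANTLR or C# code, and `blank_directives` is a made-up name). As a small variation on "replace with line breaks", it overwrites each directive line with spaces of the same length, so character offsets are preserved as well as line numbers. It assumes a directive is any line whose first non-blank character is `#`; it does not handle backslash continuations or blank out inactive `#ifdef` regions, and it does not expand macros.

```python
import re

# Assumption: a directive is a line whose first non-blank character is '#'.
# The pattern stops before the line terminator, so '\n' and '\r\n' survive.
DIRECTIVE = re.compile(r"^[ \t]*#[^\r\n]*", re.MULTILINE)


def blank_directives(source: str) -> str:
    """Overwrite every directive line with spaces of equal length.

    Line breaks are left untouched, so each remaining token keeps its
    original line, column, and absolute character offset.
    """
    return DIRECTIVE.sub(lambda m: " " * len(m.group()), source)


if __name__ == "__main__":
    src = "#define LIMIT 10\nint a = LIMIT;\n"
    cleaned = blank_directives(src)
    # The text after the directive starts at the same offset as before.
    print(cleaned.index("int") == src.index("int"))
```

The cleaned text can then be handed to the lexer as usual, with the directive text collected separately if macro definitions are needed later.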

Years ago I wrote a Windows .rc file parser. These resource files are in many respects like C header files, so you need a preprocessor, macro handling with stringizing and charizing support, and a few more things. I wrote it for ANTLR 2.7 (now you see how old it is :-) ). Still, I believe it's a good example of how to make preprocessing work (including #include).

There are two strategies for parsing preprocessor directives:

  • One-step Processing
  • Two-step Processing

The second approach is not appropriate for you, because macro expansion spoils the text locations.

With the first approach, you can tokenize preprocessor directives in a single common lexer and thus keep the correct text locations.

See the Objective-C grammar and the One-step Processing section of the article "Parsing Preprocessor Directives in Objective-C":

One-step processing involves the simultaneous parsing of directives and tokens of the primary language. ANTLR introduces a system of channels isolating tokens by their type. For example, tokens of the primary language and hidden tokens (whitespaces and comments). The directive tokens can be added to a separate named channel.
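
To illustrate the channel idea without depending on the ANTLR runtime, here is a toy single-pass tokenizer in Python. It is only a sketch: the token kinds, the channel names (`default`, `hidden`, `directives`) and the `tokenize` function are invented for this example and are not ANTLR's API. Every token records a 1-based line and a 0-based column, and directives are routed to their own channel instead of being removed, so the positions of ordinary tokens are unaffected.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str
    line: int      # 1-based
    column: int    # 0-based position within the line
    channel: str


# Hypothetical token set: (kind, regex, channel). Order matters, so COMMENT
# is tried before OP and '//' is not read as two division operators.
# Simplification: '#' starts a directive anywhere, not only at line start.
TOKEN_SPEC = [
    ("DIRECTIVE", r"\#[^\r\n]*", "directives"),
    ("COMMENT", r"//[^\r\n]*", "hidden"),
    ("NEWLINE", r"\r?\n", "hidden"),
    ("WS", r"[ \t]+", "hidden"),
    ("NUMBER", r"\d+(?:\.\d+)?", "default"),
    ("ID", r"[A-Za-z_]\w*", "default"),
    ("OP", r"\+=|[-+*/=();,]", "default"),
]
MASTER = re.compile("|".join(f"(?P<{k}>{p})" for k, p, _ in TOKEN_SPEC))
CHANNEL = {k: c for k, _, c in TOKEN_SPEC}


def tokenize(source: str) -> List[Token]:
    """Lex directives and ordinary tokens in one pass, tagging each with a channel."""
    tokens = []
    line, line_start, pos = 1, 0, 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(
                f"unexpected {source[pos]!r} at {line}:{pos - line_start}")
        kind = m.lastgroup
        tokens.append(Token(kind, m.group(), line, pos - line_start, CHANNEL[kind]))
        pos = m.end()
        if kind == "NEWLINE":
            line += 1
            line_start = pos
    return tokens


if __name__ == "__main__":
    src = ("#define foo(bar) foobar(bar + bar * bar / 0.2)\n"
           "int smthng = 2;\n"
           "smthng += foo(12);\n")
    for tok in tokenize(src):
        if tok.channel != "hidden":
            print(tok)
```

With the question's example as input, the `;` closing the third line is reported at line 3, column 17, exactly where it sits in the source, because the `#define` line was tokenized rather than stripped. A parser would read only the `default` channel, while a later pass could pick up the `directives` channel.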

In some cases, directive tokens can also be included in the common channel, which is more convenient. See the token NS_OPTIONS and the rule enumSpecifier, for example:

enumSpecifier
    : 'enum' (identifier? ':' typeName)? (identifier ('{' enumeratorList '}')? | '{' enumeratorList '}')
    | ('NS_OPTIONS' | 'NS_ENUM') LP typeName ',' identifier RP '{' enumeratorList '}'
    ;

You can also lex preprocessor directives as plain strings and parse them later, for example with a lexer rule such as DEFINE: '#define' ~[\r\n]*;
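
The "parse it later" step might look roughly like the following Python sketch, which takes the raw text of one captured `#define` token and splits it into name, parameters and body. `Macro` and `parse_define` are hypothetical names, and the regular expression is a simplification: it assumes a single-line directive and does not handle variadic macros or backslash continuations.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Macro(NamedTuple):
    name: str
    params: Optional[Tuple[str, ...]]  # None for object-like macros
    body: str


# As in C, a parameter list must follow the macro name with no space between.
DEFINE = re.compile(
    r"#\s*define\s+(?P<name>[A-Za-z_]\w*)"
    r"(?:\((?P<params>[^)]*)\))?"
    r"\s*(?P<body>.*)"
)


def parse_define(text: str) -> Macro:
    """Split the raw text of a '#define' token into its parts."""
    m = DEFINE.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"not a #define directive: {text!r}")
    raw = m.group("params")
    params = None if raw is None else tuple(
        p.strip() for p in raw.split(",") if p.strip())
    return Macro(m.group("name"), params, m.group("body").strip())


if __name__ == "__main__":
    print(parse_define("#define foo(bar) foobar(bar + bar * bar / 0.2)"))
```

Because the directive stays a single token in the original stream, this second parse has no effect on the offsets of the surrounding code.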

At Swiftify, an Objective-C to Swift converter, we use the one-step processing approach. I'm going to update the Objective-C grammar soon.