tl;dr: How do you emulate C's #define in jison without a separate pre-processing step?
I am working on a relatively simple grammar with a feature to assign an identifier to a chunk of code that can then be re-used later for brevity. Example:
# Valid grammar with various elements of different types
foo x.3 y.4 z.5
# Assign an id to a chunk of code. Everything after -> is assigned to id
fill_1-> bar a.1 b.2 c.3
# Use chunk of code later
# Should be equivalent to parsing: "baz d.4 bar a.1 b.2 c.3 e.5"
baz d.4 ["fill_1"] e.5
So far, I've got my parser set up to correctly identify assignment lines and store the part to the right of the '->' in a dictionary available to other parser actions. The code related to the define action is below:
// Lexer
HSPC [ \t]
ID [a-zA-Z_][a-zA-Z0-9_]*

%%

{ID}{HSPC}*"->" {
    this.begin("FINISHLINE");
    yytext = yytext.replace("->", "").trim();
    return "DEFINE";
}

('"'{ID}'"') {
    yytext = yytext.slice(1, -1);
    return "QUOTED_ID";
}

<FINISHLINE>.* {
    this.begin("INITIAL");
    yytext = yytext.trim();
    return "REST_OF_LINE";
}

%%
// Parser
statement
    : "[" QUOTED_ID "]"
        { $$ = (defines[$2] ? defines[$2] : ""); }
    | DEFINE REST_OF_LINE
        { defines[$1] = $2; }
    ;

%%

var defines = {};
How can I get jison to actually tokenize and parse that saved snippet of code? Do I need to take an AST approach? Is there a way to inject the code into the parser? Should this happen in the lexing stage or the parsing stage? Would love to hear multiple strategies one could take with short example snippets.
Thanks!
If by "take an AST approach", you mean "build ASTs for the original unsubstituted program, and for the substititions, and splice them together", you're in for a hard time. There's no guarantee that your substituted string matches any valid nonterminal in your grammar, so it isn't easy to build a tree for it. Your main program before substitutions is also extremely unlikely to be parsable by your full grammar. [You can overcome these difficulties by building substring parsers and doing wizardry with tree fragment gluing, but would be a lot of work [we are doing something like this for a C preprocessor analyzer], and I doubt ANTLR would help you much].
The usual approach for this is to have the lexer keep a stack of partially-read input streams, with the bottom stream being the main program and nested streams corresponding to partially-read macro invocations (you need more than one if a macro can invoke another; surely your language allows something like fill_2-> x.1 ["fill_1"] y.3?). This means the lexer must:

- read tokens from whatever stream is on top of the stack;
- on encountering a macro use, look up its body and push a new stream for it, continuing to lex from that stream;
- on exhausting a stream, pop it and resume the one below, reporting end-of-input only when the bottom stream is exhausted.

A sketch of this in jison follows.
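In jison you can get the effect of a pushed stream without managing an explicit stack: the generated lexer's unput() method pushes text back onto the front of the input buffer, and the lexer rescans it. A minimal sketch, not from the question's code, assuming the DEFINE action stores bodies in the shared yy object (yy.defines[$1] = $2) so the lexer can see them:

// Lexer rule (sketch): expand a macro use during lexing.
"["{HSPC}*'"'{ID}'"'{HSPC}*"]" {
    var name = yytext.replace(/^\[\s*"/, "").replace(/"\s*\]$/, "");
    var body = (yy.defines && yy.defines[name]) || "";
    // Push the body back onto the input; the lexer rescans it next,
    // so a body that itself contains ["other_id"] expands as well.
    // The input buffer is effectively the stack of streams.
    this.unput(body);
    // No return value, so the lexer keeps scanning from the injected text.
}

Two caveats: unput() was designed for giving back part of the current match, so line/column bookkeeping will drift once you inject arbitrary text; and because an LALR parser reads a token of lookahead, the first token after a definition line can be lexed before the DEFINE statement's action has run. If a macro may be used on the line right after its definition, record the definition in the lexer (in the REST_OF_LINE action, say) rather than in a parser action.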
You may someday decide that you need parameters on your macros. You can typically implement these as streams, too.
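As a sketch of how that might look (the $1/$2 placeholder syntax below is invented, not part of your grammar): collect the arguments at the use site, substitute them into the stored body text, and push the result back as a new stream exactly as above.

// Hypothetical positional parameters: a stored body such as
// "bar $1 b.2 $2" is expanded against the use site's arguments
// before being pushed back onto the input.
function expand(body, args) {
    return body.replace(/\$(\d+)/g, function (match, n) {
        return args[n - 1] !== undefined ? args[n - 1] : match;
    });
}

// expand("bar $1 b.2 $2", ["a.9", "c.7"]) yields "bar a.9 b.2 c.7"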
You could imagine lexing the tokens and storing a token stream, rather than text, as the macro body; then macro-call detection and body insertion could happen after the lexer and before the parser. Since there is presumably an interface between the two, putting code in between to manage this seems like a practical way to go. A complication arises if your language allows the same sequence of characters to be lexed differently in different places in the program: how will the macro capture know how to lex the body in that case?
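Here is a rough sketch of such a shim for jison. The lexer interface it forwards (setInput, lex, yytext, yylloc, yylineno) is jison's; the rest is assumption: it presumes the lexer emits DEFINE with the macro name in yytext, a MACRO_USE token with the name in yytext at use sites, and EOL at line ends.

// Wraps a jison-generated lexer; the parser talks to this object instead.
function MacroLexer(inner) {
    this.inner = inner;   // the generated lexer
    this.queue = [];      // spliced-in tokens waiting to be replayed
    this.defines = {};    // macro name -> recorded [{tok, text}] list
    this.yytext = "";
}

MacroLexer.prototype.setInput = function (input, yy) {
    this.inner.setInput(input, yy);
    return this;
};

MacroLexer.prototype.lex = function () {
    var tok, text;
    if (this.queue.length) {            // replay a spliced-in token first
        var t = this.queue.shift();
        tok = t.tok;
        text = t.text;
    } else {
        tok = this.inner.lex();
        text = this.inner.yytext;
    }
    if (tok === "DEFINE") {
        // Record the body's tokens up to end of line; the definition
        // itself never reaches the parser.
        var body = [];
        for (var t2 = this.inner.lex();
             t2 !== "EOL" && t2 !== this.inner.EOF;
             t2 = this.inner.lex()) {
            body.push({ tok: t2, text: this.inner.yytext });
        }
        this.defines[text] = body;
        return this.lex();              // hand the parser the next real token
    }
    if (tok === "MACRO_USE") {
        // Splice the recorded body in front of the pending queue, so
        // macros whose bodies use other macros expand correctly.
        this.queue = (this.defines[text] || []).concat(this.queue);
        return this.lex();
    }
    this.yytext = text;
    // Locations for replayed tokens are approximate at best; forward
    // the inner lexer's bookkeeping for want of anything better.
    this.yylloc = this.inner.yylloc;
    this.yylineno = this.inner.yylineno;
    return tok;
};

Hook it up before parsing with parser.lexer = new MacroLexer(parser.lexer). This also answers the "lexing stage or parsing stage" question: detection and splicing live in neither, but in a layer between the two.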
I don't know enough (or even much) about jison's internals to tell you how to accomplish this in more detail.