This is a simplified version of a larger parser. I has been driving me slightly mad as I can't seem to get it to work consistently. The larger version works but has so many edge cases it does not work well enough, I'm finding myself having to teach python the entire syntax and all I want is the #ifdef pre-processor directives.
I tried to simplify it and got to this (below) but it never finds the ifdef / endif. Why?
from parsimonious.grammar import Grammar
CPP = Grammar(r"""
file = (code_block / nl / ws)+
label = ~"[A-z_]+\w*" # Alpha or _ followed by alphanumeric or _
ifdef = "#ifdef" ws+ label nl
else = "#else" nl
endif = "#endif" nl
line = ~r".+" nl
lines = line+
ws = ~r"[\s\t]+" # One or more whitespace
nl = ~r"[\s\t]*[\n\r]+" # Zero or more whitespace followed by a newline
if_block = ifdef code_block* endif
if_else_block = ifdef code_block* else code_block* endif
code_block = (if_block / if_else_block / lines)
""")
code = """
int a;
#ifdef test1
a += 2;
#ifdef test2
a--;
#endif
#else
a += 20;
#endif
"""
res = CPP.parse(code)
print(res)
It finds many 'lines' but never a 'if' block. (note 'code' is not suppose to do anything useful, it's just some sample text)
PEG parsing is essentially greedy, as parsimonious notes in its documentation. (This is not parsimonious-specific. You'll hit the same issue with any PEG parser.)
So you have a pattern for
code_block
:Suppose the parser is trying to match that at some point in the input. If the corresponding text starts with an
if_block
orif_else_block
(more about this later), then that's fine. But suppose it starts with something else, likeint a;
? In that case, the parser is going to try to matchlines
, which is to sayline*
, which is to say as many repetitions ofline
as possible, as per the documentation above.Now,
line
is just a line: any arbitrary sequence of characters terminated at the first newline. Soline
will happily match#ifdef test1
or#endif
or whatever.In other words, if a match for
code_block
is attempted at some point in the input and the first thing in the input is not anif_block
or anif_else_block
, thencode_block
will greedily match the rest of the input.Now, let's go back to matching
if_block
(and note that the same comment will apply to matchingif_else_block
). The rule forif_block
is:ifdef
is a line starting with#ifdef
[Note 1]. The next thing is acode_block
, which as we just saw will greedily match to the end of the input (unless it happens to immediately match a nested#ifdef
). In other words,code_block
does not care that you would like it to stop when it reaches an#endif
; it knows nothing about#endif
. It's a greedy little monster chomping its way across your input. Even if#endif
is the next line, it will get absorbed bylines
, along with the rest of the input.Since
#endif
can never be recognised,if_block
can never match and the parser will have to fall back to the next possibility,if_else_block
. But that can never match either, so it will have to fall back to the greedy monster.In short, this grammatical model cannot work, and that's not the fault of parsimonious. If you stick to this model, it doesn't matter which PEG parser you use, you will end up with the same problem.
From the above analysis, it seems clear that any repeated match for lines needs to be aware of preprocessor directives, in order to not match them [Note 2]. That can be done, but it makes things a bit more complicated because it forces us to categorise the possible inputs, and not just rely on ordered alternatives [Note 3].
So the key is that there are two kinds of input line: preprocessor directives (at least, the ones we're interested in) and everything else. (This is a radical simplification, because "everything else" includes comments and quoted literals, which need to be handled separately. But we'll ignore that for now.) It's easy to see how to recognize a preprocessor directive, so let's focus on recognising a line which is not a preprocessor directive. Fortunately, parsimonious (like many PEG parser generators) has a negative lookahead operator, which we can use to our benefit here:
(Below, there's a correction to the definitions of
ws
andnl
)There's no need for
lines
, because of another issue I didn't note above. In the definitiona
code_block
can be any number of lines but only exactly one preprocessor block. That's not right: acode_block
can be any mix of any number of lines and nested blocks. The correct definition would be:It seems to me simpler to use a conditional in order to avoid having two different preprocessor block definitions. After all, both of them start with an
#ifdef
, and there's no point trying to scan that again after a fallback, even if parsimonious manages to do the rescan efficiently by packratting. So I propose the following:(I removed the redundant
+
fromifdef
becausews
already includes a repetition operator.)At this point, a small digression to fix your regex patterns. The goal here is that
ws
matches any horizontal whitespace (excluding newline characters), whilenl
matches possible horizontal whitespace followed by a sequence of newline characters. But yourws
pattern include\s
, which in Python regexes matches any whitespace including newlines, which is a bit too generous. And while we're fixing regexes,[A-z]
is not the same as[A-Za-z]
becausea
does not immediately followZ
; they are separated by several punctuation characters (including_
, as it happens, but also[]
among others). The following isn't perfect either, but it's a start:So, to put all that together:
That seems to handle your example code, at least.
Notes
This isn't really correct, because in C a preprocessor directive can be preceded with whitespace, and there is no requirement for the
#
to be immediately followed by the directive name. But that's not the issue here.Warning: opinion follows. Contrary to the oft-repeated advertisement that PEG parsers "simplify the grammar" by not forcing a division into lexical analysis and syntax, it seems to me that in many cases (such as this one), a systematic separation of concerns would actually be the simplest.
There is a place for ordered alternatives. They work well in lexical analysis, for example. But that's well outside the scope of this answer.