I'd like to parse the following:
name:name
where the names start and end with an alnum, and can contain any combination of alnum and spaces inside. They could also be blank. My rules for this are:
identifier = alnum (space* alnum)*;
name = (identifier | zlen) >sName $pName %fName;
The names can be separated by a colon and optionally spaces inbetween the names and the colon. My rules for this are:
sep = space* ":" space*;
main := name sep name;
This doesn't work because apparently the space*
in identifier
and the space*
in sep
confuse the parser. I end up getting the action fName
executed in every space of the name.
If I change sep to:
sep = ":";
then everything is fine. How do I modify these rules so that the parser does what I intend?
Source code for this question here: https://gist.github.com/1661150
There are two basic solutions to this kind of problem.
In this case, I would choose a hybrid approach. Use actions to record the start and ending positions of a
name
: these actions can be executed safely many times since they just record locations. Once you are sure you are past the name, execute a different action that will only execute once.Here, the syntax for
name
includes arbitrary spaces and at least one alphanumeric character. When the first alphanumeric character is encountered, its location is saved inname_start
. Whenever run of alphanumeric characters ends, the location of the following character is saved inname_end
. The<:
is technically unnecessary but it reduces how often themarkNameEnd
action is executed.Just be sure not to place such an expression next to any spaces.
I have not tested the above code. You should look at the Graphviz visualization of the state machine before you use.
What Ragel is doing
With your original code, let's suppose the input is as follows:
The Ragel machine scans from left to right, finds the start of a
name
, and scans over the alphanumeric characters.The next character is a space. So either we have encountered a space inside a word, or the first space after the end of a word. How does Ragel choose?
Ragel chooses both options, at the same time. This is very important. Ragel is trying to simulate a nondeterministic finite automaton, but since your computer is deterministic, the easiest way to do that is to convert the NFA to a DFA which simulates an unlimited number of NFAs in parallel. Since the NFAs have a finite number of states (hence the name), the DFAs also have a finite number of states, so this technique works.
After encountering the space, you have one NFA in the following state, looking for the rest of the
name
:The second NFA is in the following state, and it assumes that the
name
has already ended (and this NFA executes thefName
action "prematurely"):It's obvious to you and it's obvious to me that only the first NFA is correct. But machines created with Ragel only look at one character at a time, they don't look ahead to see which option is correct. The second NFA will eventually encounter an alphanumeric character where it was expecting to see
":"
, and since that's not allowed, the second NFA will disappear.A look at the Ragel documentation
Here's the description of
%
:The action gets executed for transitions that do not necessarily contribute to a successful parse. See the Ragel guide, Chapter 4, "Controlling Nondeterminism" for more information about nondeterminism in Ragel, although the techniques in Chapter 4 won't help you in this particular case since the actions in your machine can only be disambiguated with unbound lookahead, which is not allowed in finite state machines.