How to get Ragel to parse two names separated by (space* ":" space*)?

1k Views Asked by At

I'd like to parse the following:

name:name

where the names start and end with an alnum, and can contain any combination of alnum and spaces inside. They could also be blank. My rules for this are:

identifier = alnum (space* alnum)*;
name       = (identifier | zlen) >sName $pName %fName;

The names can be separated by a colon and optionally spaces inbetween the names and the colon. My rules for this are:

sep = space* ":" space*;
main := name sep name;

This doesn't work because apparently the space* in identifier and the space* in sep confuse the parser. I end up getting the action fName executed in every space of the name.

If I change sep to:

sep = ":";

then everything is fine. How do I modify these rules so that the parser does what I intend?

Source code for this question here: https://gist.github.com/1661150

1

There are 1 best solutions below

2
On

There are two basic solutions to this kind of problem.

  1. Define the action so it can be executed safely multiple times,
  2. Change the syntax so the action is only executed once.

In this case, I would choose a hybrid approach. Use actions to record the start and ending positions of a name: these actions can be executed safely many times since they just record locations. Once you are sure you are past the name, execute a different action that will only execute once.

/* C code */
char *name_start, *name_end;

/* Ragel code */
action markNameStart { name_start = p; }
action markNameEnd { name_end = p; }
action nameAction {
    /* Clumsy since name is not nul-terminated */
    fputs("Name = ", stdout);
    fwrite(name_start, 1, name_end - name_start, stdout);
    fputc('\n', stdout);
}

name = space* %markNameStart
       (alnum+ %markNameEnd <: space*)+
       %nameAction ;
main := name ":" name ;

Here, the syntax for name includes arbitrary spaces and at least one alphanumeric character. When the first alphanumeric character is encountered, its location is saved in name_start. Whenever run of alphanumeric characters ends, the location of the following character is saved in name_end. The <: is technically unnecessary but it reduces how often the markNameEnd action is executed.

Just be sure not to place such an expression next to any spaces.

I have not tested the above code. You should look at the Graphviz visualization of the state machine before you use.

What Ragel is doing

With your original code, let's suppose the input is as follows:

Hello world : Goodbye world

The Ragel machine scans from left to right, finds the start of a name, and scans over the alphanumeric characters.

Hello world : Goodbye world
    ↑

The next character is a space. So either we have encountered a space inside a word, or the first space after the end of a word. How does Ragel choose?

Ragel chooses both options, at the same time. This is very important. Ragel is trying to simulate a nondeterministic finite automaton, but since your computer is deterministic, the easiest way to do that is to convert the NFA to a DFA which simulates an unlimited number of NFAs in parallel. Since the NFAs have a finite number of states (hence the name), the DFAs also have a finite number of states, so this technique works.

After encountering the space, you have one NFA in the following state, looking for the rest of the name:

identifier = alnum (space* alnum)*;
                    ↑

main := name sep name;
        ↑

The second NFA is in the following state, and it assumes that the name has already ended (and this NFA executes the fName action "prematurely"):

sep = space* ":" space*;
      ↑

main := name sep name;
             ↑

It's obvious to you and it's obvious to me that only the first NFA is correct. But machines created with Ragel only look at one character at a time, they don't look ahead to see which option is correct. The second NFA will eventually encounter an alphanumeric character where it was expecting to see ":", and since that's not allowed, the second NFA will disappear.

A look at the Ragel documentation

Here's the description of %:

expr % action

The leaving action operator queues an action for embedding into the transitions that go out of a machine via a final state.

The action gets executed for transitions that do not necessarily contribute to a successful parse. See the Ragel guide, Chapter 4, "Controlling Nondeterminism" for more information about nondeterminism in Ragel, although the techniques in Chapter 4 won't help you in this particular case since the actions in your machine can only be disambiguated with unbound lookahead, which is not allowed in finite state machines.