Whitespace-sensitive parsing with Parse::RecDescent (e.g. HAML, Python)

I'm trying to parse HAML (haml.info) with Parse::RecDescent. If you don't know HAML, the problem in question is the same as parsing Python: blocks of syntax are grouped by indentation level.

Starting with a very simple subset, I've tried a few approaches but I think I don't quite understand either the greediness or recursive order of P::RD. Given the haml:

%p
  %span foo

The simplest grammar I have that I think should work is below (it includes some rules that aren't needed for the snippet above):

<autotree>

startrule           : <skip:''> block(s?)
non_space           : /[^ ]/
space               : ' '
indent              : space(s?)
indented_line       : indent line
indented_lines      : indented_line(s) <reject: do { Perl6::Junction::any(map { $_->level } @{$item[1]}) != $item[1][0]->level }>
block               : indented_line block <reject: do { $item[2]->level <= $item[1]->level }>
                    | indented_lines
line                : single_line | multiple_lines
single_line         : line_head space line_body newline | line_head space(s?) newline | plain_text newline

# ALL subsequent lines ending in | are consumed
multiple_lines      : line_head space line_body continuation_marker newline continuation_line(s)
continuation_marker : space(s) '|' space(s?)
continuation_line   : space(s?) line_body continuation_marker

newline      : "\n"
line_head    : haml_comment | html_element
haml_comment : '-#'
html_element : '%' tag

# TODO: xhtml tags technically allow unicode
tag_start_char : /[:_a-z]/i
tag_char       : /[-:_a-z.0-9]/i
tag            : tag_start_char tag_char(s?)

line_body    : /.*/
plain_text   : backslash ('%' | '!' | '.' | '#' | '-' | '/' | '=' | '&' | ':' | '~') /.*/ | /.*/
backslash    : '\\'

The problem is in the block definition. As written, it does not capture any of the text of the snippet above, though it does correctly capture the following, where both lines are at the same indentation level:

-# haml comment
%p a paragraph

If I remove the second reject line from the above (the one on the first block rule) then it does capture everything, but of course incorrectly grouped since the first block will slurp all lines, irrespective of indentation.

I've also tried using lookahead actions to inspect $text and a few other approaches with no luck.

Can anyone (a) explain why the above doesn't work and/or (b) suggest an approach that doesn't use Perl actions/rejects? I tried capturing the number of spaces in the indent and then using that count in an interpolated lookahead condition on the number of spaces in the next line, but I could never quite get the interpolation syntax right (since it requires an arrow operator).

There is 1 answer below.

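You're far better off doing some of the work outside of P::RD: track the indentation yourself with a stack, and hand the parser only the content of each line.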

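# Stack of [ indent, siblings ] pairs. The sentinel level (indent -1) holds a
# dummy node whose children become the document's top-level nodes.
my @stack = ( [ -1, [ {} ] ] );
while (<>) {
   chomp;
   s/^( *)//;                  # strip the leading spaces...
   my $indent = length($1);    # ...and remember how many there were

   if ($indent < $stack[-1][0]) {
      # Dedent: close blocks until we are back at a level already seen.
      pop @stack while $indent < $stack[-1][0];
      die "Indent mismatch\n" if $indent != $stack[-1][0];
   }
   elsif ($indent > $stack[-1][0]) {
      # Indent: open a block under the most recent node at the current level.
      my $children = $stack[-1][1][-1]{children} = [];
      push @stack, [ $indent, $children ];
   }

   # The parser only ever sees a single line with its indentation removed.
   push @{ $stack[-1][1] }, $parser->parse_line($_);
}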

die "Empty document\n" if !$stack[0][1][0]{children};
die "Multiple roots\n" if @{ $stack[0][1][0]{children} } > 1;

my $root = $stack[0][1][0]{children}[0];

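$parser->parse_line($_) is expected to return a hash ref describing the line (for example, the result of a parse_line rule in a P::RD grammar that matches one de-indented line). The loop stores any more deeply indented lines that follow under that hash's children key.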
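
To try the stack logic without a grammar in place, here is a minimal self-contained sketch. The parse_line sub is a hypothetical stand-in for $parser->parse_line (it only understands "%tag text" lines and plain text), and build_tree and dump_tree are names made up for this example. Skipping blank lines is also an assumption; the loop above does not do that.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $parser->parse_line: handles only "%tag text"
# lines and plain text, and returns a hash ref as the loop requires.
sub parse_line {
    my ($line) = @_;
    if ($line =~ /^%([:_a-z][-:_a-z.0-9]*)\s*(.*)$/i) {
        return { tag => $1, text => $2 };
    }
    return { text => $line };
}

# Build a tree from indented lines using a stack of [ indent, siblings ] pairs.
sub build_tree {
    my @lines = @_;
    my @stack = ( [ -1, [ {} ] ] );    # sentinel level holding a dummy parent

    for my $line (@lines) {
        next if $line !~ /\S/;         # assumption: blank lines carry no structure
        my ($spaces, $body) = $line =~ /^( *)(.*)$/;
        my $indent = length $spaces;

        if ($indent < $stack[-1][0]) {
            # Dedent: close blocks until we reach a level already seen.
            pop @stack while $indent < $stack[-1][0];
            die "Indent mismatch\n" if $indent != $stack[-1][0];
        }
        elsif ($indent > $stack[-1][0]) {
            # Indent: open a block under the most recent node.
            my $children = $stack[-1][1][-1]{children} = [];
            push @stack, [ $indent, $children ];
        }

        push @{ $stack[-1][1] }, parse_line($body);
    }

    my $roots = $stack[0][1][0]{children};
    die "Empty document\n" if !$roots;
    die "Multiple roots\n" if @$roots > 1;
    return $roots->[0];
}

# Render the tree as an indented outline, one node per line.
sub dump_tree {
    my ($node, $depth) = @_;
    my $out = ('  ' x $depth) . ($node->{tag} // '(text)') . " [$node->{text}]\n";
    $out .= dump_tree($_, $depth + 1) for @{ $node->{children} || [] };
    return $out;
}

my $root = build_tree(split /\n/, "%p\n  %span foo\n");
print dump_tree($root, 0);
```

For the two-line snippet from the question, the span node ends up in the children list of the p node. Swapping the stand-in for a real P::RD rule should only require changing the parse_line call, since the grammar no longer needs to know anything about indentation.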