Matching nested constructs in TextMate / Sublime Text / Atom language grammars


While writing a grammar for GitHub to syntax-highlight programs written in Racket, I have run into a problem.

In Racket #| starts a multiline comment and |# ends it.

The problem is that multiline comments can be nested:

  #| a comment  #| still a comment |# even 
                                      more comment |#

Here is my non-working attempt:

repository:
  multilinecomment: 
    begin:         \#\|
    end:           \|\#
    name:          comment
    contentName:   comment
    patterns:
    - include:     "#multilinecomment"
      name:        comment
    - match:       ([^\|]|\|(?=[^#]))*
      name:        comment

The intent of the match patterns is:

  1. "#multilinecomment" A multiline comment can contain another multiline comment.
  2. ([^\|]|\|(?=[^#]))* The meaning of the subexpressions:

     [^\|]        any character that is not a `|`
     \|(?=[^#])   a `|` followed by a non-`#`

The entire expression thus matches a string not containing |#
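
The claim above can be checked outside the editor. This is a minimal sketch that uses Python's `re` module as a stand-in for the Oniguruma engine (an assumption: the two engines agree on syntax this simple). The helper name `matched_prefix` and the sample strings are made up for illustration.

```python
import re

# The catch-all pattern from the grammar above, copied verbatim.
CATCH_ALL = re.compile(r"([^\|]|\|(?=[^#]))*")

def matched_prefix(text: str) -> str:
    """Return the part of `text` that the pattern consumes from the start."""
    return CATCH_ALL.match(text).group(0)

# A lone `|` is accepted, and the match stops right before `|#`.
print(repr(matched_prefix("plain | text |# rest")))  # 'plain | text '

# The match does not stop at an opening `#|`.
print(repr(matched_prefix("a #| b |# c")))           # 'a #| b '
```

So the pattern does stop before the first `|#`, but it also runs straight through an opening `#|`.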

Update:

Got an answer from Allan Odgaard on the TextMate mailing list:

http://textmate.1073791.n5.nabble.com/TextMate-grammars-and-nested-multiline-comments-td28743.html


There are 4 best solutions below

2
On BEST ANSWER

So I've tested a bunch of languages in Sublime that have multiline comments (C/C++, Java, HTML, PHP, JavaScript), and none of the language syntaxes support multiline comments embedded in multiline comments: the comment scope ends at the first "comment close" marker, not at the symmetric one. This isn't to say that it's impossible, because the BracketHighlighter plugin works great for matching symmetric tags, brackets, and other markers. However, it's written in Python and uses custom logic for its matching algorithms, something that may not be available in the Oniguruma engine that powers Sublime's syntax highlighter and, apparently, GitHub's as well.

Basically, from your description of the problem, you need a code parser to ensure that nested comments are legal, something you can't do with just a syntax highlighting definition. If you're writing this just for Sublime, a custom plugin could take care of that, but I don't know enough about GitHub's Linguist syntax highlighting system to say whether you're allowed to do that. I'm not a regex master yet, but it seems to me that it would be rather difficult to achieve this purely by regex, as you'd need to somehow keep track of an arbitrary number of internal symmetric "open" and "close" markers before finding (and identifying!) the final one.

Sorry I couldn't provide a more definitive answer than "I'm not sure this is possible", but that's the best I can come up with without knowing more about Sublime's and GitHub's internals, something that (at least in Sublime's case) won't happen unless it's open-sourced. Good luck!

1
On

Old post, and I don't have the reputation for a comment, but it is emphatically NOT possible to detect arbitrarily nested comments using purely regular expressions. Intuitively, this is because every regular expression can be transformed into a finite state machine, and keeping track of nesting depth requires a (theoretically) infinite amount of state: the machine needs at least one state per possible nesting depth, and the number of possible depths here is unbounded.

In practice the nesting depth stays small, so if you don't want to go to too much trouble you could probably write something that allows nesting up to a reasonable depth. Otherwise you'll probably need a separate phase that parses the source, finds the comments, and tells the syntax highlighter to ignore them.
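
The bounded-depth idea can be sketched by expanding the comment pattern a fixed number of times. The example below uses Python's `re` module purely for illustration. The helper name `nested_comment_regex` and its `max_depth` parameter are hypothetical and not part of any grammar format.

```python
import re

def nested_comment_regex(max_depth: int) -> str:
    """Build a regex for #| ... |# comments nested at most `max_depth` deep."""
    # One character of comment text that starts neither `#|` nor `|#`.
    filler = r"(?:[^#|]|#(?!\|)|\|(?!#))"
    # Depth 1: a comment with no comment inside it.
    pattern = rf"#\|{filler}*\|#"
    # Each extra level allows the previous pattern to appear inside.
    for _ in range(max_depth - 1):
        pattern = rf"#\|(?:{filler}|{pattern})*\|#"
    return pattern

two_levels = "#| a #| b |# c |#"
print(bool(re.fullmatch(nested_comment_regex(2), two_levels)))  # True
print(bool(re.fullmatch(nested_comment_regex(1), two_levels)))  # False
```

The pattern grows with every level and still rejects anything nested deeper than the chosen limit, which is the trade-off the answer describes.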

1
On

You had the correct idea, but it looks like your second pattern also matches the "begin nested comment" sequence #|, so your recursive #multilinecomment pattern never gets a chance to kick in.

All you have to do is replace your second pattern with something similar to

(#(?=[^|])|\|(?=[^#])|[^|#])+
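
A quick way to see the difference is to run the suggested pattern on a nested sample. This sketch again uses Python's `re` as a stand-in for the grammar's regex engine (an assumption). The helper `consumed` and the sample string are illustrative only.

```python
import re

# The suggested replacement pattern, copied verbatim.
FILLER = re.compile(r"(#(?=[^|])|\|(?=[^#])|[^|#])+")

def consumed(text: str, pos: int = 0) -> str:
    """Return what the pattern consumes starting at offset `pos`."""
    m = FILLER.match(text, pos)
    return m.group(0) if m else ""

sample = "a #| b |# c"
print(repr(consumed(sample)))     # 'a '  : stops before the nested `#|`
print(repr(consumed(sample, 4)))  # ' b ' : stops before the closing `|#`
```

Because the match now stops in front of `#|`, a recursive include listed alongside it has a chance to match there.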

0
On

Take the last match out. You do not need it. It's redundant with what TextMate does naturally, which is to put all additional text into the comment scope until the end marker comes along or the pattern recurses into itself.
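
The behaviour described here can be modelled roughly as follows. This is a simplified Python sketch of a begin/end rule that includes itself, not TextMate's actual implementation. The function name `comment_spans` is hypothetical, and the model ignores line-by-line matching and scope names.

```python
def comment_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of each outermost #| ... |# comment."""
    spans = []
    open_starts = []  # one entry per currently open begin/end rule
    i = 0
    while i < len(text):
        if open_starts and text.startswith("|#", i):
            # `end` of the innermost open rule.
            start = open_starts.pop()
            i += 2
            if not open_starts:
                spans.append((start, i))
        elif text.startswith("#|", i):
            # `begin`: either the top-level rule or the recursive include.
            open_starts.append(i)
            i += 2
        else:
            # No catch-all pattern: other text simply stays in the
            # scope of whatever rule is currently open.
            i += 1
    return spans

sample = "x #| a #| b |# c |# y"
print([sample[s:e] for s, e in comment_spans(sample)])
# ['#| a #| b |# c |#']
```

In this model an unterminated comment produces no span. A real highlighter would typically keep the comment scope open to the end of the file instead.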