How do you find an overlapping match with a variable-length prefix using regex?


I am trying to match tags like TODO inside comments in some code files using a regular expression. Consider for example the following file:

foo bar # TODO
bar foo quux   # TODO bar # TODO foo quux
quux # foo ## TODO foo # bar #  TODO quux
'# TODO\'' # TODO

Note that there may be multiple tags in one line as long as each one is preceded by #, so lines two and three should each match twice. Furthermore, the prefix before the first # (the actual code) may have arbitrary length; the same applies to what comes after each TODO. Apart from that, there might be substrings like # TODO that are not comments (see line four; it should match once, on the # TODO at the end).

I have been searching here on Stack Overflow and on other sites, but nothing seemed to address a problem where you have multiple overlapping matches and a variable-length prefix before those matches. I assume that the problem lies mainly in trying to use positive lookaheads/lookbehinds in conjunction with a context:

  • (?=#\s*TODO[^#]*) does not work since it matches line four twice (once inside the string literal and once for the real comment). This is why I say overlapping: it seems that you have to take the structure of the prefix into account when matching.
  • I can match the prefix (actual code and comments without a tag) with ^[^#']*('[^'\\]*(\\.[^'\\]*)*'[^#']*)*(#(?!\s*TODO)[^#]*)* so that I get line four right, but this is a variable-length match, so using a positive lookbehind like (?<=^[^#']*('[^'\\]*(\\.[^'\\]*)*'[^#']*)*(#(?!\s*TODO)[^#]*)*)(#\s*TODO[^#]*) will result in an error on every regex engine as far as I know (and, even if it worked, would only match the first # TODO anyway).
  • Matching the prefix and then using a positive lookahead like ^[^#']*('[^'\\]*(\\.[^'\\]*)*'[^#']*)*(#(?!\s*TODO)[^#]*)*(?=(#\s*TODO[^#]*)(#(?!\s*TODO)[^#]*)*) does not work either, since the pattern is anchored at ^ and therefore matches only the first occurrence of # TODO in a line.
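To make the first bullet concrete, here is a minimal sketch in Python's re module (chosen only for illustration; the names LINES, NAIVE and naive_counts are made up for this example). It counts the positions at which the context-free lookahead succeeds on each of the four sample lines:

```python
import re

# The four sample lines from the question (raw strings keep the backslash).
LINES = [
    r"foo bar # TODO",
    r"bar foo quux   # TODO bar # TODO foo quux",
    r"quux # foo ## TODO foo # bar #  TODO quux",
    r"'# TODO\'' # TODO",
]

# The context-free attempt: a zero-width lookahead with no notion of strings.
NAIVE = re.compile(r"(?=#\s*TODO[^#]*)")

# Number of positions per line at which the lookahead succeeds.
naive_counts = [len(NAIVE.findall(line)) for line in LINES]
print(naive_counts)  # [1, 2, 2, 2]
```

The desired counts are 1, 2, 2, 1; the extra hit on line four is the # TODO inside the string literal.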

To explain: \\. matches an escaped character and [^'\\]* matches anything that is neither an escape character nor a string delimiter, so '[^'\\]*(\\.[^'\\]*)*' matches any single-quoted string literal. Using [^#']* outside of the string literal part means: match anything that does not start a string or a comment, so the code part of a line is ^[^#']*('[^'\\]*(\\.[^'\\]*)*'[^#']*)*. A comment segment that does not contain a tag can be found with #(?!\s*TODO)[^#]*; the \s* has to sit inside the lookahead, because in #\s*(?!TODO)[^#]* it can backtrack to match nothing, after which the lookahead sees a space instead of TODO and the segment matches # TODO as well. The whole prefix can therefore be matched with ^[^#']*('[^'\\]*(\\.[^'\\]*)*'[^#']*)*(#(?!\s*TODO)[^#]*)*.
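The building blocks described above can be assembled and checked piece by piece. The following is only a sketch using Python's re module; the names STRING, CODE, PLAIN and PREFIX are illustrative, the groups are written as non-capturing, and the comment segment is written with \s* inside the negative lookahead. The last few lines check how the position of \s* relative to that lookahead changes what the segment accepts on a backtracking engine.

```python
import re

# Building blocks (the Python names are illustrative, not from the question).
STRING = r"'[^'\\]*(?:\\.[^'\\]*)*'"      # one single-quoted string literal
CODE = rf"[^#']*(?:{STRING}[^#']*)*"      # code: no '#' outside string literals
PLAIN = r"#(?!\s*TODO)[^#]*"              # comment segment without a tag
PREFIX = re.compile(rf"^{CODE}(?:{PLAIN})*")

LINES = [
    r"foo bar # TODO",
    r"bar foo quux   # TODO bar # TODO foo quux",
    r"quux # foo ## TODO foo # bar #  TODO quux",
    r"'# TODO\'' # TODO",
]

# What remains of each line once the prefix has been consumed.
rests = [line[PREFIX.match(line).end():] for line in LINES]
for rest in rests:
    print(rest)

# With \s* outside the lookahead, \s* can give back the space and the
# lookahead then sees " TODO", so the segment accepts a tagged comment.
loose = re.fullmatch(r"#\s*(?!TODO)[^#]*", "# TODO")
strict = re.fullmatch(r"#(?!\s*TODO)[^#]*", "# TODO")
print(loose is not None, strict is None)  # True True
```

On all four sample lines the remainder starts at the first real # TODO, including line four, where the quoted # TODO is consumed as part of the string literal.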

I use ripgrep, so this applies to PCRE/PCRE2 regular expressions. I would, however, be interested in whether there is a solution in any regex dialect.

I know that I can match each line that has at least one correct match and post-process the results in some scripting language to extract each TODO from the lines, but I would like to know if it is possible to do this regex-only.
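For comparison, the two-step fallback mentioned above might look roughly like this in Python. This is only a sketch of the post-processing idea, not a regex-only answer; the function name todo_tags and the constants CODE and TAG are hypothetical. It first skips the code part (including string literals) and then collects every tagged segment from the remaining comment text:

```python
import re

# Code part of a line: anything up to the first '#' outside a string literal.
CODE = re.compile(r"[^#']*(?:'[^'\\]*(?:\\.[^'\\]*)*'[^#']*)*")
# A tagged comment segment: '#', optional whitespace, TODO, text up to next '#'.
TAG = re.compile(r"#\s*TODO[^#]*")

def todo_tags(line):
    """Return every '# TODO ...' segment found in the comment part of line."""
    # CODE can match the empty string, so match() never returns None here.
    comment = line[CODE.match(line).end():]
    return [m.group().rstrip() for m in TAG.finditer(comment)]

if __name__ == "__main__":
    for sample in [
        r"foo bar # TODO",
        r"bar foo quux   # TODO bar # TODO foo quux",
        r"quux # foo ## TODO foo # bar #  TODO quux",
        r"'# TODO\'' # TODO",
    ]:
        print(todo_tags(sample))
```

Because everything after the first unquoted # is treated as comment text, the second regex no longer needs any knowledge of the prefix, which is exactly what the single-regex attempts struggle with.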
