I am trying to match tags like TODO inside comments in some code files using a regular expression. Consider, for example, the following file:
foo bar # TODO
bar foo quux # TODO bar # TODO foo quux
quux # foo ## TODO foo # bar # TODO quux
'# TODO\'' # TODO
Note that there might be multiple tags in one line as long as each one is preceded by #, so lines two and three should match twice. Furthermore, the prefix before the first # (the actual code) may have arbitrary length; the same applies to what comes after each TODO. Apart from that, there might be substrings like # TODO that are not comments (see line four; it should match once, on the # TODO at the end).
I have been searching here on Stack Overflow and on other sites, but nothing seemed to address a problem with multiple overlapping matches and a variable-length prefix before those matches. I assume that the problem lies mainly in trying to use positive lookaheads/lookbehinds in conjunction with a context:
(?=#\s*TODO[^#]*)
does not work since it matches line four twice. This is why I say overlapping: it seems that you have to take the structure of the prefix into account when matching.
- I can match the prefix part (actual code and comments without a tag) with
^[^#']*('[^'\\]*(\\.[^'\\]*)*'[^#']*)*(#\s*(?!TODO)[^#]*)*
so that I get line four right, but this is a variable-length match, so using a positive lookbehind like
(?<=^[^#']*('[^'\\]*(\\.[^'\\]*)*'[^#']*)*(#\s*(?!TODO)[^#]*)*)(#\s*TODO[^#]*)
will result in an error in every regex engine I know of (and even if it worked, it would only match the first # TODO anyway).
- Matching the prefix and then using a positive lookahead like
^[^#']*('[^'\\]*(\\.[^'\\]*)*'[^#']*)*(#\s*(?!TODO)[^#]*)*(?=(#\s*TODO[^#]*)(#\s*(?!TODO)[^#]*)*)
does not work either, since it matches only one occurrence of # TODO.
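To make the two failures concrete, here is a minimal sketch that counts matches per sample line. It uses Python's re module rather than PCRE2, which is an assumption made for illustration; the constructs involved should behave the same way for these patterns. The names SAMPLE, NAIVE, ANCHORED and count_matches are hypothetical helpers, not part of any tool mentioned here.

```python
import re

# The four sample lines from the question; line four contains a
# "# TODO" inside a string literal that must not be counted.
SAMPLE = [
    "foo bar # TODO",
    "bar foo quux # TODO bar # TODO foo quux",
    "quux # foo ## TODO foo # bar # TODO quux",
    r"'# TODO\'' # TODO",
]

# First attempt: a bare lookahead with no knowledge of the prefix.
NAIVE = re.compile(r"(?=#\s*TODO[^#]*)")

# Second attempt: the prefix followed by a lookahead, copied verbatim.
ANCHORED = re.compile(
    r"^[^#']*('[^'\\]*(\\.[^'\\]*)*'[^#']*)*(#\s*(?!TODO)[^#]*)*"
    r"(?=(#\s*TODO[^#]*)(#\s*(?!TODO)[^#]*)*)"
)

def count_matches(pattern):
    """Number of matches the pattern finds on each sample line."""
    return [len(list(pattern.finditer(line))) for line in SAMPLE]

# The counts I am after are [1, 2, 2, 1].
print(count_matches(NAIVE))     # [1, 2, 2, 2]: line four is counted twice
print(count_matches(ANCHORED))  # [1, 1, 1, 1]: the ^ anchor allows one match per line
```

Neither attempt produces the wanted counts: the first ignores the string literal, and the second cannot restart after its first match because it is anchored at the start of the line.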
To explain: \\. matches an escaped character and [^'\\]* matches anything that is neither a backslash (the escape character) nor a string delimiter, so '[^'\\]*(\\.[^'\\]*)*' matches any string literal. Using [^#']* outside of that string literal part means: match anything that does not start a string or a comment, so the code part of a line is
^[^#']*('[^'\\]*(\\.[^'\\]*)*'[^#']*)*
A comment segment that does not contain a tag can be found with #\s*(?!TODO)[^#]* and so the whole prefix can be matched with
^[^#']*('[^'\\]*(\\.[^'\\]*)*'[^#']*)*(#\s*(?!TODO)[^#]*)*
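The pieces described above can be checked one at a time. The following is a rough sketch in Python's re module (an assumption; the target engine is PCRE2), with STRING, CODE, COMMENT and COMMENT_STRICT as hypothetical names. One caveat it exposes: in the comment piece, \s* can backtrack to the empty string, in which case (?!TODO) is tested against the whitespace and succeeds. COMMENT_STRICT, which moves the whitespace inside the lookahead, is a variant that does not appear in the text above and is shown only for comparison.

```python
import re

# Building blocks, copied from the explanation above.
STRING = r"'[^'\\]*(\\.[^'\\]*)*'"          # one single-quoted string literal
CODE = r"^[^#']*(" + STRING + r"[^#']*)*"    # code part of a line
COMMENT = r"#\s*(?!TODO)[^#]*"               # comment segment, as written above

# Assumed variant (not in the original): whitespace inside the lookahead,
# so backtracking \s* cannot sidestep the TODO check.
COMMENT_STRICT = r"#(?!\s*TODO)[^#]*"

line_four = r"'# TODO\'' # TODO"

# The string piece swallows the quoted "# TODO" including the escaped quote.
print(re.fullmatch(STRING, r"'# TODO\''") is not None)          # True

# The code part stops right before the real comment.
print(repr(re.match(CODE, line_four).group(0)))

# As written, the comment piece also accepts a tagged segment ...
print(re.fullmatch(COMMENT, "# TODO foo") is not None)          # True

# ... whereas the assumed stricter variant rejects it.
print(re.fullmatch(COMMENT_STRICT, "# TODO foo") is not None)   # False
print(re.fullmatch(COMMENT_STRICT, "# foo") is not None)        # True
```

In an engine with possessive quantifiers, writing \s*+ in place of \s* should have the same effect as the stricter variant, though that is untested here.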
I use ripgrep, so this applies to PCRE/PCRE2 regular expressions.
I would, however, be interested in whether there is a solution in any regex dialect.
I know that I can match each line that has at least one correct match and post-process the results in some scripting language to extract each TODO from the lines, but I would like to know if it is possible to do this regex-only.
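For reference, the post-processing fallback could look roughly like the sketch below. It is written in Python as one possible scripting language, and it assumes single-quoted strings with backslash escapes and that everything after the first unquoted # is comment text. The names CODE, SEGMENT, TAG and find_tags are hypothetical.

```python
import re

# Code part of a line: plain text and single-quoted string literals.
CODE = re.compile(r"[^#']*(?:'[^'\\]*(?:\\.[^'\\]*)*'[^#']*)*")
# One comment segment: a "#" and everything up to the next "#".
SEGMENT = re.compile(r"#[^#]*")
# A segment counts as tagged if it starts with "#", optional whitespace, TODO.
TAG = re.compile(r"#\s*TODO")

def find_tags(line):
    """Return the comment segments of `line` that start with a TODO tag."""
    # Skip the code part first so that "# TODO" inside strings is ignored.
    start = CODE.match(line).end()
    return [m.group(0) for m in SEGMENT.finditer(line, start)
            if TAG.match(m.group(0))]

SAMPLE = [
    "foo bar # TODO",
    "bar foo quux # TODO bar # TODO foo quux",
    "quux # foo ## TODO foo # bar # TODO quux",
    r"'# TODO\'' # TODO",
]

for line in SAMPLE:
    print(len(find_tags(line)), find_tags(line))
```

On the sample file this yields 1, 2, 2 and 1 tagged segments, which are the counts I want a single regex to produce.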