Negative Lookahead & Lookbehind with Capture Groups and Word Boundaries


We are auto-formatting hyperlinks in a message composer but would like to avoid matching links that are already formatted.

Attempt: Build a regex that uses a negative lookbehind and negative lookahead to exclude matches where the link is surrounded by href=" and ".

Problem: Negative lookbehind/lookahead are not working with our regex:

Regex:

/(?<!href=")(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9@:%._+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_+.~#?&\/\/=;]*)(?!")/g

Usage:

html.match(/(?<!")(http(s)?:\/\/.)?(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=;]*)(?!")/g);

When testing, we noticed that swapping the negative lookahead/lookbehind for the positive versions makes the regex behave as expected. So only the negative lookbehind/lookahead fail to work.

Does anyone know why these negative lookbehind/lookaheads are not functioning with this regex?

Thank you!


There are 2 best solutions below

willbeing On BEST ANSWER

With @Barmar's help in the question comments, it is clear that the problem lies in the optional beginning and end of the regex.

"Basically, anything that allows something to be optional next to a negative lookaround may negate the effect of the lookaround, if it can find a shorter match that isn't next to it. "

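To make the quoted explanation concrete, here is a minimal sketch in JavaScript. The pattern `naive` is a simplified stand-in for the regex in the question (shorter character classes, chosen only for illustration), and the sample strings are made up. It assumes an engine with lookbehind support.

```javascript
// Simplified stand-in for the question's pattern (hypothetical, shorter classes).
const naive = /(?<!href=")(?:https?:\/\/)?(?:www\.)?[a-z0-9]+\.[a-z]{2,6}\b/g;
const html = '<a href="https://example.com">';

// The lookbehind rejects the position right after href=", but the optional
// scheme can simply be skipped, so a match starts further to the right.
const prefixSkipped = html.match(naive);
console.log(prefixSkipped); // [ 'example.com' ]

// Same effect at the end: the trailing class gives back one character,
// so the character after the match is no longer a double quote.
const tail = /[a-z]+\.[a-z]{2,6}\b[\/a-z]*(?!")/;
const shortened = 'example.com/path"'.match(tail)[0];
console.log(shortened); // example.com/pat
```

In both cases the negative lookaround does its job at one position, and the engine then finds another way to match that avoids it.
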
sln On

If you are using a modern JS engine that supports variable-length lookbehind assertions, you can put a non-greedy, variable-length pattern inside the lookbehind.

This lets the regex keep optional beginnings like the ones you have.

/(?<!href="[^"]*?)(?:https?:\/\/.)?(?:www\.)?[a-zA-Z0-9#%+\-.:=@_~]{2,256}\.[a-z]{2,6}\b[a-zA-Z0-9#%&+\--\/:;=?@_~]*(?!")/

https://regex101.com/r/OdJyZf/1

 (?<! href=" [^"]*? )
 (?: https?:// . )?
 (?: www \. )?
 [a-zA-Z0-9#%+\-.:=@_~]{2,256} \. [a-z]{2,6} \b [a-zA-Z0-9#%&+\--/:;=?@_~]* 
 (?! " )
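A quick way to check the pattern above is to run it against a message that contains one formatted link and one bare link. This is only a sketch: the sample message is invented, the `g` flag is added so that `match` returns every hit, and it assumes an engine with variable-length lookbehind.

```javascript
// The regex from this answer, with the g flag added for String.prototype.match.
const unlinked = /(?<!href="[^"]*?)(?:https?:\/\/.)?(?:www\.)?[a-zA-Z0-9#%+\-.:=@_~]{2,256}\.[a-z]{2,6}\b[a-zA-Z0-9#%&+\--\/:;=?@_~]*(?!")/g;

// Hypothetical composer content: one link already wrapped in href="...", one bare.
const message = '<a href="https://example.com">link</a> and www.test.org';

// Every start position inside the href value is rejected by the lookbehind,
// so only the bare link is reported.
const found = message.match(unlinked);
console.log(found); // [ 'www.test.org' ]
```
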

I must make a correction. In my comments I said that the word boundary \b in [a-z]{2,6}\b[a-zA-Z0-9#%&+\--/:;=?@_~] effectively removes the word characters (\w) from the class that follows it.

That is true, but only for the first character after the boundary. The characters after that one can still be word characters, so they need to stay in the class. It's a clear example of overthinking something that doesn't need it.

The whole regex could probably be rewritten using \w in the classes, unless ASCII-only matching is required.

Note that this only works in newer JS engines and in C# (of course), because it relies on variable-length lookbehind.
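As a sketch of the \w rewrite suggested above: in JavaScript, \w is [A-Za-z0-9_], so folding the letters, digits and underscore into \w should not change what the classes accept. In engines where \w is Unicode-aware by default (.NET, for example), the two versions would differ, which is the "unless ASCII is required" caveat. The variable names and sample strings below are made up for illustration.

```javascript
// Original classes from the answer, g flag added.
const explicitClasses = /(?<!href="[^"]*?)(?:https?:\/\/.)?(?:www\.)?[a-zA-Z0-9#%+\-.:=@_~]{2,256}\.[a-z]{2,6}\b[a-zA-Z0-9#%&+\--\/:;=?@_~]*(?!")/g;

// Same pattern with a-zA-Z0-9_ folded into \w; the range \--\/ is spelled out as - . /
const wordClasses = /(?<!href="[^"]*?)(?:https?:\/\/.)?(?:www\.)?[\w#%+\-.:=@~]{2,256}\.[a-z]{2,6}\b[\w#%&+\-.\/:;=?@~]*(?!")/g;

// Hypothetical inputs for comparing the two patterns.
const samples = [
  'Visit www.test.org or <a href="https://example.com">link</a>',
  'See http://foo_bar.example.io/a_b?x=1 for details',
  'plain text without links',
];

// Both patterns should return identical results for every sample.
const allSame = samples.every(
  (s) => JSON.stringify(s.match(explicitClasses)) === JSON.stringify(s.match(wordClasses))
);
console.log(allSame); // true
```
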