REGEXP: capture group NOT followed by


I need to match the following statements:

Hi there John
Hi there John Doe (jdo)

Without matching these:

Hi there John Doe is here 
Hi there John is here

So I figured that this regexp would work:

^Hi there (.*)(?! is here)$

But it does not, and I am not sure why. I believed this might be caused by the capturing group (.*), so I thought that making the * operator lazy would solve the problem, but no. This regexp doesn't work either:

^Hi there (.*?)(?! is here)$

Can anyone point me in the direction of a solution?

Solution

To retrieve a sentence that does not end with " is here" (like Hi there John Doe (the second)), use the following (credit: @Thorbear):

^Hi there (.*$)(?<! is here)

And for a sentence that has the data in the middle (like Hi there John Doe (the second) is here, where John Doe (the second) is the desired data), simple grouping suffices:

^Hi there (.*?) is here$
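As a quick check of the grouping pattern above, here is a minimal Python sketch. The choice of Python and the variable names are only for illustration; the pattern is the one from this post.

```python
import re

# Pattern from the post: lazily capture whatever sits between
# "Hi there " and a trailing " is here".
extract = re.compile(r"^Hi there (.*?) is here$")

m = extract.match("Hi there John Doe (the second) is here")
name = m.group(1) if m else None

# A sentence without the trailing " is here" does not match at all.
no_suffix = extract.match("Hi there John Doe (the second)")

print(name)       # John Doe (the second)
print(no_suffix)  # None
```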

Everyone, thank you for your replies.

Best answer

The .* will find a match whether it is greedy or lazy: the $ forces the group to extend to the end of the line, and at that position nothing follows, so the negative lookahead (?! is here) always succeeds.

A solution to this could be to use a lookbehind instead, checking from the end of the line whether the preceding characters are " is here".

^Hi there (.*)(?<! is here)$
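To make the difference concrete, the following sketch runs the lookahead pattern from the question and the lookbehind pattern above against the four sample lines. Python is used only for illustration, and the helper name matching is made up for this example.

```python
import re

samples = [
    "Hi there John",
    "Hi there John Doe (jdo)",
    "Hi there John Doe is here",
    "Hi there John is here",
]

# Pattern from the question: the lookahead is tested at the end of the
# line, where nothing follows, so it can never fail.
lookahead = re.compile(r"^Hi there (.*)(?! is here)$")

# Pattern from this answer: the lookbehind is tested at the end of the
# line and inspects the 8 characters that precede it.
lookbehind = re.compile(r"^Hi there (.*)(?<! is here)$")

def matching(pattern, lines):
    """Return the lines that the pattern matches."""
    return [line for line in lines if pattern.match(line)]

lookahead_hits = matching(lookahead, samples)
lookbehind_hits = matching(lookbehind, samples)

print(lookahead_hits)   # all four lines
print(lookbehind_hits)  # only the first two lines
```

Note that Python's re module requires a fixed-width lookbehind, which " is here" satisfies.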

Edit

As suggested by Alan Moore, changing the pattern to ^Hi there (.*$)(?<! is here) can improve its performance: the capturing group gobbles up the rest of the string and confirms the end of the line before the lookbehind is attempted, which saves you unnecessary backtracking.

Another answer

It's not entirely clear from your example if you want to prevent " is here" from occurring anywhere or just at the end of a line. If it should not occur anywhere, try this:

^Hi there ((?! is here).)*$

Before each character, it checks to see that the next characters are not " is here".
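A small Python sketch of this "not anywhere" pattern follows. One caveat: a repeated capturing group keeps only its last iteration, so the pattern exactly as written captures just the final character. The variant below wraps the repetition in an outer group and makes the inner one non-capturing; that rewrite is an adaptation for illustration, not part of the original answer.

```python
import re

# The pattern as written: group 1 is overwritten on every repetition.
as_written = re.compile(r"^Hi there ((?! is here).)*$")

# Adapted variant: the outer group captures the whole name.
anywhere = re.compile(r"^Hi there ((?:(?! is here).)*)$")

def captured_name(line):
    """Return the captured name, or None if " is here" occurs anywhere."""
    m = anywhere.match(line)
    return m.group(1) if m else None

last_char = as_written.match("Hi there John").group(1)

print(last_char)                                     # n
print(captured_name("Hi there John"))                # John
print(captured_name("Hi there John is here"))        # None
print(captured_name("Hi there John is here today"))  # None
```

The last line is the case where this approach and the lookbehind differ: a lookbehind anchored at the end of the line would accept "Hi there John is here today".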

Alternatively, if you only want to exclude it if it occurs at the very end of a line, you could use a negative lookbehind as Thorbear suggested:

^Hi there (.*)(?<! is here)$ 

As for why your expression matched all of the input lines: .* matched everything up to the end of the line, and the lookahead (?! is here) placed just before $ always succeeds, because " is here" can never occur after the end of a line (nothing is there).

Another answer

You don't need to solve the whole problem inside a single regex; you merely need a regex that tells you whether the unwanted form matches. Of course, if you already know this and are simply looking to learn about lookaheads/lookbehinds, you can disregard the rest of this answer.

If you take the regex you don't want your input strings to match:

badregex = (Hi there (.*)(is here))

This will give you a match for

Hi there John is here

So you can just put the logic at the application level, where it belongs (logic in regexes is a bad, bad thing). Here is a bit of pseudocode (I can't be bothered to write out the Java right now, but you get the idea):

// goodregex is assumed to be the plain pattern, e.g. Hi there (.*)
if (badregex.exactMatch(your_str)) {
   discardString();
   return;
}
if (goodregex.exactMatch(your_str)) {
   doStuff(your_str);
}
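The same idea as a runnable sketch, written in Python rather than Java purely to keep it short. The names BAD, GOOD and extract_name are hypothetical, and GOOD is an assumed "plain" pattern, since the answer does not spell out what goodregex is.

```python
import re

# The unwanted form: anything that ends in " is here".
BAD = re.compile(r"Hi there (.*) is here")

# Assumed pattern for the wanted form (not given in the answer).
GOOD = re.compile(r"Hi there (.*)")

def extract_name(line):
    """Return the name from a wanted line, or None if the line is rejected."""
    # Reject first, at application level, instead of using a lookaround.
    if BAD.fullmatch(line):
        return None
    m = GOOD.fullmatch(line)
    return m.group(1) if m else None

print(extract_name("Hi there John"))            # John
print(extract_name("Hi there John Doe (jdo)"))  # John Doe (jdo)
print(extract_name("Hi there John is here"))    # None
```

Here fullmatch plays the role of exactMatch in the pseudocode: the whole string has to match, so no explicit ^ and $ anchors are needed.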