Regular expression negative lookbehind


I want a regex pattern that matches all addresses ending in 4 or more digits, unless those digits come directly after 'APT', 'BOX', 'APT ', or 'BOX '. So it should match the following cases:

HITME 1234
HITME 12345
HITME1234

but not the following cases:

BOX 1234
BOX 12345
BOX4044
APT 1234
APT 12345
NONHIT123
NONHIT 123

I came up with this pattern:

(?<!(APT |BOX ))([0-9]{4,})$

but it does not work correctly. It still matches some of the cases that should be rejected.


Gaberocksall On BEST ANSWER

TL;DR use ^(?!APT|BOX).*?([0-9]{4,})$


Your regex (?<!(APT |BOX ))([0-9]{4,})$ incorrectly matches:

  • BOX 12345 on 2345, because those digits are not preceded by 'APT ' or 'BOX ' (each with a trailing space). Instead, they are preceded by 'BOX 1'.
  • BOX4044 on 4044, because those digits are not preceded by 'APT ' or 'BOX ' either. Instead, they are preceded by 'BOX' with no space.
  • APT 12345 on 2345, for the same reason as BOX 12345.
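
The false positives listed above can be reproduced with a short script. This is only a sketch and it assumes Python's re module, since the question does not say which regex engine is in use; the names ORIGINAL, CASES and matched_digits are made up for the illustration.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(r"(?<!(APT |BOX ))([0-9]{4,})$")

# Inputs that the question says should all be rejected.
CASES = ["BOX 1234", "BOX 12345", "BOX4044", "APT 1234", "APT 12345"]


def matched_digits(text):
    """Return the digits the original pattern matches in text, or None."""
    m = ORIGINAL.search(text)
    return m.group(0) if m else None


for text in CASES:
    # Anything other than None here is an unwanted match.
    print(f"{text!r}: {matched_digits(text)!r}")
```

Under that assumption, BOX 1234 and APT 1234 are rejected as intended, while the other three inputs still produce a match that starts one position later or right after BOX.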

The regex you're looking for is ^(?!APT|BOX).*?([0-9]{4,})$, which is broken down like so:

  • ^(?!APT|BOX) - the beginning of the string cannot be followed by APT or BOX
  • .*? - a bunch of garbage in the middle of the string, taking as few characters as possible (i.e. HITME in your test cases)
  • ([0-9]{4,})$ - the matched digits at the end of the string
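
As a quick check of the breakdown above, here is a sketch that runs the suggested pattern over the test cases from the question. It assumes Python's re module (the question does not name an engine), and the helper name trailing_digits is hypothetical.

```python
import re

# The pattern suggested in this answer.
SUGGESTED = re.compile(r"^(?!APT|BOX).*?([0-9]{4,})$")

SHOULD_MATCH = ["HITME 1234", "HITME 12345", "HITME1234"]
SHOULD_FAIL = ["BOX 1234", "BOX 12345", "BOX4044", "APT 1234",
               "APT 12345", "NONHIT123", "NONHIT 123"]


def trailing_digits(text):
    """Return the captured trailing digits, or None if text is rejected."""
    m = SUGGESTED.match(text)
    return m.group(1) if m else None


for text in SHOULD_MATCH + SHOULD_FAIL:
    print(f"{text!r}: {trailing_digits(text)!r}")
```

One thing to keep in mind: the lookahead rejects every string that starts with APT or BOX, so an input such as BOXA 1234 is rejected as well.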

XGG On

/(?<!(APT.+|BOX(.+)?))([0-9]{4,})$/gm

The fourth bird On

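You cannot simply add the variants without the trailing space (APT and BOX) to the same alternation in the lookbehind assertion. In engines that need a fixed-width lookbehind (Python's re module, for example), all alternatives in a lookbehind have to be of the same length.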
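
To illustrate the length restriction, here is a small sketch that assumes Python's re module, which is one engine that enforces it; other engines may behave differently. The helper name compiles is hypothetical.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Both alternatives are 4 characters wide, so this is accepted.
print(compiles(r"(?<!APT |BOX )[0-9]{4,}$"))  # True

# 'APT ' is 4 characters wide and 'BOX' is 3, so this is refused.
print(compiles(r"(?<!APT |BOX)[0-9]{4,}$"))   # False
```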

Also note that your pattern would only match digits.

You can add another lookbehind with the other 2 alternatives, and assert that the match starts at a position that does not have a digit directly to the left.

We cannot use a word boundary \b before the digits, because you expect a match for HITME1234.

(?<!\bAPT |\bBOX )(?<!\bAPT|\bBOX)(?<!\d)[0-9]{4,}$
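
Here is a sketch of how the pattern above behaves on the cases from the question. It assumes Python's re module, where each lookbehind is fixed-width because \b takes up no characters; the helper name find_digits is hypothetical.

```python
import re

# Two fixed-width lookbehinds (with and without the trailing space),
# plus (?<!\d) so the match cannot start in the middle of a digit run.
LOOKBEHIND = re.compile(r"(?<!\bAPT |\bBOX )(?<!\bAPT|\bBOX)(?<!\d)[0-9]{4,}$")


def find_digits(text):
    """Return the trailing digits if the pattern matches, else None."""
    m = LOOKBEHIND.search(text)
    return m.group(0) if m else None


for text in ["HITME 1234", "HITME 12345", "HITME1234",
             "BOX 1234", "BOX 12345", "BOX4044",
             "APT 1234", "APT 12345", "NONHIT123", "NONHIT 123"]:
    print(f"{text!r}: {find_digits(text)!r}")
```

Without the (?<!\d) assertion, BOX 12345 would still match on 2345, which is the problem in the original pattern.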


Depending on the allowed characters before the last digits, you could make use of a negative lookahead and word boundaries to get the whole match and still get a match for BOXA 1234

The pattern asserts that from the start of the string there is not BOX or APT followed by optional spaces followed by a digit:

^(?!(?:BOX|APT)(?=[^\S\n]*\d))\w+[^\S\n]*\d{4,}$
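
A sketch of the lookahead variant, again assuming Python's re module; the helper name whole_match is hypothetical. It returns the whole string rather than only the digits.

```python
import re

# Reject strings that start with BOX or APT followed by optional
# horizontal whitespace and a digit; otherwise match the whole line.
LOOKAHEAD = re.compile(r"^(?!(?:BOX|APT)(?=[^\S\n]*\d))\w+[^\S\n]*\d{4,}$")


def whole_match(text):
    """Return the full matched string, or None if text is rejected."""
    m = LOOKAHEAD.match(text)
    return m.group(0) if m else None


for text in ["HITME 1234", "HITME1234", "BOXA 1234",
             "BOX 1234", "BOX4044", "APT 12345", "NONHIT 123"]:
    print(f"{text!r}: {whole_match(text)!r}")
```

Note that BOXA 1234 is accepted here, because BOX is followed by a letter and not by spaces and a digit.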


Cary Swoveland On

You could attempt to match the following regular expression.

^(?:(?:BOX|APT) *\d{4,}|\D*\d{,3}|\D*(\d{4,}))$

Capture group 1 will contain the digits at the end of the string if the string conforms to the requirements.


The expression is an alternation with three parts. The first two match, but do not capture, strings that are not valid. The third matches and captures the digits in a valid string. So the idea is to pay attention only to matches that capture a substring as well. (This is not my crazy invention; it's a well-worn technique that has been used since the Mesozoic Era.)

The expression can be broken down as follows.

^                     # match the beginning of the string
(?:                   # begin a non-capture group
  (?:BOX|APT) *\d{4,} # match 'BOX' or 'APT' followed by 0 or more spaces
                      # followed by 4 or more digits
|                     # or
  \D*\d{,3}           # match 0 or more non-digits, followed by 0 to 3 digits
|                     # or
  \D*                 # match 0 or more non-digits
  (                   # begin capture group 1
    \d{4,}            # match 4 or more digits
  )                   # end capture group 1
)                     # end non-capture group
$                     # match end of string

Note that all variable-length matches ( *, \D*, \d{4,} and \d{,3}) are greedy, meaning they match as many characters as possible.
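
The "match everything, but only trust the capture" idea can be sketched as follows. This assumes Python's re module, which accepts \d{,3} as "0 to 3 digits"; the helper name captured_digits is hypothetical.

```python
import re

# Three alternatives: the first two consume invalid strings without
# capturing, the third captures the trailing digits of a valid string.
TRICK = re.compile(r"^(?:(?:BOX|APT) *\d{4,}|\D*\d{,3}|\D*(\d{4,}))$")


def captured_digits(text):
    """Return group 1 if the valid alternative matched, else None."""
    m = TRICK.match(text)
    # A match with an empty group 1 means an "invalid" alternative matched.
    return m.group(1) if m else None


for text in ["HITME 1234", "HITME 12345", "HITME1234",
             "BOX 1234", "BOX4044", "APT 12345", "NONHIT123"]:
    print(f"{text!r}: {captured_digits(text)!r}")
```

Not every engine reads {,3} as a quantifier (JavaScript, for instance, treats it as literal text), so {0,3} is the safer spelling if the pattern has to be portable.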