Is this kind of regex possible without negative lookahead?


Basically, the regex I'm looking to create should match every google domain except google.com and google.com.au.

So google.org, google.uk or google.com.pk would be a match. I'm working within the limitations of re2, and the best I've been able to come up with is:

google\.([^c][^o][^m]\.?[^a]?[^u]?)

This doesn't work for extended domains like google.com.pk, and it doesn't work if the TLD is two letters long, e.g. .cn instead of .org.

It works if there's no extended domain and the TLD isn't two letters long: google.org matches and google.com doesn't.

Here's the link with test cases: regexr.com/7rbkn

I'm looking for a workaround for negative lookahead, or to find out whether it's possible to accommodate this within a single regex string.

InSync On BEST ANSWER

Sure you can. The pattern will look a bit ugly, but what you are asking for is totally possible.

Let's assume that the input already satisfies the regex google(?:\.[a-z]+)+ (i.e. google followed by at least one domain name) for ease of explanation. If you want more precision, see this answer.

Match a name that is not a given name

The inverse of com would be:

  • A name that is shorter or longer than 3, or
  • A name of length 3 whose:
    • The first character is not c, or
    • The second character is not o, or
    • The third character is not m.

Translate that to regex and we have:

\A                    # This means "at the very start"
(?:
  [a-z]{1,2} |
  [a-z]{4,} |

  [^c.][a-z]{2} |     # Also exclude the dot,
  [a-z][^o.][a-z] |   # otherwise 'google.c.m'
  [a-z]{2}[^m.]       # would not match
)
\z                    # This means "at the very end"

The same applies to au:

\A(?:[a-z]|[a-z]{3,}|[^a.][a-z]|[a-z][^u.])\z
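As a quick sanity check, here is a minimal sketch of the two name-level patterns in Python. Python's re module is only a convenient test harness here (it is not RE2, and it spells the end-of-text anchor \Z rather than \z); the helper names is_not_com and is_not_au are made up for illustration. The patterns themselves use no lookarounds.

```python
import re

# A name that is NOT "com": wrong length, or length 3 with at least one differing letter.
NOT_COM = re.compile(
    r"\A(?:[a-z]{1,2}|[a-z]{4,}|[^c.][a-z]{2}|[a-z][^o.][a-z]|[a-z]{2}[^m.])\Z"
)

# A name that is NOT "au": wrong length, or length 2 with at least one differing letter.
NOT_AU = re.compile(r"\A(?:[a-z]|[a-z]{3,}|[^a.][a-z]|[a-z][^u.])\Z")


def is_not_com(name: str) -> bool:
    """True when `name` is a lowercase name other than 'com'."""
    return NOT_COM.search(name) is not None


def is_not_au(name: str) -> bool:
    """True when `name` is a lowercase name other than 'au'."""
    return NOT_AU.search(name) is not None
```

Names such as org, cn, co and cam pass is_not_com, while com itself is rejected; likewise pk and aus pass is_not_au, while au is rejected.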

Match a hostname that is not a given hostname

There are two cases you want to avoid: google.com and google.com.au. The inverse of that is the union of the following cases:

  • 1 extra name:
    • google.* where * is any name but com
  • 2 extra names:
    • google.*.* where the first * is any name but com, or
    • google.com.* where * is any name but au
  • 3 extra names or more: google.*.*.* ...

Or, a bit more logical:

  • If the first name is not com, it doesn't matter how many names are left.
    • Any hostname following this pattern already differs from our exclude cases by one name.
  • If the first name is com and the second name is not au, the rest of the names are also irrelevant.
    • ...for the exact same reason above.
  • If the first and second names are com and au respectively, then there must be at least one other name, which means there are at least three extra names.
    • ...and if there are three extra names, then we don't need to check the first and the second at all.

That said, we only need three branches. Let <not-com> stand for the inverse of com and <not-au> for the inverse of au; here's what the pattern looks like in pseudo-regex:

\A
(?:
  google\.<not-com>      (?:\.[a-z]+)*   |
  google\.com\.<not-au>  (?:\.[a-z]+)*   |
  google                 (?:\.[a-z]+){3,}
)
\z

See the common parts? We can extract them out:

\A
google
(?:
  \.<not-com>        |
  \.com\.<not-au>    |
  (?:\.[a-z]+){3}
)
(?:\.[a-z]+)*
\z

Insert what we had from section 1, and voilà.

The final pattern

\A
google
(?:
  # google.<anything but com>
  \.
  (?:
    [a-z]{1,2} | [a-z]{4,} |
    [^c.][a-z]{2} |
    [a-z][^o.][a-z] |
    [a-z]{2}[^m.]
  )
|
  # google.com.<anything but au>
  \.com\.
  (?:
    [a-z] | [a-z]{3,} |
    [^a.][a-z] | [a-z][^u.]
  )
|
  # google.*.*.*
  (?:\.[a-z]+){3}
)
(?:\.[a-z]+)*
\z

Try it on regex101.com: PCRE2 with comments, Go, multiline mode.
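For engines that don't accept free-spacing comments, the same pattern can be collapsed onto one line. Below is a rough sketch that assembles it from the two name-level pieces and runs it against the hostnames from the question. It uses Python's re purely as a harness (with \Z in place of \z), and the function name is_allowed_google_host is an invented one; treat it as a check of the logic rather than a statement about how RE2 behaves.

```python
import re

# Name-level building blocks (unanchored, so they can be embedded in the full pattern).
NOT_COM = r"(?:[a-z]{1,2}|[a-z]{4,}|[^c.][a-z]{2}|[a-z][^o.][a-z]|[a-z]{2}[^m.])"
NOT_AU = r"(?:[a-z]|[a-z]{3,}|[^a.][a-z]|[a-z][^u.])"

PATTERN = re.compile(
    "".join(
        [
            r"\Agoogle(?:",
            r"\." + NOT_COM,            # first name is not "com"
            r"|\.com\." + NOT_AU,       # first name is "com", second is not "au"
            r"|(?:\.[a-z]+){3}",        # three or more names
            r")(?:\.[a-z]+)*\Z",        # any remaining names, then end of input
        ]
    )
)


def is_allowed_google_host(host: str) -> bool:
    """True for google.<names> other than google.com and google.com.au.

    Assumes `host` is already lowercase and shaped like google(?:\\.[a-z]+)+.
    """
    return PATTERN.search(host) is not None
```

With this, google.org, google.cn, google.co.uk and google.com.pk are accepted, and only google.com and google.com.au are rejected.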

dawg On

With re2, where lookarounds are not available, you could use a "sacrificial match" as in THIS trick to match what you do not want but capture what you do want.

/(?:\bgoogle\.com$)|(?:\bgoogle\.com\.au$)|(\bgoogle\.\S*)/

if alone on a line or

/(?:\bgoogle\.com\s)|(?:\bgoogle\.com\.au\s)|(\bgoogle\.\S*\s)/

if followed by spaces...

Demo

Limitations:

  • Because /google\.com/ will also match the start of 'google.com.pk', each item to be matched needs to either stand alone on a line or be followed by a space, so that it is obvious whether '.pk' follows. You can't use \b to detect the break between .com and .com.pk since \b is true in both cases.
  • You need to have logic available to detect whether the match resulted in a capture. If one of the "sacrificial" alternatives on the left matches first, the capture group will be empty. If the capture group is not empty, you have a match that is not one of the sacrificial matches.
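The capture-checking logic from the second point could look roughly like this. It is a sketch in Python rather than RE2, using the one-per-line variant of the pattern with multiline mode so that $ matches at each line end; the function name wanted_domains is invented for the example. In Python an unmatched group comes back as None, whereas other bindings may report an empty string, so the check below treats both as "no capture".

```python
import re

# Sacrificial alternatives first (no capture), then the capturing catch-all.
SACRIFICIAL = re.compile(
    r"\bgoogle\.com$|\bgoogle\.com\.au$|(\bgoogle\.\S*)",
    re.MULTILINE,  # let $ match at the end of every line
)


def wanted_domains(text: str) -> list[str]:
    """Return only the matches that filled the capture group."""
    results = []
    for m in SACRIFICIAL.finditer(text):
        captured = m.group(1)
        if captured:  # None or "" means a sacrificial alternative matched
            results.append(captured)
    return results
```

Feeding it the five lines google.com, google.org, google.com.au, google.com.pk and google.cn keeps google.org, google.com.pk and google.cn and silently discards the other two.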