RegEx matching URLs that are NOT in my domain

I am trying to set up my Netscaler device with a Rewrite Policy. One of my requirements is to replace any non-domain URLs with the home page URL... that is, I want the Netscaler to replace all external links on a page being served from behind the device with the home page's URL (ex: https://my.domain.edu). The type of Rewrite Policy I'm trying to configure uses a PCRE-compliant regex engine to find specific text on a web page (multiple matches possible).

good links:

https://your.page.domain.edu -- won't be replaced  
http://good.domain.edu  -- also won't be replaced

bad links (should be replaced with home page URL):

https://www.google.com    
http://not.the.best.example.org   
http://another.bad.example.erewhon.edu   
https://my.domain.com    

I currently have this pattern:

(https?://)(?![\w.-]+\.domain\.edu)

According to the Netscaler's RegEx evaluation tool this matches the bad links above and doesn't match the good links, so it seems to be working... in fact, when I run this on a test page, the Netscaler finds all the URLs I want to replace and leaves the good URLs alone.

The problem is the Netscaler isn't replacing the URLs the way I want: it replaces the (https?://) group with the home page URL but leaves the remaining part of the bad URL. For example, it replaces http://www.google.com with: https://my.domain.eduwww.google.com
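
The behaviour described above can be reproduced outside the Netscaler. The sketch below uses Python's `re` module as a stand-in for the device's PCRE engine (an assumption; the lookahead used here behaves the same way in both). The names `HOME`, `PATTERN` and `rewrite_scheme_only` are made up for illustration. A lookahead is zero-width, so the match covers only the `https?://` group, and that is all that gets replaced:

```python
import re

# Hypothetical names for illustration; only the pattern comes from the question.
HOME = "https://my.domain.edu"
PATTERN = re.compile(r"(https?://)(?![\w.-]+\.domain\.edu)")

def rewrite_scheme_only(text):
    # Replaces whatever the pattern matched, which is just the scheme.
    return PATTERN.sub(HOME, text)

# The negative lookahead checks the host but does not consume it.
print(PATTERN.search("http://www.google.com").group(0))  # http://
print(rewrite_scheme_only("http://www.google.com"))      # https://my.domain.eduwww.google.com
print(rewrite_scheme_only("http://good.domain.edu"))     # http://good.domain.edu
```

To replace the whole URL, the host has to be part of the consumed match rather than only being inspected by a lookahead.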

I can configure the Rewrite Policy to replace specific URLs (for example, https://www.google.com), so I know the mechanism works. Obviously, this won't work for the general case.

I've tried enclosing the entire regex in parentheses, but this didn't change anything.

Can a regular expression be written for the general case, to match the entire URL for all domains that aren't mine?

Thanks in advance for any help!

There are 2 best solutions below

2
On BEST ANSWER

You can use the following regex:

^https?:\/\/[\w.-]+(?<!\.domain\.edu)$

with your home page URL as substitution:

https://my.domain.edu

TEST INPUT:

https://www.google.com
http://not.the.best.example.org
http://another.bad.example.erewhon.edu
https://my.domain.com
https://your.page.domain.edu
http://good.domain.edu

TEST OUTPUT:

https://my.domain.edu
https://my.domain.edu
https://my.domain.edu
https://my.domain.edu
https://your.page.domain.edu
http://good.domain.edu
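
For readers who want to check this locally, here is a minimal sketch of the same substitution in Python's `re` module, used as a stand-in for a PCRE engine (an assumption, not Netscaler configuration). The names `HOME`, `PATTERN`, `URLS` and `replace_external` are hypothetical. The `re.MULTILINE` flag is needed so that `^` and `$` anchor at each line of the test input:

```python
import re

HOME = "https://my.domain.edu"
# Same regex as above; "/" needs no escaping in Python.
PATTERN = re.compile(r"^https?://[\w.-]+(?<!\.domain\.edu)$", re.MULTILINE)

URLS = "\n".join([
    "https://www.google.com",
    "http://not.the.best.example.org",
    "http://another.bad.example.erewhon.edu",
    "https://my.domain.com",
    "https://your.page.domain.edu",
    "http://good.domain.edu",
])

def replace_external(text):
    # The host is consumed by [\w.-]+, so the whole URL is replaced.
    return PATTERN.sub(HOME, text)

print(replace_external(URLS))
```

Because of the `^` and `$` anchors, this only matches when a URL occupies a whole line (or the whole subject string), as in the test input.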

If the original scheme (http or https) should be preserved, then use the following regex:

^(https?:\/\/)[\w.-]+(?<!\.domain\.edu)$

with replacement:

\1my.domain.edu

INPUT:

https://www.google.com
http://not.the.best.example.org
http://another.bad.example.erewhon.edu
https://my.domain.com
https://your.page.domain.edu
http://good.domain.edu

OUTPUT:

https://my.domain.edu
http://my.domain.edu
http://my.domain.edu
https://my.domain.edu
https://your.page.domain.edu
http://good.domain.edu
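
The scheme-preserving variant can be sketched the same way in Python (again only a stand-in for a PCRE engine; `PATTERN` and `replace_keep_scheme` are hypothetical names). Python accepts `\1` in a replacement string, but `\g<1>` is used here so the group reference cannot be confused with the text that follows it:

```python
import re

# Group 1 captures the scheme so it can be reused in the replacement.
PATTERN = re.compile(r"^(https?://)[\w.-]+(?<!\.domain\.edu)$", re.MULTILINE)

def replace_keep_scheme(text):
    # \g<1> is Python's unambiguous spelling of the \1 backreference.
    return PATTERN.sub(r"\g<1>my.domain.edu", text)

print(replace_keep_scheme("http://not.the.best.example.org"))  # http://my.domain.edu
print(replace_keep_scheme("https://my.domain.com"))            # https://my.domain.edu
print(replace_keep_scheme("http://good.domain.edu"))           # http://good.domain.edu
```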

0
On

Look at the raw HTTP payload and make sure the links appear in the actual payload the way you believe they do.

The hostname is usually carried in an HTTP header, and the protocol is very often not included in the page content. Install Fiddler and observe the raw data.

Netscaler RegEx works as intended.

Further, make sure any compressed content is decompressed before you try to rewrite it. Otherwise the Netscaler will try to match your rewrites against the compressed or chunked data.
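
The point about compressed content can be illustrated with a small Python sketch. This is not Netscaler configuration; it only assumes a gzip-encoded response body, and `page`, `body` and `rewrite_body` are hypothetical names. A text rewrite can only find a URL once the body has been decompressed back to text:

```python
import gzip

# A tiny stand-in for a page containing an external link.
page = '<a href="https://www.google.com">search</a>'

# Roughly what a gzip-encoded response body looks like on the wire.
body = gzip.compress(page.encode("utf-8"))

def rewrite_body(compressed, old, new):
    # Decompress first, then rewrite the plain text.
    text = gzip.decompress(compressed).decode("utf-8")
    return text.replace(old, new)

print(rewrite_body(body, "https://www.google.com", "https://my.domain.edu"))
# <a href="https://my.domain.edu">search</a>
```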