Finding URL Instances Using Regex (Rubular)


I'm finding it really difficult to write a regex (Rubular syntax) that our crawler can use to pull all the URLs that end with the word 'download'. Could you please help? Thanks so much!

Here are the URLs to match

https://www.example.com/folder1/download
https://www.example.com/folder1/download/
https://www.example.com/folder1/folder2/download?cmp=abc

Notes:
i. There can be more than one folder before the ending word.
ii. The ending word can be followed by a query string or a forward slash.
iii. The URLs are mostly relative. Ideally the regex would also match absolute URLs, URLs with no protocol specified, and URLs with or without the www part.

Ex.
<a href="/product-category/product-name/download">Download Tool</a>
Or
<a href="https://www.example.com/product-category/product-name/download">Download Tool</a>
Or
<a href="http://www.example.com/product-category/product-name/download">Download Tool</a>
Or
<a href="www.example.com/product-category/product-name/download">Download Tool</a>
Or
<a href="example.com/product-category/product-name/download">Download Tool</a>

Although most of the above would end in a 301 redirect or would not count as valid URLs, it would still be great to catch such anomalies as part of this crawl.

Crawler background:
This is the regex setting tab - https://www.screencast.com/t/LJsKobubg3
This is one of the custom crawls I managed to run in the past using regex, with the help of the Dev team (which is unreachable now) - https://www.screencast.com/t/9mT2pSoP7sI
This is how the end result would look - https://www.screencast.com/t/MC5MNaJXi

The end result is a spreadsheet that shows all the source pages + URL matches.

I was given the regex linked here, but it doesn't match relative URLs, and it pulls all the surrounding HTML text into the end result report, not only the URL: https://regex101.com/r/5nHp8s/1
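
For illustration, here is a minimal sketch of a pattern that would satisfy the notes above. It is not the regex from the regex101 link. It assumes the URLs sit inside quoted href attributes, as in the examples, and it is checked here with Python's re module rather than in the crawler itself. The names PATTERN_GROUP, PATTERN_LOOKAROUND and find_download_urls are made up for this example. How the crawler reports a match is unknown, so two variants are shown: one that puts the URL in capture group 1, and one that uses lookarounds so that the whole match is only the URL.

```python
import re

# Variant 1 (hypothetical): the URL is returned in capture group 1.
#   href=["']           opening of the href attribute
#   (?:https?://)?      optional protocol
#   [^"'\s?#]*          optional host and any number of folders
#   /download/?         last path segment, optional trailing slash
#   (?:\?[^"'\s]*)?     optional query string
#   ["']                closing quote
PATTERN_GROUP = r'''href=["']((?:https?://)?[^"'\s?#]*/download/?(?:\?[^"'\s]*)?)["']'''

# Variant 2 (hypothetical): same logic, but href=" and the closing quote are
# only looked at, not consumed, so the full match is the URL alone.
PATTERN_LOOKAROUND = r'''(?<=href=["'])(?:https?://)?[^"'\s?#]*/download/?(?:\?[^"'\s]*)?(?=["'])'''


def find_download_urls(html, pattern=PATTERN_GROUP):
    """Return every href value in html whose path ends in /download."""
    # With one capture group findall returns group 1; with none it
    # returns the full match. Either way the result is a list of URLs.
    return re.findall(pattern, html)


if __name__ == "__main__":
    sample = '''
    <a href="/product-category/product-name/download">Download Tool</a>
    <a href="https://www.example.com/folder1/download/">Download Tool</a>
    <a href="www.example.com/folder1/folder2/download?cmp=abc">Download Tool</a>
    <a href="/product-category/downloads">Not a match</a>
    <a href="/download/file.zip">Not a match</a>
    '''
    for url in find_download_urls(sample):
        print(url)
```

Both variants are case-sensitive and skip URLs that carry a #fragment after 'download'. Neither uses the ^ or $ anchors, which behave differently between regex flavours, but the pattern would still need to be verified in Rubular and in the crawler before relying on it.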

Once again thanks so much for helping me.
