I am using linkchecker to crawl the UK government website, map the relations between hyperlinks, and output to a GML file.
I do not want to include URLs of images, for example any URL that contains a jpeg or png file format reference (e.g. "www.gov.uk/somefile.jpeg").
I have tried for hours to achieve this using the --ignore-url command line parameter and various regular expressions. Here is my final attempt before giving up:
linkchecker --ignore-url='(png|jpg|jpeg|gif|tiff|bmp|svg|js)$' -r1 --verbose --no-warnings -ogml/utf_8 --file-output=gml/utf_8/www.gov.uk_RECURSION_1_LEVEL_NO_IMAGES.gml https://www.gov.uk
Could anyone please advise if this is possible, and if so suggest a solution?
Trivia:
According to docs:
So we can easily check your regex with Python to see why it doesn't work (live test):
Output:
I think the problem is that the regex matches only part of the URL, so let's try a pattern that matches the whole URL (pattern, live test):
...and output is:
Solution:
As you can see, in your attempt the URLs do not match the given regular expression, so they are not ignored. The only things that match that regex are the listed extensions themselves (png, jpg, ...).
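The difference can be reproduced with a minimal Python sketch. It assumes the pattern is applied with `re.match`, which only succeeds when the match starts at the first character of the string; whether linkchecker applies `--ignore-url` patterns exactly this way is an assumption, not something confirmed here. The sample URLs and the helper name `anchored_match` are made up for illustration.

```python
import re

# Pattern from the question, and the same pattern with a leading ".*".
ORIGINAL = r"(png|jpg|jpeg|gif|tiff|bmp|svg|js)$"
PREFIXED = r".*(png|jpg|jpeg|gif|tiff|bmp|svg|js)$"

# Hypothetical sample URLs, for illustration only.
urls = [
    "https://www.gov.uk/somefile.jpeg",
    "https://www.gov.uk/browse/benefits",
]


def anchored_match(pattern, url):
    """Return True if pattern matches starting at the first character of url."""
    return re.match(pattern, url) is not None


for url in urls:
    # Prints the URL, then the result for each pattern.
    print(url, anchored_match(ORIGINAL, url), anchored_match(PREFIXED, url))
```

For the image URL, the original pattern gives False and the prefixed one gives True; the non-image URL gives False for both. Note that `re.search(ORIGINAL, url)` would find the trailing `jpeg`, which is why the pattern looks correct at first glance.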
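To overcome this problem, match all characters before the extension with .* (any run of characters).
Another possible problem is the enclosing quotes: if your shell does not strip them (cmd.exe on Windows, for instance, passes single quotes through literally), they become part of the pattern. From the docs' examples: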
So your final option is:
Hope it helps!