I have a Python script that searches a page's source and downloads any files linked in it.
However, the script also downloads files that do not exist (dead links).
I did a bit of research and found that this can be avoided by sending a HEAD request, which returns the status code without downloading the file, or something along those lines.
Basically, I want to check whether the server returns 404. If it does, the file doesn't exist and I don't want to download it.
I've found the following code, which seems like it would work, but it needs some alterations to work with my script:
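import httplib
import urllib
# Send a HEAD request and print the status code (404 means the file is missing)
c = httplib.HTTPConnection(<hostname>)
c.request("HEAD", <url>)
print c.getresponse().status
# Existing download call in my script: urlretrieve(url, local filename)
urllib.urlretrieve(test, get)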
<hostname> should equal the website (http://google.com) and <url> should equal the file (/file1.pdf).
I need this code to work when given only the full URL, e.g. http://google.com/file1.pdf.
Is there any way I can do this?
Code was taken from here: How do I check the HTTP status code of an object without downloading it?
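For illustration, here is a minimal sketch of one way the full URL could be split into the host and path that the HEAD request needs. The helper names `split_url`, `head_status`, and `should_download` are hypothetical, and the 10-second timeout is an arbitrary assumption. The import fallback is there only so the same sketch runs on Python 3, where `httplib` and `urlparse` were renamed.

```python
try:
    # Python 2 module names, as used in the snippet above
    import httplib
    from urlparse import urlsplit
except ImportError:
    # Python 3 renamed these modules
    import http.client as httplib
    from urllib.parse import urlsplit


def split_url(url):
    """Split a full URL into (scheme, host, path including any query)."""
    parts = urlsplit(url)
    path = parts.path or "/"  # a bare host still needs a path to request
    if parts.query:
        path += "?" + parts.query
    return parts.scheme, parts.netloc, path


def head_status(url, timeout=10):
    """Send a HEAD request for url and return the HTTP status code."""
    scheme, host, path = split_url(url)
    if scheme == "https":
        conn = httplib.HTTPSConnection(host, timeout=timeout)
    else:
        conn = httplib.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()


def should_download(status):
    """Hypothetical policy: skip the file only when the server says 404."""
    return status != 404
```

With this, the download step might be guarded with something like `if should_download(head_status(url)): urllib.urlretrieve(url, filename)`. Note that `httplib` does not follow redirects, so a 301 or 302 is returned as-is, and some servers reject HEAD requests altogether; a stricter policy, such as downloading only on 200, may suit some scripts better.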