Checking that a file exists before downloading it, using HEAD

I have a Python script that searches a page's source and downloads any files it finds linked there.

However, the script also attempts to download files that do not exist (dead links).

I did a bit of research and found that this can be avoided with a HEAD request, which returns the status code without downloading the file, or something along those lines.

Basically, I want to check whether the server returns 404. If it does, the file doesn't exist and I don't want to download it.

I've found the following code, which looks like it would work, but it needs some alterations to fit my script:

import httplib

c = httplib.HTTPConnection(<hostname>)
c.request("HEAD", <url>)
print c.getresponse().status

urllib.urlretrieve(test, get)

<hostname> should equal the website (http://google.com) and <url> should equal the file (/file1.pdf)

I need this code to work given only the full URL, e.g. http://google.com/file1.pdf.

Is there any way I can do this?

Code was taken from here: How do I check the HTTP status code of an object without downloading it?

There are 2 best solutions below

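import httplib
import urllib
from urlparse import urlparse

url = "http://google.com/file1.pdf"

# Split the full URL into the host and the path that the HEAD request needs
parts = urlparse(url)
c = httplib.HTTPConnection(parts.netloc)
c.request("HEAD", parts.path)

# Download only if the server reports 200 (OK); "file1.pdf" is an example local name
if c.getresponse().status == 200:
    urllib.urlretrieve(url, "file1.pdf")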

Above didn't seem to work :(

I managed to resolve it though!

import urllib

# test, get and doc come from the rest of the script
# (apparently the URL, the local save path and the file name)

# Open the URL (a GET request) and store the HTTP status code
status = urllib.urlopen(test).getcode()
print status  # print the status, for testing purposes

if status == 200:  # OK: the file exists
    urllib.urlretrieve(test, get)  # download the file
    print 'The file:', doc, 'has been saved to:', get  # success message
elif status == 404:  # NOT FOUND: dead link
    print 'The file:', doc, 'could not be saved. Does not exist!!'
else:  # any other status: show it as an error
    print 'Unknown Error:', status
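
Note that urlopen sends a GET rather than a HEAD. For anyone who wants the HEAD check starting from just the full URL, here is a rough sketch of how it might look. It assumes Python 3, where httplib is called http.client and URL splitting lives in urllib.parse. The helper names split_url, head_status and should_download are made up for this example, and it only handles http and https URLs without following redirects.

```python
# Python 3 sketch: http.client is the Python 3 name for httplib.
import http.client
from urllib.parse import urlsplit


def split_url(url):
    """Split a full URL into (scheme, host, path-with-query)."""
    parts = urlsplit(url)
    path = parts.path or "/"        # an empty path means the site root
    if parts.query:
        path += "?" + parts.query   # keep any query string on the request
    return parts.scheme, parts.netloc, path


def head_status(url, timeout=10):
    """Send a HEAD request for url and return the HTTP status code."""
    scheme, host, path = split_url(url)
    # Pick the connection class from the scheme (assumes http or https only)
    if scheme == "https":
        conn = http.client.HTTPSConnection(host, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("HEAD", path)
        return conn.getresponse().status
    finally:
        conn.close()


def should_download(status):
    """Treat only 200 (OK) as 'the file exists'; redirects are not followed."""
    return status == 200
```

In the script this would be used roughly as should_download(head_status("http://google.com/file1.pdf")) before calling the download step. Some servers answer HEAD differently from GET (for example with 405), so it may be worth falling back to the GET-based check above when the status is neither 200 nor 404.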