Downloading and renaming images from multiple URLs with the same file name

I am trying to download images from an archive. I have the image URLs and can successfully download each file using the code below. However, some of the images share the same name (e.g. compressed.jpg), so each download overwrites the last and only one compressed.jpg file is left on disk.

I want to rename these files on download so I end up with compressed1.jpg, compressed2.jpg, and so on. I am very new to Python and am getting myself into a complete mess trying to add incremental numbers to the end of the file names.

Thank you

import requests

image_url = [
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/975/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/105/093/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/984/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/107/697/thumbnail/compressed.jpg'
]
for img in image_url:
    # the last path segment is the file name, e.g. compressed.jpg
    file_name = img.split('/')[-1]
    print("Downloading file: %s" % file_name)
    r = requests.get(img, stream=True)
    with open(file_name, 'wb') as f:
        for chunk in r:
            f.write(chunk)

I have tried using os and glob to rename the files afterwards, but with no luck. How can I rename the files as they are downloaded?

3 Answers

Accepted answer

Add an index to the file name. To get the index from your for loop, use enumerate on the image_url list. Then use os.path.splitext to split the file name into its root and extension, so the index number can be inserted between them.

import requests
import os.path

image_url = [
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/975/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/105/093/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/984/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/107/697/thumbnail/compressed.jpg'
]
for index, img in enumerate(image_url):
    file_name_string = img.split('/')[-1]
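    # os.path.splitext returns a (root, extension) tuple, e.g. ('compressed', '.jpg')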
    file_name_list = os.path.splitext(file_name_string)
    target_file = f"{file_name_list[0]}{index + 1}{file_name_list[1]}"
    print("Downloading file:%s" % target_file)
    r = requests.get(img, stream=True)
    with open(target_file, 'wb') as f:
        for chunk in r:
            f.write(chunk)
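
As a small variation (my tweak, not part of the accepted answer), enumerate accepts a start argument, which removes the index + 1 arithmetic:

# same loop as above, but with 1-based indexing from enumerate itself
for index, img in enumerate(image_url, start=1):
    root, ext = os.path.splitext(img.split('/')[-1])
    target_file = f"{root}{index}{ext}"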

Answer 2

You can maintain a counter for each image and append it to the file name:

import requests
import os

image_url = [
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/975/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/105/093/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/984/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/107/697/thumbnail/compressed.jpg'
]

for i, img in enumerate(image_url, start=1):
    file_name = img.split('/')[-1]

    # Split into base name and extension, e.g. ('compressed', '.jpg')
    base_name, file_extension = os.path.splitext(file_name)

    # Rebuild the file name with an incremental number before the extension
    new_file_name = base_name + str(i) + file_extension

    print("Downloading file: %s" % new_file_name)
    r = requests.get(img, stream=True)

    with open(new_file_name, 'wb') as f:
        for chunk in r:
            f.write(chunk)
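
One refinement worth considering (my addition, not part of this answer): requests responses provide raise_for_status(), which raises an exception on a 4xx/5xx response rather than silently saving an error page to disk. Inside the loop, before writing the file:

r = requests.get(img, stream=True)
r.raise_for_status()  # raises requests.exceptions.HTTPError on a 4xx/5xx response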

Answer 3

If all those URLs share a common prefix, I'd be tempted to just use the suffix with the slashes turned into something else. I'd also add some error checking to make sure each request worked.

The following code will download files to names like 000_103_975_thumbnail_compressed.jpg:

import requests
import pathlib

image_urls = [
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/975/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/105/093/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/984/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/107/697/thumbnail/compressed.jpg'
]
prefix = 'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/'

for url in image_urls:
    # turn the url into something suitable for local use
    out = pathlib.Path(url.removeprefix(prefix).replace('/', '_'))

    # no point fetching something we've already got
    # you can delete the file to retry if you really want that
    if out.exists():
        print(f"already saved {url} as {out}")
        continue

    # open the file early; a failed request leaves an empty file, which the exists() check above will then skip
    with open(out, 'wb') as fd, requests.get(url, stream=True) as resp:
        # don't want to save HTTP 404 or 501, leave these empty
        if not resp.ok:
            print(f"HTTP server error while fetching {url}:", resp)
            continue
        for chunk in resp.iter_content(2**18):
            fd.write(chunk)
        print(f"{url} saved to {out}")