Difficulty Retrieving Highest Quality Images from Wikimedia Using get_main_image Function


I'm facing an issue with the get_main_image function in my Python script, which scrapes images from Wikimedia. The problem is that the function downloads smaller images instead of the highest-quality versions available.

Here's a brief overview of the issue:

  • The get_main_image function is responsible for retrieving and saving images from Wikimedia.
  • However, it seems to be consistently downloading smaller or lower-quality versions of the images.
  • My goal is to modify the function to ensure it retrieves the largest and clearest version of the image available on Wikimedia.

I suspect that there might be a flaw in how the function identifies and fetches the image URLs or perhaps in the selection process for the image quality.
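
To check this, I tried printing what the page actually embeds for the example file linked at the bottom of this post. This is only a rough sketch; the fullImageLink class is just what I currently see in the Commons page markup:

import requests
from bs4 import BeautifulSoup

page_url = "https://commons.wikimedia.org/wiki/File:Map_of_Potential_Nuclear_Strike_Targets_(c._2015),_FEMA.png"
html = requests.get(page_url, headers={'User-Agent': 'Mozilla/5.0'}).text
img = BeautifulSoup(html, 'html.parser').find('div', class_='fullImageLink').find('img')
print(img.get('src'))     # the scaled-down preview that my function ends up saving
print(img.get('srcset'))  # comma-separated thumbnail candidates (may be None), not a single URL

If both attributes only point at .../thumb/... URLs, that would explain why the saved files are smaller than the originals.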

Below is a simplified version of the get_main_image function:

import requests
from io import BytesIO

from bs4 import BeautifulSoup as bs
from PIL import Image

def get_main_image(wiki_link, article, save_dir, IMAGE_NUM):
  # access_token is defined elsewhere in the full script
  headers = {
    "Authorization": f"Bearer {access_token}",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
  }
  image_url = wiki_link + article.replace(" ", "_")
  image_name = str(IMAGE_NUM + 1)
  response = requests.get(image_url)
  soup = bs(response.text, 'html.parser')

  try:
    main_image_url = soup.find('img', alt=article).get('srcset')
    main_image_response = requests.get(main_image_url, headers=headers, stream=True)
  except Exception as e:
    #print(e)
    try:
      main_image_url = soup.find('img', alt=article).get('src')
      main_image_response = requests.get(main_image_url, headers=headers, stream=True)
    except:
      return image_url, None

  #print(article.replace(" ", "_")[5:])
  #print(article[-4:])
  if article[-4:] == ".svg":
    image = Image.open(BytesIO(main_image_response.content))
    image_name = image_name + ".png"
    save_path = save_dir + "//" + image_name 
    #print(article.replace(" ", "_")[5:] + ".png")
    image.save(save_path)
  elif article[-5:] == ".djvu":
    image = Image.open(BytesIO(main_image_response.content))
    image_name = image_name + ".jpg"
    save_path = save_dir + "//" + image_name
    #print(article.replace(" ", "_")[5:] + ".jpg")
    image.save(save_path)
  else:
    image = Image.open(BytesIO(main_image_response.content))
    image_name = image_name + article[-4:]
    save_path = save_dir + "//" + image_name
    #print(article.replace(" ", "_")[5:])
    #print("I haven't caused an error yet")
    try:
      image.save(save_path)
    except Exception as e:
      image_name = None
  return image_url, image_name

EDIT: For example, the main image on this page downloads as 1.09 MB when the original is 4.33 MB: https://commons.wikimedia.org/wiki/File:Map_of_Potential_Nuclear_Strike_Targets_(c._2015),_FEMA.png


Answer by Andrej Kesely:

If I understand you correctly, you can use the Wikimedia Commons API to get the URL of the full-sized image, for example:

import requests
from bs4 import BeautifulSoup

api_url = "https://magnus-toolserver.toolforge.org/commonsapi.php"
image_name = "Map_of_Potential_Nuclear_Strike_Targets_(c._2015),_FEMA.png"

soup = BeautifulSoup(requests.get(api_url, params={"image": image_name}).content, "xml")
# print(soup.prettify())

print(soup.urls.file.text)

Prints:

https://upload.wikimedia.org/wikipedia/commons/7/7e/Map_of_Potential_Nuclear_Strike_Targets_%28c._2015%29%2C_FEMA.png
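
If depending on the toolserver endpoint is a concern, the original-file URL can also be fetched from the official MediaWiki API on commons.wikimedia.org via prop=imageinfo. A minimal sketch (the User-Agent string is only a placeholder):

import requests

api_url = "https://commons.wikimedia.org/w/api.php"
params = {
    "action": "query",
    "titles": "File:Map_of_Potential_Nuclear_Strike_Targets_(c._2015),_FEMA.png",
    "prop": "imageinfo",
    "iiprop": "url|size",
    "format": "json",
}
data = requests.get(api_url, params=params, headers={"User-Agent": "image-fetch-sketch/0.1"}).json()
info = next(iter(data["query"]["pages"].values()))["imageinfo"][0]
print(info["url"])                    # original full-resolution file on upload.wikimedia.org
print(info["width"], info["height"])  # pixel dimensions of the original

The returned url could then be passed to requests.get(..., stream=True) inside get_main_image in place of the src/srcset value.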