Is there any way to extract images from Wikipedia based on the URL using MediaWiki?


I have a Wikipedia URL and want to extract the URL of every content image on the page.

I tried normal web scraping with BeautifulSoup: I fetch the URL and look for images with the class "thumbimage" to get the content images. However, on some pages every image has the class "mw-file-element", including the Wikipedia logo and other unwanted images, so the script imports everything. I'm therefore looking for a new solution using either MediaWiki or the Wikipedia API.


2 Answers

XceeD

I'm not familiar with MediaWiki or the Wikipedia API, but I can give you some hints about how to improve the BeautifulSoup approach.

  1. First, restrict your extraction / scraping to only the 'article' portion of a Wikipedia page. To do so, target <main id="content">. That excludes the header, the footer and the navbar, leaving you with images from the article only.

  2. After this, check the images' alt and typeof attributes to keep only the desired ones. The images in the article all have a typeof="mw:File/xyz" value. You can use this to decide whether you want that type of image in your result set, e.g. typeof="mw:File/Thumb" marks the normal thumbnails displayed throughout the article, and typeof="mw:File/Frameless" is often the 'main image' of the article, in the infobox on the right.

  3. Alternatively, you could analyse the alt attribute of those images, define rules matching your idea of which images are right or wrong, and discard images based on that.

Based on those points, you should be able to scrape every image in a Wikipedia article without getting many (if any) unwanted images.
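The steps above can be sketched with BeautifulSoup. Note that the <main id="content"> selector and the typeof values come from the answer itself; the HTML snippet below is a made-up stand-in for a fetched Wikipedia page, just to show the filtering:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the structure described above: a logo
# outside the article body, plus a lead image and a thumbnail inside it.
html = '''
<html><body>
<header><img src="/static/images/wikipedia-logo.png"></header>
<main id="content">
  <span typeof="mw:File/Frameless"><img src="//upload.example/lead.jpg" alt="Lead image"></span>
  <span typeof="mw:File/Thumb"><img src="//upload.example/thumb1.jpg" alt="A thumbnail"></span>
</main>
</body></html>
'''

soup = BeautifulSoup(html, 'html.parser')

# Step 1: restrict the search to the article body only.
article = soup.find('main', id='content')

# Step 2: keep only the desired typeof values (find_all accepts a list
# of attribute values and matches any of them).
wanted = ['mw:File/Thumb', 'mw:File/Frameless']
urls = [
    img['src']
    for holder in article.find_all(typeof=wanted)
    for img in holder.find_all('img')
]

print(urls)  # ['//upload.example/lead.jpg', '//upload.example/thumb1.jpg']
```

The logo never appears in the result because it sits outside <main id="content">, so step 1 alone already discards it; the typeof filter then handles icons and decorations inside the article.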

InSync

MediaWiki has an API for retrieving images used on a page; just use it:

import requests


api_endpoint = 'https://en.wikipedia.org/w/api.php'

response = requests.get(api_endpoint, {
  'action': 'query',
  'prop': 'images',
  'titles': 'Alan Turing',
  'format': 'json',
  'formatversion': 2
}).json()
images = response['query']['pages'][0]['images']

print(images)

'''
[
  {'ns': 6, 'title': 'File:20130808 Kings College Front Court Fountain Crop 03.jpg'},
  {'ns': 6, 'title': 'File:Alan Turing (5025990183).jpg'},
  {'ns': 6, 'title': 'File:Alan Turing 78 High Street Hampton blue plaque.jpg'},
  # ...
]
'''

However, this only returns the names of the files, not their URLs. If you need those, make another query (note that the titles parameter accepts at most 50 titles per request for ordinary clients, so long articles need batching):

response = requests.get(api_endpoint, {
  'action': 'query',
  'titles': '|'.join(image['title'] for image in images),
  'prop': 'imageinfo',
  'iiprop': 'url',
  'format': 'json',
  'formatversion': 2
}).json()

image_1_info = response['query']['pages'][0]['imageinfo']

print(image_1_info)

'''
[
  {
    'url': 'https://upload.wikimedia.org/wikipedia/commons/4/46/20130808_Kings_College_Front_Court_Fountain_Crop_03.jpg',
    'descriptionurl': 'https://commons.wikimedia.org/wiki/File:20130808_Kings_College_Front_Court_Fountain_Crop_03.jpg',
    'descriptionshorturl': 'https://commons.wikimedia.org/w/index.php?curid=27976001'
  }
]
'''
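To go from the second response to a flat list of URLs, a small helper can walk the pages array. This is a sketch; the sample dict below is hypothetical, shaped like the formatversion=2 response shown above:

```python
def image_urls(query_response):
    """Flatten a prop=imageinfo query response into a list of file URLs."""
    urls = []
    for page in query_response['query']['pages']:
        # Pages without an accessible file (e.g. missing titles) carry
        # no 'imageinfo' key, so default to an empty list.
        for info in page.get('imageinfo', []):
            urls.append(info['url'])
    return urls


# Hypothetical sample shaped like the formatversion=2 response above.
sample = {
    'query': {
        'pages': [
            {'title': 'File:Example.jpg',
             'imageinfo': [{'url': 'https://upload.wikimedia.org/.../Example.jpg'}]},
            {'title': 'File:Missing.jpg'},  # no imageinfo key
        ]
    }
}

print(image_urls(sample))  # ['https://upload.wikimedia.org/.../Example.jpg']
```

If you'd rather avoid the second round trip entirely, the same API supports combining both steps in one request with generator=images and prop=imageinfo, which resolves the file URLs directly.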