I have a Wikipedia URL and want to extract every content image URL.
I tried normal web scraping with BeautifulSoup: I fetch the URL and look for images with the class `thumbimage` to get the content images. But there are pages where all images have the class `mw-file-element` instead, including the Wikipedia logo and other unnecessary ones, which means the script imports everything.
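A minimal version of what I have (the URL is just an example page):

```python
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Fine on some pages, but on others every image (logo included)
# only carries the class "mw-file-element", so this finds nothing
# and a class-agnostic search would grab everything.
for img in soup.find_all("img", class_="thumbimage"):
    print(img.get("src"))
```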
So I'm trying to find a new solution using either MediaWiki or the Wikipedia API.
I'm not familiar with MediaWiki or the Wikipedia API, but I can give you some hints on how to proceed with the BeautifulSoup approach.
The first thing is that you want to restrict your extraction / scraping to only the 'article' portion of a Wikipedia page. To do so, you can target `<main id="content">`. That will exclude the header, the footer and the navbar, leaving you with images from the article only.
After this, you want to check the images' `alt` and `typeof` attributes to keep only the desired ones. The images in the article all have a `typeof="mw:File/xyz"` value. You could use this to decide whether a given type of image belongs in your result set, e.g. `typeof="mw:File/Thumb"` marks the normal thumbnail images displayed throughout the article, and `typeof="mw:File/Frameless"` is often the 'main image' of the article, in the infobox on the right.
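A sketch of that filter, continuing from the snippet above. One assumption to flag: depending on the page, the `typeof` attribute may sit on the `<img>` itself or on a wrapping `<a>`/`<figure>`/`<span>`, so this checks the image and its two nearest ancestors:

```python
WANTED = ("mw:File/Thumb", "mw:File/Frameless")

def wanted(img):
    """Keep an <img> only if it (or a close wrapper) carries a wanted typeof."""
    for node in (img, img.parent, img.parent.parent):
        if node is not None and node.get("typeof") in WANTED:
            return True
    return False

urls = [img.get("src") for img in article.find_all("img") if wanted(img)]
```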
Otherwise, you could analyse the `alt` property of those images, define some rules that fit your vision of which images are right or wrong, and discard them based on that.

Based on those two points, you should be able to scrape every image in a Wikipedia article without getting too many (if any) unwanted images.
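Putting both points together as one function; the `typeof` values kept and the alt-based rule are illustrative placeholders you would tune to your own pages:

```python
import requests
from bs4 import BeautifulSoup

def article_image_urls(url, keep=("mw:File/Thumb", "mw:File/Frameless")):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    article = soup.find("main", id="content")  # point 1: article body only
    urls = []
    for img in article.find_all("img"):
        # Point 2: keep only images whose typeof (on the tag itself or
        # on a close wrapper) is one of the wanted values.
        wrappers = (img, img.parent, img.parent.parent)
        if not any(w is not None and w.get("typeof") in keep for w in wrappers):
            continue
        # Optional alt-based rule (placeholder): drop obvious icons.
        if "icon" in (img.get("alt") or "").lower():
            continue
        src = img.get("src", "")
        # Wikipedia serves protocol-relative URLs ("//upload.wikimedia.org/...").
        urls.append("https:" + src if src.startswith("//") else src)
    return urls

print(article_image_urls("https://en.wikipedia.org/wiki/Python_(programming_language)"))
```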