Extracting specific articles and their talk pages from a Wikipedia dump


I am completely new to web crawling. I have the following link to the Wikipedia dumps: https://dumps.wikimedia.org/backup-index.html. I also have a list of article titles, all in English.

I need to download those articles and their talk pages from the given dumps. Kindly let me know where to start.


There is 1 solution below


That depends a lot on your use case. Do you have a relatively small set of pages to fetch (say, a few hundred)? Then use the API: it can give you both wikitext and HTML, while the dumps give you only wikitext.
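As a rough illustration of the API route, here is a minimal sketch that fetches the current wikitext of an article and its talk page through the MediaWiki `action=query` endpoint. The helper names (`talk_title`, `build_query`, `fetch_wikitext`) and the User-Agent string are made up for this example, and it assumes plain main-namespace titles, whose talk pages live under the `Talk:` prefix.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def talk_title(title):
    """Return the talk page title for a main-namespace article title."""
    return "Talk:" + title


def build_query(titles):
    """Build an action=query URL asking for the latest wikitext of each title."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": "|".join(titles),  # multiple titles are separated by "|"
        "format": "json",
        "formatversion": "2",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_wikitext(titles):
    """Return {title: wikitext or None}; None marks pages without content."""
    request = urllib.request.Request(
        build_query(titles),
        # Placeholder User-Agent: replace with your own tool name and contact.
        headers={"User-Agent": "dump-example/0.1 (you@example.org)"},
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    result = {}
    for page in data["query"]["pages"]:
        revisions = page.get("revisions")
        if revisions:
            result[page["title"]] = revisions[0]["slots"]["main"]["content"]
        else:
            result[page["title"]] = None  # missing or invalid title
    return result
```

Calling `fetch_wikitext(["Alan Turing", talk_title("Alan Turing")])` should return a dictionary keyed by the titles as the API reports them, which may be normalized forms of what you passed in. For a list of a few hundred titles, send them in small batches rather than all at once, since the API caps the number of titles per request.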

If you need to use the dumps, or just want to learn the best way to work with them, https://en.wikipedia.org/wiki/Wikipedia:Database_download#How_to_use_multistream? might be good study material.
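To give an idea of how the multistream approach works, here is a minimal sketch under a few assumptions. The index file lists one page per line as `offset:page_id:title`. The dump file is a concatenation of independent bz2 streams. Each stream decompresses to a run of `<page>` elements without a root element. The function names are invented for this example. Also note that, as far as I know, the `pages-articles` dumps leave out talk pages, so for those you would need a dump that includes them (such as `pages-meta-current`).

```python
import bz2
import xml.etree.ElementTree as ET


def find_offsets(index_lines, wanted):
    """Map each wanted title to the byte offset of the bz2 stream holding it."""
    offsets = {}
    for line in index_lines:
        # Split at most twice: titles such as "Talk:Foo" contain colons.
        offset, _page_id, title = line.rstrip("\n").split(":", 2)
        if title in wanted:
            offsets[title] = int(offset)
    return offsets


def read_stream(dump_file, offset, chunk_size=256 * 1024):
    """Decompress the single bz2 stream that starts at `offset`."""
    dump_file.seek(offset)
    decompressor = bz2.BZ2Decompressor()
    parts = []
    while not decompressor.eof:
        block = dump_file.read(chunk_size)
        if not block:
            break  # file ended before the stream did
        parts.append(decompressor.decompress(block))
    return b"".join(parts).decode("utf-8")


def pages_in_stream(xml_fragment, wanted):
    """Return {title: wikitext} for the wanted pages in a decompressed stream."""
    # The fragment has no root element, so wrap it to make it well-formed.
    root = ET.fromstring("<pages>" + xml_fragment + "</pages>")
    found = {}
    for page in root.iter("page"):
        title = page.findtext("title")
        if title in wanted:
            found[title] = page.findtext("revision/text")
    return found
```

In practice you would open the index with `bz2.open(index_path, "rt", encoding="utf-8")`, pass it to `find_offsets`, then open the dump in binary mode and call `read_stream` once per distinct offset, because many of your titles may share a stream.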