What is the best way to expand the wikitexts of a full Wikipedia dump?


It is easy to download Wikipedia dumps in XML format. However, the content of the articles is written in wikitext, which has a template system. To extract clean full text from these dumps, the templates have to be expanded. Wikipedia provides an API to do so, but it is not suitable for expanding an entire dump. Several scripts exist for dealing with wikitext, such as this one written in Python, but they all seem outdated or simply don't handle templates. Another approach would be to run MediaWiki locally and use API:Expandtemplates, but that seems quite cumbersome. Finally, HTML dumps also exist, but I prefer to work with expanded wikitext since it makes it easier to deal with wikilinks, tables, sections, etc.
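For context, reading the raw wikitext out of the XML dump is the easy part. Here is a minimal sketch using only the Python standard library; the names `iter_pages`, `local`, and `SAMPLE` are made up for illustration, and the sample document is a tiny stand-in for a real dump, not actual Wikipedia data.

```python
import io
import xml.etree.ElementTree as ET

def local(tag):
    """Drop the XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children to limit memory growth

# Tiny hand-written stand-in for a dump (assumption: same element layout).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Example</title><revision><text>Hello [[world]] {{stub}}</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE.encode("utf-8"))))
```

For a real compressed dump, the stream could be opened with `bz2.open(path, "rb")` instead of the in-memory sample. The text that comes out is still unexpanded wikitext, which is the actual problem.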

My goal here is to extract clean text while keeping the wikilinks and discarding complicated templates such as infoboxes. Do you have any idea how to tackle this template expansion problem?
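To illustrate the "discard" half of that goal, here is a rough sketch that simply drops every `{{...}}` span (nested ones included) and leaves wikilinks untouched. It does not expand anything, so text that templates would have produced is lost. The function names are hypothetical, and edge cases such as triple-brace parameters (`{{{x}}}`) or templates that emit table markup are not handled.

```python
import re

def strip_templates(wikitext):
    """Remove {{...}} spans, including nested ones, and keep everything else."""
    out = []
    depth = 0  # how many template levels are currently open
    i = 0
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(wikitext[i])
            i += 1
    return "".join(out)

# Captures the link target of [[Target]] and [[Target|label]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")

def wikilink_targets(wikitext):
    """Return the targets of all simple wikilinks in the text."""
    return WIKILINK.findall(wikitext)

raw = ("{{Infobox person|name={{nowrap|Ada}}}}"
       "[[Ada Lovelace|Ada]] was a [[mathematician]].{{citation needed}}")
clean = strip_templates(raw)
targets = wikilink_targets(clean)
```

This kind of stripping is only a fallback: any sentence whose content comes from a template (dates, unit conversions, and so on) ends up with a hole in it, which is why proper expansion matters.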


There are 2 best solutions below

1
Robin On

I built a solution that uses Kiwix to get clean text from Wikipedia. The HTML produced by Kiwix seems easy to parse for my purpose. The code is no longer available (I didn't have time to make it shareable), but the approach turned out to be effective and quick to implement.
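Since the original code is gone, here is only a generic sketch of the HTML-parsing step, using the standard library's `html.parser`. It is not the code described above, and it assumes nothing specific about Kiwix's markup: the class name `TextAndLinks` and the sample HTML are invented for illustration.

```python
from html.parser import HTMLParser

class TextAndLinks(HTMLParser):
    """Collect visible text and link targets from an HTML page."""

    SKIP = {"script", "style"}  # elements whose contents are not article text

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.links = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)

    def text(self):
        # Join the pieces as-is, then collapse runs of whitespace.
        return " ".join("".join(self.chunks).split())

# Invented sample page (assumption: real pages are far larger and messier).
sample_html = (
    "<html><head><style>p{color:red}</style></head><body>"
    '<p>Ada was a <a href="Mathematician">mathematician</a>.</p>'
    "</body></html>"
)
parser = TextAndLinks()
parser.feed(sample_html)
page_text = parser.text()
page_links = parser.links
```

In practice you would probably also want to skip navigation boxes, reference lists, and infobox tables by tag or class, which depends on the markup you actually get.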

0
Palmik On

I believe that https://github.com/tatuylonen/wikitextprocessor/ does what you want:

This is a Python package for processing WikiMedia dump files for Wiktionary, Wikipedia, etc., for data extraction, error checking, offline conversion into HTML or other formats, and other uses. Key features include:

  • Parsing dump files, including built-in support for processing pages in parallel
  • Wikitext syntax parser that converts the whole page into a parse tree
  • Extracting template definitions and Scribunto Lua module definitions from dump files
  • Expanding selected templates or all templates, and heuristically identifying templates that need to be expanded before parsing is reasonably possible (e.g., templates that emit table start and end tags)
  • Processing and expanding wikitext parser functions
  • Processing, executing, and expanding Scribunto Lua modules (they are very widely used in, e.g., Wiktionary, for example for generating IPA strings for many languages)
  • Controlled expansion of parts of pages for applications that parse overall page structure before parsing but then expand templates on certain sections of the page
  • Capturing information from template arguments while expanding them, as template arguments often contain useful information not available in the expanded content.