It is easy to download dumps of Wikipedia in XML format. However, the content of the articles is written in wikitext, which has a template system. To extract clean full texts from these dumps, it is necessary to expand these templates. Wikipedia provides an API to do so, but it is not suitable for expanding an entire dump. Several scripts that deal with wikitext can be found, such as this one written in Python, but they all seem outdated or simply don't deal with templates. Another way of tackling the problem would be to run MediaWiki locally and use API:Expandtemplates, but that seems to be quite a cumbersome solution. Finally, HTML dumps also exist, but I prefer to work with expanded wikitext, since it makes it easier to deal with wikilinks, tables, sections, etc.
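For reference, this is roughly what calling the expandtemplates endpoint looks like (a minimal sketch using the public MediaWiki Action API and the requests library; it works for a handful of pages but needs at least one HTTP request per article, which is why it doesn't scale to a whole dump):

```python
import requests

def expand_wikitext(wikitext, api_url="https://en.wikipedia.org/w/api.php"):
    """Expand templates in a wikitext snippet via the MediaWiki API."""
    params = {
        "action": "expandtemplates",
        "text": wikitext,
        "prop": "wikitext",
        "format": "json",
    }
    resp = requests.get(api_url, params=params, timeout=30)
    resp.raise_for_status()
    # The expanded wikitext is returned under expandtemplates.wikitext.
    return resp.json()["expandtemplates"]["wikitext"]

if __name__ == "__main__":
    print(expand_wikitext("{{convert|1|km|mi}}"))
```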
My goal here is to extract clean text while keeping the wikilinks and discarding complicated templates such as infoboxes. Do you have any idea how to tackle this template expansion problem?
I put together a solution that uses Kiwix to get clean text from Wikipedia. The HTML produced by Kiwix seems easy to parse for my purpose. I no longer make the code available (I didn't have time to make it shareable), but it turned out to be effective and fast to implement.
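To give a rough idea of the parsing step, here is a minimal sketch using BeautifulSoup. It assumes the article HTML has already been extracted from the Kiwix/ZIM dump to a file, and that infoboxes show up as tables (the exact markup can vary between dumps); it keeps the wikilinks as (anchor text, target) pairs and drops the tables entirely:

```python
from bs4 import BeautifulSoup

def extract_text(html):
    """Return plain text and wikilinks from one Kiwix-rendered article."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop infoboxes and other tables instead of flattening them into text.
    for table in soup.find_all("table"):
        table.decompose()

    paragraphs = []
    links = []
    for p in soup.find_all("p"):
        paragraphs.append(p.get_text(" ", strip=True))
        # Keep wikilinks separately: anchor text plus link target.
        for a in p.find_all("a", href=True):
            links.append((a.get_text(strip=True), a["href"]))

    return "\n\n".join(paragraphs), links

if __name__ == "__main__":
    with open("article.html", encoding="utf-8") as f:
        text, links = extract_text(f.read())
    print(text[:500])
    print(links[:10])
```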