I've been looking at the Wikimedia abstracts dump file (enwiki-latest-abstract.xml.gz) for the last week and noticed that the abstracts for many items appear to be corrupted.
For example, the dumped abstract for the Wikipedia page on Alabama is:
<title>Wikipedia: Alabama</title>
<url>https://en.wikipedia.org/wiki/Alabama</url>
<abstract>(We dare defend our rights)</abstract>
Similarly, the abstract for the Abraham Lincoln item is:
<title>Wikipedia: Abraham Lincoln</title>
<url>https://en.wikipedia.org/wiki/Abraham_Lincoln</url>
<abstract>| term_start1 = March 4, 1847</abstract>
This appears to be a partial snippet from the infobox.
This kind of corruption seems to be present for a majority of items in enwiki-latest-abstract.xml.gz.
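One rough way to put a number on "a majority" is to stream the dump and count abstracts that look like leftover wikitext. The sketch below is only a heuristic under stated assumptions: the `looks_suspicious` rules are my own guesses based on the two examples above, the function names are hypothetical, and the `<feed>`/`<doc>` wrapper elements in the sample are assumed rather than taken from the dump's documentation. The parser itself only relies on the `<title>` and `<abstract>` elements shown above.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def looks_suspicious(abstract):
    """Heuristic guess (not an official rule): does this look like wikitext residue?"""
    text = (abstract or "").strip()
    if not text:
        return True
    # Infobox parameters and template/link markup tend to start like this.
    if text.startswith(("|", "{{", "}}", "[[")):
        return True
    # A fully parenthesized abstract, like the Alabama motto, is probably a fragment.
    return text.startswith("(") and text.endswith(")")


def scan_abstracts(stream):
    """Yield (title, abstract, suspicious) for each <abstract> in an XML stream."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    title = None
    for event, elem in context:
        if event != "end":
            continue
        if elem.tag == "title":
            title = elem.text
        elif elem.tag == "abstract":
            yield title, elem.text, looks_suspicious(elem.text)
            root.clear()  # discard already-parsed records so memory stays flat


if __name__ == "__main__":
    # Tiny in-memory sample; the wrapper elements and the last record are made up.
    sample = io.BytesIO(
        b"<feed>"
        b"<doc><title>Wikipedia: Alabama</title>"
        b"<abstract>(We dare defend our rights)</abstract></doc>"
        b"<doc><title>Wikipedia: Abraham Lincoln</title>"
        b"<abstract>| term_start1 = March 4, 1847</abstract></doc>"
        b"<doc><title>Wikipedia: Example</title>"
        b"<abstract>Example is a placeholder with a normal sentence.</abstract></doc>"
        b"</feed>"
    )
    rows = list(scan_abstracts(sample))
    bad = sum(1 for _, _, suspicious in rows if suspicious)
    print(f"{bad} of {len(rows)} abstracts look suspicious")
```

To run it against the real file, replace the sample with `gzip.open("enwiki-latest-abstract.xml.gz", "rb")` and pass that file object to `scan_abstracts`.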
I'd appreciate any advice on whether this is a bug or whether I'm misunderstanding this dump file.
Thanks!
This is probably just the extraction code behaving badly; it's not very sophisticated.
FWIW, Wikipedia has two different extract/summary APIs, which both seem to behave reasonably here (the older, api.php-based one is a bit broken but not completely broken):
Neither of those has dumps, though.
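For anyone who wants to compare against those APIs, here is a minimal sketch of how the request URLs could be built. The links did not survive in this thread, so it is my assumption that the two APIs meant are the TextExtracts module (`prop=extracts` on api.php) and the REST page summary endpoint. The function names and the User-Agent string are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote, urlencode


def textextracts_url(title, lang="en"):
    """URL for the api.php-based extract (assumed: TextExtracts, prop=extracts)."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,       # only the section before the first heading
        "explaintext": 1,   # plain text instead of HTML
        "format": "json",
        "titles": title,
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


def rest_summary_url(title, lang="en"):
    """URL for the REST page summary (assumed: /api/rest_v1/page/summary/)."""
    # Page titles use underscores for spaces; percent-encode everything else.
    encoded = quote(title.replace(" ", "_"), safe="")
    return f"https://{lang}.wikipedia.org/api/rest_v1/page/summary/{encoded}"


def fetch_json(url):
    """Fetch a URL and decode the JSON body (needs network access)."""
    # A descriptive User-Agent is expected; this value is a placeholder.
    request = urllib.request.Request(
        url, headers={"User-Agent": "abstract-check-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    print(textextracts_url("Abraham Lincoln"))
    print(rest_summary_url("Abraham Lincoln"))
```

Both are per-page requests, so this is only practical for spot-checking a handful of titles rather than as a replacement for the full dump.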