I'm writing a Python script to parse Wikipedia articles, and part of that process is parsing links. I'm trying to write a regular expression that matches in this way:
[[:Category:Anarchism by country|Anarchism by country]] -> :Category:Anarchism by country
[[Squatting|squat]] -> Squatting
[[File:Jarach and Zerzan.JPG|thumb|Lawrence Jarach (left) and [[John Zerzan]] (right) -> John Zerzan
* {{cite book |last=Avrich |first=Paul |author-link=Paul Avrich |title=[[Anarchist Voices: An Oral History of Anarchism in America]] |year=1996 |publisher=[[Princeton University Press]] |isbn=978-0-691-04494-1 -> Unmatched, begins with * {{ (citation)
I've reached \[\[([^|\]]+)(?:\|[^|\]]+)?\]\] which works in 3 of the above examples, but in the citation it matches the title and the publisher. I know (I think) I need a negative lookahead to prevent any matches in the last example. I'm very bad with regex however, so any suggestions would be greatly appreciated.
Wikitext is quite complicated and should not be parsed with regexes alone. Instead, use a full-fledged parser, such as mwparserfromhell:
For the following wikitext (partly generated by ChatGPT):
...it outputs:
Unfortunately, mwparserfromhell doesn't recognize namespaces, so you will have to check for File links on your own if you use it. I use a crude .startswith('File') in the function above, but you might want to make a better check, since namespace names are case-insensitive: file and fIlE are both valid and mean the same as File.
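A slightly more careful check could look like the sketch below. It is only an illustration, not part of the function above: the helper name is_file_link and the FILE_NAMESPACES set are made up for this example, and the set assumes the English Wikipedia names (File and its Image alias); other wikis use localized names that you would have to add yourself. It compares only the part before the first colon, case-insensitively, and treats a leading colon as a plain link to the file page rather than an embedded file.

```python
# Assumption: namespace names as used on English Wikipedia ("Image" is an
# alias of "File"). Localized wikis need their own names added here.
FILE_NAMESPACES = {"file", "image"}


def is_file_link(title: str) -> bool:
    """Return True if a wikilink title embeds a file, e.g. "File:X.jpg"."""
    title = title.strip()
    # "[[:File:X.jpg]]" links to the file page instead of embedding the file.
    if title.startswith(":"):
        return False
    # Compare only the text before the first colon, so that an article
    # title such as "Filesystem" is not mistaken for a file link.
    namespace, sep, _ = title.partition(":")
    return bool(sep) and namespace.strip().casefold() in FILE_NAMESPACES
```

With mwparserfromhell you would presumably pass it str(link.title) for each wikilink node. Unlike .startswith('File'), this accepts fIlE:Example.png and rejects a title like Filesystem.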