Regex that excludes groups preceded by curly brackets and only matches text within the first section of the bracket

Question

47 views · Asked by jemhop on 03 September 2023 at 05:29

I'm writing a Python script to parse Wikipedia articles, and part of that process is parsing links. I'm trying to write a regular expression that matches in this way:

[[:Category:Anarchism by country|Anarchism by country]] -> :Category:Anarchism by country
[[Squatting|squat]] -> Squatting
[[File:Jarach and Zerzan.JPG|thumb|Lawrence Jarach (left) and [[John Zerzan]] (right) -> John Zerzan
* {{cite book |last=Avrich |first=Paul |author-link=Paul Avrich |title=[[Anarchist Voices: An Oral History of Anarchism in America]] |year=1996 |publisher=[[Princeton University Press]] |isbn=978-0-691-04494-1 -> Unmatched, begins with * {{ (citation)

I've reached \[\[([^|\]]+)(?:\|[^|\]]+)?\]\] which works for three of the examples above, but in the citation it matches both the title and the publisher. I know (I think) I need a negative lookahead to prevent any matches in the last example. I'm very bad with regex, however, so any suggestions would be greatly appreciated.
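For reference, the behaviour described above can be reproduced with Python's re module. This is only a sketch: the sample strings are the four examples from the question, and the names LINK_RE, samples and results are illustrative.

```python
import re

# The pattern from the question: capture the link target and
# skip an optional "|label" part before the closing brackets.
LINK_RE = re.compile(r"\[\[([^|\]]+)(?:\|[^|\]]+)?\]\]")

# The four example lines from the question, keyed by illustrative names.
samples = {
    "category": "[[:Category:Anarchism by country|Anarchism by country]]",
    "piped": "[[Squatting|squat]]",
    "file": (
        "[[File:Jarach and Zerzan.JPG|thumb|Lawrence Jarach (left) "
        "and [[John Zerzan]] (right)"
    ),
    "citation": (
        "* {{cite book |last=Avrich |first=Paul |author-link=Paul Avrich "
        "|title=[[Anarchist Voices: An Oral History of Anarchism in America]] "
        "|year=1996 |publisher=[[Princeton University Press]] "
        "|isbn=978-0-691-04494-1"
    ),
}

# findall returns the contents of the single capturing group for each match.
results = {name: LINK_RE.findall(text) for name, text in samples.items()}

for name, found in results.items():
    print(name, found)
```

The citation sample yields both the title and the publisher, because the pattern has no notion of the surrounding {{...}} template.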

**InSync** · Answer 1 · 2023-09-03T11:37:44.327000

Wikitext is quite complicated and should not be parsed with regexes alone. Instead, use a full-fledged parser, such as mwparserfromhell:

import mwparserfromhell as mph

def get_links_outside_of_templates(text):
    tree = mph.parse(text)
    # Lazily iterate over top-level wikilinks only;
    # links nested inside templates are not visited.
    links = tree.ifilter_wikilinks(recursive=False)

    for link in links:
        if link.title.startswith('File'):
            # If this is a File link, recursively parse its "text".
            yield from get_links_outside_of_templates(link.text)
        else:
            yield link.title

# `text` holds the wikitext to parse (see the sample below).
print([*get_links_outside_of_templates(text)])

For the following wikitext (partly generated by ChatGPT):

'''Squatting''' may refer to [[Squatting|squat]], the act of occupying an abandoned or unused property without legal permission.

== Foo ==

[[File:Jarach and Zerzan.JPG|thumb|Lawrence Jarach (left) and [[John Zerzan]] (right)]]

Lorem ipsum dolor sit [[amet]], consectetur adipiscing elit. Vestibulum interdum, neque nec aliquet venenatis, tortor erat commodo nulla, id imperdiet mi urna eget nunc.

== References ==
* {{cite book
  |last=Avrich |first=Paul |author-link=Paul Avrich
  |title=[[Anarchist Voices: An Oral History of Anarchism in America]]
  |year=1996 |publisher=[[Princeton University Press]]
  |isbn=978-0-691-04494-1
  }}

[[:Category:Anarchism by country|Anarchism by country]]

...it outputs:

['Squatting', 'John Zerzan', 'amet', ':Category:Anarchism by country']

Unfortunately, mwparserfromhell doesn't recognize namespaces, so you will have to check for File links yourself if you use it. I use a crude .startswith('File') in the function above, but you might want a more robust check, since namespace names are case-insensitive: file and fIlE are both valid and mean the same as File.
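A more robust check might look like the sketch below. It is not part of mwparserfromhell: is_file_link is a hypothetical helper, and the set of accepted namespace names (including the legacy "Image" alias) is an assumption that may need adjusting for a given wiki. A title with a leading colon, such as [[:File:...]], is treated as an ordinary link to the file page rather than an embedded file.

```python
def is_file_link(title, namespaces=("file", "image")):
    """Return True if a wikilink title embeds a file, ignoring case.

    `namespaces` is an assumed list of File namespace names, in lowercase.
    """
    title = str(title).strip()
    # A leading colon links to the page instead of embedding the file.
    if title.startswith(":"):
        return False
    # Split off the part before the first colon, if there is one.
    namespace, sep, _ = title.partition(":")
    return bool(sep) and namespace.strip().lower() in namespaces


for sample in ("File:Jarach and Zerzan.JPG", "fIlE:x.png", "Squatting"):
    print(sample, is_file_link(sample))
```

In the function above, the condition link.title.startswith('File') could then become is_file_link(link.title), which also avoids false positives for ordinary titles that merely begin with "File".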
