Regex capturing too much


I have a problem with a .NET regex that I need to create for my AutoWikiBrowser bot on Wikipedia.

The example is rather long, but I need an even longer regex to find up to 14 language indication templates (2-3 letters inside double curly brackets, e.g. {{ab}}) and merge them into a single template (e.g. {{ab}} {{cd}} {{ef}} {{gh}} => {{mul|ab|cd|ef|gh}}).

Here is my regex:

Find: \{\{ *(ab|cd|ef|gh) *\}\} *\{\{ *(ab|cd|ef|gh) *\}\} *(\{\{ *(ab|cd|ef|gh) *\}\})* *(\{\{ *(ab|cd|ef|gh) *\}\})* *(\{\{ *(ab|cd|ef|gh) *\}\})* *(\{\{ *(ab|cd|ef|gh) *\}\})*

Replace: {{mul|$1|$2|$4|$6|$8|$10}}

It actually works as intended, except when templates are not separated by a space: then the last templates aren't captured properly. You can see the problem with the first line of the test string here: https://regex101.com/r/nMUg0J/2

I think I should use a lookaround, but I can't even find where the problem is.
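The symptom can be reproduced outside regex101. The sketch below ports the Find pattern to Python's re module and lists what lands in the groups the Replace string uses. Python is only a stand-in here (AutoWikiBrowser uses the .NET engine; I am assuming the numbered groups behave the same way for this pattern), and the helper name captured_codes is made up for illustration.

```python
import re

# One language template, exactly as written in the Find pattern.
TEMPLATE = r"\{\{ *(ab|cd|ef|gh) *\}\}"

# Two mandatory templates followed by four optional, starred ones.
QUESTION_PATTERN = re.compile(
    TEMPLATE + " *" + TEMPLATE + "".join(" *(" + TEMPLATE + ")*" for _ in range(4))
)


def captured_codes(text):
    """Return the contents of groups 1, 2, 4, 6, 8 and 10 for the first match."""
    match = QUESTION_PATTERN.search(text)
    return [match.group(i) for i in (1, 2, 4, 6, 8, 10)]


print(captured_codes("{{ab}} {{cd}} {{ef}} {{gh}}"))
# ['ab', 'cd', 'ef', 'gh', None, None]
print(captured_codes("{{ab}}{{cd}}{{ef}}{{gh}}"))
# ['ab', 'cd', 'gh', None, None, None]
```

In the unspaced case the first starred group can repeat over every remaining template, and a numbered reference to a repeated group only holds its last iteration, which would be consistent with the lost codes.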

Note that this regex will create templates with useless pipes if there aren't enough templates to merge, but I'll run this other regex after the first one to remove them: https://regex101.com/r/MuIiWS/1


There are 2 best solutions below

Nick On BEST ANSWER

This is almost certainly more easily achieved by using a replacement function with a single regex, but if you are restricted to regex only, a possibly easier solution is to first replace the }}{{ between templates with a | and then add the mul| at the beginning of any multiple-language template. So first, replace:

(?<=ab|cd|ef|gh) *}} *{{ *(?=ab|cd|ef|gh)

with | (demo on regex101), and then replace

(?<={{)(?=(?:ab|cd|ef|gh)\|)

with mul| (demo on regex101)
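As a rough illustration, both routes can be sketched in Python. This is an assumption-laden stand-in, not AutoWikiBrowser code: Python's re replaces the .NET engine, the braces are escaped for safety, and the function names merge_two_step and merge_with_callback are invented for the example. The lookbehind works in Python only because every code in this sample set has the same length.

```python
import re

CODES = "ab|cd|ef|gh"  # sample codes from the question

# Step 1: the "}}{{" (with optional spaces) between two language templates.
BETWEEN = re.compile(r"(?<=" + CODES + r") *\}\} *\{\{ *(?=" + CODES + r")")
# Step 2: the empty position right after "{{" when a code and a pipe follow.
FRONT = re.compile(r"(?<=\{\{)(?=(?:" + CODES + r")\|)")


def merge_two_step(text):
    """Apply the two regex-only replacements one after the other."""
    text = BETWEEN.sub("|", text)
    return FRONT.sub("mul|", text)


# Alternative: one regex for a run of two or more templates, plus a callback.
ONE = r"\{\{ *(" + CODES + r") *\}\}"
RUN = re.compile(ONE + r"(?: *" + ONE + r")+")


def merge_with_callback(text):
    """Rebuild each run of templates inside a replacement function."""
    def build(match):
        codes = re.findall(ONE, match.group(0))
        return "{{mul|" + "|".join(codes) + "}}"
    return RUN.sub(build, text)


sample = "{{ab}} {{cd}}{{ef}} {{gh}} text"
print(merge_two_step(sample))       # {{mul|ab|cd|ef|gh}} text
print(merge_with_callback(sample))  # {{mul|ab|cd|ef|gh}} text
```

A lone template such as {{ab}} is left untouched by both functions, so no cleanup of empty pipes is needed afterwards.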

oli_vi_er On

Thanks to @Nick, who solved the problem I raised above, and thanks also for the tip in the comments on this post (I edited the regexes below accordingly).

I'll just mention a few issues I encountered that are specific to Wikipedia, where I made the edits:

If the preceding template ends, or the following one begins, with letters that match one of the codes in the regex, its curly brackets are modified along with the ones between the language indication templates: https://regex101.com/r/yY5H0C/1

It is then necessary to wrap the language codes in \b word boundaries: https://regex101.com/r/YcZZVQ/2
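A small Python sketch of the difference, under stated assumptions: Python's re stands in for the .NET engine, the patterns are my reading of the fix rather than the exact regexes behind the links, and the template names {{Lab}} and {{cdrom}} are hypothetical neighbours chosen to trigger the problem.

```python
import re

CODES = "ab|cd|ef|gh"  # sample codes from the question

# Without boundaries: any template ending or starting with a code is affected.
NAIVE = re.compile(r"(?<=" + CODES + r") *\}\} *\{\{ *(?=" + CODES + r")")

# With \b on both sides, each code must be a whole word.
BOUNDED = re.compile(
    r"(?<=\b(?:" + CODES + r")\b) *\}\} *\{\{ *(?=\b(?:" + CODES + r")\b)"
)


def join_naive(text):
    """Replace the brackets between templates, without word boundaries."""
    return NAIVE.sub("|", text)


def join_bounded(text):
    """Same replacement, but only between whole-word language codes."""
    return BOUNDED.sub("|", text)


print(join_naive("{{Lab}}{{cd}}"))      # {{Lab|cd}}      (unwanted merge)
print(join_bounded("{{Lab}}{{cd}}"))    # {{Lab}}{{cd}}   (left alone)
print(join_bounded("{{ab}}{{cdrom}}"))  # {{ab}}{{cdrom}} (left alone)
print(join_bounded("{{ab}} {{cd}}"))    # {{ab|cd}}
```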

The complete process, taking the surrounding wikicode into account, needs 4 steps:

  1. replacing the curly brackets between the language templates with pipes: https://regex101.com/r/ssQxSO/2
  2. adding the template name in front: https://regex101.com/r/NfohdT/4
  3. removing any leftover space at the end of the template, and adding a space, where one is missing, before the next template/link/text: https://regex101.com/r/tVfxG8/2
  4. adding a space, where one is missing, after the previous template/link/text: https://regex101.com/r/jw6Lt8/2

The last two operations are purely "cosmetic": they do not affect what the reader of the article will see, but they make the wikicode cleaner.

Thanks again to @Nick for his idea and his regexes, which have enabled me to deal with over 17,000 pages with problems in Wikipedia in French, and will no doubt facilitate future maintenance operations of this type when the problem occurs again.