How do you match part of a repeating group to a part from a previous repeat, in Regex?


Suppose I have a data storage or filing system that accepts a few formats (for legacy reasons, not my own design).

For example, I need to accept

abcd.efgh.1234.4567
abcd-efgh-1234-4567
abcd|efgh|1234|4567

but not

abcd.efgh-1234|4567

Basically, I need to be consistent about the delimiters I use. I am trying to construct a regex that can check that, but I am finding it really tricky. I have explored regex backreferences and see how they would work for finding repeats like abc-abc-abc, but in my case the abcd part must be allowed to differ; only the delimiter has to stay the same.

Here's what I've got so far (link to Regex101):

(([a-z1-9]){4}([\.:|])){3}(([a-z1-9]){4})

I need to somehow give a backreference to that ([\.:|]), but I can't put it inside the group, since the group containing it is the very thing being repeated.

Is there any way to do this in Regex?


There are 2 best solutions below

0
On BEST ANSWER

You can capture the delimiter when it first appears, and then back reference it later:

[a-z1-9]{4}([.:|])(?:[a-z1-9]{4}\1){2}[a-z1-9]{4}  

See regex demo.

  • [a-z1-9]{4} matches a four-character word;
  • ([.:|]) matches and captures the delimiter;
  • (?:[a-z1-9]{4}\1){2} is a non-capturing group that matches the second and third words, each followed by \1, a backreference to the delimiter captured above;
  • [a-z1-9]{4} matches the last word.
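
The breakdown above can be tried out with a minimal Python sketch. The names `PATTERN` and `is_consistent` are illustrative only, and the delimiter class is assumed here to be `[.|-]` so that it covers the question's sample data (the `[.:|]` class in the pattern above would not accept the hyphenated sample):

```python
import re

# Assumption: the allowed delimiters are '.', '|' and '-', as in the question's samples.
# The first delimiter is captured once; \1 then forces every later delimiter to equal it.
PATTERN = re.compile(r"[a-z1-9]{4}([.|-])(?:[a-z1-9]{4}\1){2}[a-z1-9]{4}")


def is_consistent(text: str) -> bool:
    """Return True if the whole string is four 4-character words joined by one repeated delimiter."""
    return PATTERN.fullmatch(text) is not None


samples = [
    "abcd.efgh.1234.4567",
    "abcd-efgh-1234-4567",
    "abcd|efgh|1234|4567",
    "abcd.efgh-1234|4567",  # mixed delimiters, should be rejected
]

for sample in samples:
    print(sample, is_consistent(sample))
```

Using `fullmatch` rather than `search` means the pattern needs no explicit `^` and `$` anchors to reject strings with extra leading or trailing text.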
4
On

Your regex could be \w+([.|-])\w+\1\d+\1\d+ (see: example 1).

It uses backreferences \1 to the first encountered separator ("|", "." or "-")

Test:

$ cat repeat.txt
abcd.efgh.1234.4567
abcd-efgh-1234-4567
abcd|efgh|1234|4567
abcd.efgh-1234|4567

Result:

$ grep -P '\w+([.|-])\w+\1\d+\1\d+' repeat.txt
abcd.efgh.1234.4567
abcd-efgh-1234-4567
abcd|efgh|1234|4567
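
One thing to keep in mind with the grep test above: the pattern is not anchored, so a line is printed as soon as any part of it matches. The following Python sketch illustrates the difference between an unanchored search and a whole-string match; the input line is a made-up example, not taken from repeat.txt:

```python
import re

pattern = re.compile(r"\w+([.|-])\w+\1\d+\1\d+")

# Hypothetical input: a valid prefix followed by a different delimiter.
line = "abcd.efgh.1234.4567-9999"

# search() behaves like the grep call: a matching prefix is enough.
print(bool(pattern.search(line)))

# fullmatch() requires the entire string to fit the pattern.
print(bool(pattern.fullmatch(line)))
```

With grep, a comparable effect can be had by wrapping the pattern in `^` and `$`.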

Or, more generic:

$ grep -P '\w+(\W)\w+(\1\w+)+' repeat.txt
abcd.efgh.1234.4567
abcd-efgh-1234-4567
abcd|efgh|1234|4567

See: example 2. One caveat with the last pattern: a repeated capturing group such as (\1\w+)+ retains only its final iteration, so group 2 holds just the last delimiter-plus-word pair rather than every repeat.
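
To illustrate that behaviour of repeated groups, here is a small Python sketch. The variable names are illustrative, and splitting on the captured delimiter is just one possible way to recover all the parts, not something the answer itself prescribes:

```python
import re

generic = re.compile(r"\w+(\W)\w+(\1\w+)+")

m = generic.fullmatch("abcd.efgh.1234.4567")
assert m is not None

print(m.group(1))  # the captured delimiter: '.'
print(m.group(2))  # only the final repeat survives: '.4567'

# Since the delimiter is known once the match succeeds, the individual
# words can be recovered by splitting the matched text on it.
parts = m.group(0).split(m.group(1))
print(parts)  # ['abcd', 'efgh', '1234', '4567']
```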