Regex capturing the first occurrence of every group in a recurring pattern


Suppose I have the following text:

Name: John Doe\tAddress: Street 123 ABC\tCity: MyCity

I have a regex (a bit more complex, but it boils down to this):

^(?:(?:(?:Name: (.+?))|(?:Address: (.+?))|(?:City: (.+?)))\t*)+$

which has three capturing groups that capture the values of Name, Address and City (if they occur in the text). A few more examples are here: https://regex101.com/r/37nemH/6. EDIT: The ordering is not fixed beforehand, and the fields are not necessarily separated by \t characters.

This all works well. The only problem arises when one field occurs twice in the same text, as can be seen in the last example I put on regex101:

Name: John Doe\tAddress: Street 123 ABC\tCity: MyCity\tAddress: Other Address
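To make the problem concrete, here is a minimal sketch of what happens with that input. It assumes Python's built-in `re` module and real tab characters as separators (the question does not say which engine is in use); `capture_fields` is a hypothetical helper name. A capturing group inside a repeated group keeps the value from its last iteration, so the second address overwrites the first.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(
    r"^(?:(?:(?:Name: (.+?))|(?:Address: (.+?))|(?:City: (.+?)))\t*)+$"
)

def capture_fields(text):
    """Return (name, address, city) as captured by the original pattern."""
    m = PATTERN.match(text)
    return m.groups() if m else None

text = "Name: John Doe\tAddress: Street 123 ABC\tCity: MyCity\tAddress: Other Address"
# Group 2 ends up holding the LAST address, not the first one.
print(capture_fields(text))
```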

What I would want is for the second capturing group to match the first address, i.e. Street 123 ABC, and preferably to let the second occurrence be matched within the "City" group, i.e.

1: John Doe
2: Street 123 ABC
3: MyCity\tAddress: Other Address

Conceptually, I tried doing this with a negative lookbehind, e.g. replacing (?:Address: (.+?)) with (?:(?<!.*Address: )Address: (.+?)), i.e. ensuring that an Address: match is not preceded anywhere in the text by another Address: tag. But negative lookbehind does not allow patterns of arbitrary length, so this does not work.
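The lookbehind limitation can be checked directly. This is a small sketch assuming Python's built-in `re` module, which rejects variable-width lookbehinds at compile time; other engines differ, and some do accept them. `compiles` is a hypothetical helper name.

```python
import re

def compiles(pattern):
    """Return True if the pattern compiles, False if the engine rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Fixed-width lookbehind: accepted.
print(compiles(r"(?<!Address: )Address: (.+?)"))
# Variable-width lookbehind (contains .*): rejected by Python's re.
print(compiles(r"(?<!.*Address: )Address: (.+?)"))
```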

Can this be achieved using regex, and how?

There are 2 best solutions below

On BEST ANSWER

If the fields can appear in any order and some or all of them can be missing, it is much easier to use 3 separate patterns to extract the bits you need.

Name:

^.*?Name:\s*(.*?)(?=\s*(?:Name:|Address:|City:|$))

City:

^.*?City:\s*(.*?)(?=\s*(?:Name:|Address:|City:|$))

Address:

^.*?Address:\s*(.*?)(?=\s*(?:Name:|Address:|City:|$))

Details

  • ^ - start of string
  • .*? - any 0+ chars other than line break chars, as few as possible
  • Address: - a keyword to stop at and look for the expected match
  • \s* - 0+ whitespaces
  • (.*?) - Group 1: any 0+ chars other than line break chars, as few as possible...
  • (?=\s*(?:Name:|Address:|City:|$)) - up to but excluding 0 or more whitespaces followed by Name:, Address:, City: or the end of the string.
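The three patterns above can be applied in a loop. The following is a minimal sketch assuming Python's built-in `re` module; `LABELS`, `STOP` and `extract_fields` are hypothetical names introduced for illustration. Each pattern is anchored at the start and uses a lazy prefix, so it picks up the first occurrence of its label.

```python
import re

LABELS = ("Name", "Address", "City")
# Alternation of all labels, used as the stop condition in the lookahead.
STOP = "|".join(f"{label}:" for label in LABELS)

def extract_fields(text):
    """Return a dict with the first value found for each label (or None)."""
    result = {}
    for label in LABELS:
        # Same shape as the patterns above, built once per label.
        pattern = rf"^.*?{label}:\s*(.*?)(?=\s*(?:{STOP}|$))"
        m = re.search(pattern, text)
        result[label] = m.group(1) if m else None
    return result

text = "Name: John Doe\tAddress: Street 123 ABC\tCity: MyCity\tAddress: Other Address"
print(extract_fields(text))
```

Note that with this approach City stops at the next label, so the second address is simply ignored rather than folded into the City value.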
2
On

For your stated problem, you may use this regex with a conditional construct:

^.*?(?:(?:Name: (.+?)|(Address: )(.+?)|City: ((?(2).*?Address: )*.+?))\t*)+$

Your values are available in capture groups 1, 3 and 4.

Capture group 2 holds the literal label "Address: ".

Here, (?(2).*?Address: )* is a conditional construct: if capture group 2 has already matched, group 4 also consumes text up to and including the next Address: label (zero or more times) before matching the rest of the value.

For the text Name: John Doe Address: Street 123 ABC City: MyCity Address: Second address, it produces the following matches:

Group 1.    169-177 `John Doe`
Group 2.    178-187 `Address: `
Group 3.    187-201 `Street 123 ABC`
Group 4.    210-240 `MyCity Address: Second address`
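The pattern can also be tried outside the demo. This is a sketch assuming Python's built-in `re` module, which supports the `(?(2)...)` conditional syntax, and assuming the fields are separated by real tab characters; `extract` is a hypothetical helper name.

```python
import re

# The conditional pattern from this answer, unchanged.
PATTERN = re.compile(
    r"^.*?(?:(?:Name: (.+?)|(Address: )(.+?)|City: ((?(2).*?Address: )*.+?))\t*)+$"
)

def extract(text):
    """Map the value groups (1, 3, 4) to their labels, or return None."""
    m = PATTERN.match(text)
    if m is None:
        return None
    return {"Name": m.group(1), "Address": m.group(3), "City": m.group(4)}

text = "Name: John Doe\tAddress: Street 123 ABC\tCity: MyCity\tAddress: Second address"
print(extract(text))
```

Because the conditional tests group 2, City only absorbs a later Address: when an Address: field has already been matched before City.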