Regex capture groups (lookbehind?)


I have a string of 10 or more characters ([0-9a-zA-Z]), e.g.: abcdefghij12345

I want to capture the following characters in groups:

  • Group 1: Character position "1 and 2": ab
  • Group 2: Character position "3 and 4": cd
  • Group 3: Character position "5 - 10": efghij
  • Group 4: Character position "6 - Last position of string": fghij12345

Groups 1-3 work, but how can I get positions 6 through the last position of the string into group 4?

What I have so far:

r'^([0-9a-zA-Z]{2})([0-9a-zA-Z]{2})([0-9a-zA-Z]{6})'

I'd like to get all four groups with a single regular expression. How can I extend my expression to also capture group 4?

Edit: Additionally, the following regex is needed for a string of 72 or more characters.

I want to capture the following characters in groups:

  • Group 1: Character position "1 and 2"

  • Group 2: Character position "3 and 4"

  • Group 3: Character position "5 and 6" ...

  • Group 16: Character position "31 and 32"

  • Group 17: Character position "33 - 40"

  • Group 18: Character position "41 and 42"

  • Group 19: Character position "43 - 50"

  • Group 20: Character position "12 - Last position of string"

String (72 char): 294592522929354526532268626626426854242342362676256672666267626726672667

r'^([\da-zA-Z]{2})([\da-zA-Z]{2})([\da-zA-Z]{2})([\da-zA-Z]{2})([\da-zA-Z]{2})([\da-zA-Z]{2})([\da-zA-Z]{2})([\da-zA-Z]{2})([\da-zA-Z]{2})([\da-zA-Z]{2})([\da-zA-Z]{2})([\da-zA-Z]{2})([\da-zA-Z]{2})([\da-zA-Z]{2})([\da-zA-Z]{2})([\da-zA-Z]{2})([\da-zA-Z]{8})([\da-zA-Z]{2})([\da-zA-Z]{8})'

There are 2 best solutions below

6
Ted Lyngmo On

You could use a positive lookahead:

^([\da-zA-Z]{2})([\da-zA-Z]{2})(?=([\da-zA-Z]{6})).([\da-zA-Z].*)$
  • ^ - start of line anchor
  • ([\da-zA-Z]{2}) - first capture group, pos 1-2
  • ([\da-zA-Z]{2}) - second capture group, pos 3-4
  • (?=([\da-zA-Z]{6})) - positive lookahead, third capture, pos 5-10
  • .([\da-zA-Z].*) - discard one character (pos 5) and capture the rest as the fourth capture, pos 6-end
  • $ - end of line anchor
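
For reference, here is a minimal sketch of how the pattern above could be used from Python's re module, with the sample string from the question. The name PATTERN is just a placeholder for illustration.

```python
import re

# Pattern from the answer above. The lookahead captures positions 5-10
# without consuming them, so the fourth group can start again at position 6.
PATTERN = re.compile(
    r'^([\da-zA-Z]{2})([\da-zA-Z]{2})(?=([\da-zA-Z]{6})).([\da-zA-Z].*)$'
)

m = PATTERN.match("abcdefghij12345")
if m:
    print(m.groups())  # ('ab', 'cd', 'efghij', 'fghij12345')
```

Because the lookahead is zero-width, groups 3 and 4 can overlap, which a plain sequence of consuming groups cannot do.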


2
Timeless On

Since it's an index/position issue, why not just use plain slicing with a generator expression?

S = "abcdefghij12345"

g1, g2, g3, g4 = (S[i:j] for i, j in [(0, 2), (2, 4), (4, 10), (5, None)])

print(g1, g2, g3, g4, sep="\n")

Output:

ab          # <- group1 
cd          # <- group2
efghij      # <- group3
fghij12345  # <- group4
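
The same slicing idea should extend to the 72-character case from the question's edit. Below is a rough sketch, assuming the positions in the question are 1-based and inclusive. Group 19 is assumed to cover positions 43-50, which is what the final {8} in the question's pattern would match. SPANS and slice_groups are hypothetical names used only for this example.

```python
# 1-based, inclusive (start, end) positions; end=None means "to the end of the string".
# Groups 1-16: two characters each, covering positions 1-32.
SPANS = [(i, i + 1) for i in range(1, 32, 2)]
SPANS += [
    (33, 40),    # group 17
    (41, 42),    # group 18
    (43, 50),    # group 19 (assumption: follows the question's pattern)
    (12, None),  # group 20: position 12 to the end of the string
]


def slice_groups(s, spans):
    """Return the substring for each 1-based inclusive (start, end) span."""
    # A 1-based inclusive end equals a 0-based exclusive end, so only start shifts.
    return [s[start - 1:end] for start, end in spans]


S = "294592522929354526532268626626426854242342362676256672666267626726672667"
groups = slice_groups(S, SPANS)

print(len(groups))  # 20
print(groups[0])    # first two characters
print(groups[19])   # everything from position 12 onward
```

Unlike the regex, this does not check that the characters are in [0-9a-zA-Z] or that the string is long enough; a separate check such as str.isalnum() plus a length test would be needed if that matters.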