Extracting two strings from between two characters. Why doesn't my regex match and how can I improve it?


I'm learning about regular expressions and I want to extract a string from a text that has the following characteristics:

  • It always begins with the letter C, in either lowercase or uppercase, which is then followed by a number of hexadecimal characters (meaning it can contain the letters A to F and numbers from 1 to 9, with no zeros included).
  • After those hexadecimal characters comes a letter P, also either in lowercase or uppercase
  • And then some more hexadecimal characters (again, excluding 0).

In other words, I want to capture the string between the letters C and P and the string after the letter P, then concatenate them into a single string, discarding the letters C and P.

Examples of valid strings would be:

c45AFP2
CAPF
c56Bp26
CA6C22pAAA

For the above examples what I want would be to extract the following, in the same order:

45AF2     # Original string: c45AFP2
AF        # Original string: CAPF
56B26     # Original string: c56Bp26
A6C22AAA  # Original string: CA6C22pAAA

Examples of invalid strings would be:

BCA6C22pAAA  # It doesn't begin with C
c56Bp  # There aren't any characters after P
c45AF0P2  # Contains a zero

I'm using python and I want a regex to extract the two strings that come both in between the characters C and P as well as after P

So far I've come up with this:

(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*

A breakdown would be:

(?<=\A[cC]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [cC] and that [cC] must be at the beginning of the string

[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times

(?<=[pP]) Positive lookbehind assertion. Asserts that what comes before the regex parser’s current position must match [pP]

[a-fA-F1-9]* Matches a single character in the list between zero and unlimited times

But with the above regex I can't match any of the strings!

When I insert a | in between (?<=[cC])[a-fA-F1-9]* and (?<=[pP])[a-fA-F1-9]* it works.

Meaning the below regex works:

(?<=[cC])[a-fA-F1-9]*|(?<=[pP])[a-fA-F1-9]*

I know that | means that it should match at most one of the specified regex expressions. But it's non greedy and it returns the first match that it finds. The remaining expressions aren’t tested, right?

But using | means the string BCA6C22pAAA is a partial match to AAA since it comes after P, even though the first assertion isn't true, since it doesn't begin with a C.

That shouldn't be the case. I want it to only match if all conditions explained in the beginning are true.
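For reference, a minimal reproduction of the behaviour described above, assuming Python's built-in re module and re.findall to list every match (the variable names are only illustrative):

```python
import re

# The alternation pattern from above: each branch is tried independently,
# so either lookbehind on its own is enough to produce a match.
pattern = re.compile(r"(?<=[cC])[a-fA-F1-9]*|(?<=[pP])[a-fA-F1-9]*")

# This string should be rejected because it does not begin with C.
invalid = "BCA6C22pAAA"

# findall returns every non-overlapping match, so the part after "p"
# is reported even though the string as a whole is invalid.
matches = pattern.findall(invalid)
print(matches)
```

The second branch only checks that the previous character is p or P. It knows nothing about how the string started.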

Could someone explain to me why my first attempt doesn't produce the result I want? Also, how can I improve my regex?

I still need it to:

  • Not be a match if the string contains the number 0
  • Only be a match if ALL conditions are met

Thank you


There are 2 best solutions below

sseLtaH On BEST ANSWER

To match both groups, before and after P or p:

(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))
  • (?<=^[Cc]) - Positive Lookbehind. Must match a case insensitive C or c at the start of the line
  • [1-9a-fA-F]+ - Matches hexadecimal characters one or more times
  • (?=[Pp] - Positive Lookahead for case insensitive p or P
  • ([1-9a-fA-F]+$) - Capture group for one or more hexadecimal characters following the p or P, up to the end of the line
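
A minimal sketch of how this pattern could be used from Python, assuming the standard re module. The helper name extract_with_lookarounds is hypothetical. Note that the part between C and P is the overall match (group 0), while the part after P is captured inside the lookahead (group 1), so the two have to be joined by hand:

```python
import re

# Pattern from this answer, unchanged.
HEX_PAIR = re.compile(r"(?<=^[Cc])[1-9a-fA-F]+(?=[Pp]([1-9a-fA-F]+$))")


def extract_with_lookarounds(text):
    """Return the two hex parts joined together, or None if text is invalid."""
    match = HEX_PAIR.search(text)
    if match is None:
        return None
    # group(0): hex characters between C and P (the match itself).
    # group(1): hex characters after P (captured inside the lookahead).
    return match.group(0) + match.group(1)


for sample in ["c45AFP2", "CAPF", "c56Bp26", "CA6C22pAAA",
               "BCA6C22pAAA", "c56Bp", "c45AF0P2"]:
    print(sample, "->", extract_with_lookarounds(sample))
```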
Bohemian On

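Your main problem is that you're using a lookbehind (?<=[pP]) for something that lies ahead, which will never work: once [a-fA-F1-9]* has matched, the character just before the current position is a hex character (or the leading C), never p. You need a lookahead (?=...).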
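
A short check of that claim, assuming Python's re module (the variable names are illustrative). The original pattern is run against the four strings the question lists as valid:

```python
import re

# The first attempt from the question, unchanged.
first_attempt = re.compile(r"(?<=\A[cC])[a-fA-F1-9]*(?<=[pP])[a-fA-F1-9]*")

valid_samples = ["c45AFP2", "CAPF", "c56Bp26", "CA6C22pAAA"]

# After [a-fA-F1-9]* has consumed zero or more hex characters, the previous
# character is either the leading C or a hex character. Since p/P is not in
# the character class, the lookbehind (?<=[pP]) cannot succeed there.
results = [first_attempt.search(s) for s in valid_samples]
print(results)
```

Every entry in results should be None, which matches the "can't match any of the strings" symptom.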

Also, the final quantifier should be + not * because you require at least one trailing character after the p.

The final mistake is that you're not capturing anything, you're only matching, so put what you want to capture inside brackets, which also means you can remove all look arounds.

If you use the case insensitive flag, it makes the regex much smaller and easier to read.

A working regex that captures the 2 hex parts in groups 1 and 2 is:

(?i)^c([a-f1-9]*)p([a-f1-9]+)

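
A minimal sketch of using the capture-group regex from Python, assuming the standard re module. The helper name extract_with_groups is hypothetical. The use of fullmatch is also an assumption on my part: the pattern has no trailing $, so fullmatch is one way to reject strings with extra characters (such as a 0) after the second hex part:

```python
import re

# Pattern from this answer, unchanged; (?i) makes it case insensitive.
HEX_PARTS = re.compile(r"(?i)^c([a-f1-9]*)p([a-f1-9]+)")


def extract_with_groups(text):
    """Return groups 1 and 2 concatenated, or None if text is invalid."""
    # fullmatch requires the whole string to match, so trailing characters
    # outside [a-f1-9] (for example "0") cause a rejection.
    match = HEX_PARTS.fullmatch(text)
    if match is None:
        return None
    return "".join(match.groups())


for sample in ["c45AFP2", "CAPF", "c56Bp26", "CA6C22pAAA",
               "BCA6C22pAAA", "c56Bp", "c45AF0P2"]:
    print(sample, "->", extract_with_groups(sample))
```

With re.search or re.match instead of fullmatch, a string such as c45AFP20 would still match, with the trailing 0 left outside the match.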

Unless you need \A, prefer ^ over \A. ^ is easier to read, and in multi-line mode it matches at the start of every line, which is what many situations and tools expect, whereas \A only ever matches at the very start of the whole input. I've used ^.
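
To illustrate the difference, here is a small sketch assuming Python's re module with the re.MULTILINE flag; the two-line input is made up for the example:

```python
import re

# Two candidate strings, one per line (illustrative input).
text = "c45AFP2\nCAPF"
flags = re.MULTILINE | re.IGNORECASE

# In multi-line mode ^ matches at the start of every line.
caret_hits = re.findall(r"^c", text, flags=flags)

# \A matches only at the very start of the input, regardless of flags.
slash_a_hits = re.findall(r"\Ac", text, flags=flags)

print(len(caret_hits), len(slash_a_hits))
```

Without re.MULTILINE, both anchors behave the same in Python and match only at the start of the string.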