After reading polygenelubricants's series of articles on advanced regular expressions techniques (particularly How does this Java regex detect palindromes?), I decided to attempt to create my own PCRE regex to parse a palindrome, using recursion (in PHP).
What I came up with was:
^(([a-z])(?1)\2|[a-z]?)$
My understanding of this expression is that it should either match zero or one characters (every string of less than 2 characters is implicitly a palindrome, as well as to account for palindromes with odd lengths in the recursion), or two of the same character separated by a recursion of the pattern.
Unfortunately, it does not seem to be working that way, as you can see at www.ideone.com/a9T3F. Instead, only the strings of 2n - 1 (ie. empty string, a
, aaa
, aaaaaaa
, a15) repeated characters match the regular expression.
Oddly, if I modify my pattern so that the recursion is optional (ie. ^(([a-z])(?1)?\2|[a-z]?)$
, see www.ideone.com/D6lJR, it only matches strings with a character repeated 2n times (ie. empty string, a
, aa
, aaaa
, aaaaaaaa
, a16).
Why is my regex not working the way I expect it to?
Note for the people who are itching to suggest not to use regex:
The point of this question is to learn how to use recursive regular expressions properly. I know that this is not an effective way to determine if a string is a palindrome, and I wouldn't use a recursive regex if I for some reason had to determine palindromes in production code; I am just interested in learning more about the advanced aspects of regex.
The phenomenon you're observing is due to the fact that PCRE subpattern recursion is atomic, unlike Perl. The man page actually covers this problem in great detail:
Additional references
(?>…)
in some flavor is atomic grouping syntax(?=…)
,(?!…)
,(?<=…)
,(?<!…)
, are all atomica*+
) is also atomicA closer look at the pattern
The atomicity argument is correct, but perhaps it's not obvious how it explains why the pattern behaves as observed. Let's take a closer look and see how this all fits:
We will use the first pattern:
I will use the following notation to denote the recursion:
1
means the character was captured into group 2 in the first alternate2
means the character was matched by the second alternate2
is not above a character, the zero repetition option of?
is exercised\
means the character was matched by the backreference to group 2 in first alternate_
denotes the bottom of a recursive branchNow let's consider
"aaa"
as input:Now consider
"aaaaa"
:Note that once a recursion level matches on the first alternative, the second alternative will not be attempted in the future (even if doing so may result in a may match), because PCRE subpattern recursion is atomic.
Now consider
"aa"
:Note that once a recursion level matches on the one repetition of the
?
on the second alternative, the zero repetition option will not be attempted in the future (even if doing so may result in a may match), because PCRE subpattern recursion is atomic.Now let's consider
aaaaaaa
Note that even though PCRE subpattern recursion is atomic, it can still successfully match a palindrome consisting of a character repeating 2n-1 times.
Now, just for fun, let's try
"abcba"
:That is, the pattern doesn't just match "only when a character repeats 2n-1 times". It can indeed match
"abcba"
(as seen on ideone.com). It can NOT, however, match"ababa"
, nor can it match"aaaaa"
(see the WARNING on the man page!), because subpattern recursion in PCRE is atomic.You can apply this same tracing process to explain the behavior of the pattern on any input.