In Perl, is there a limit to the number of capture groups in a Regular Expression?


Is there a limit to the number of capture groups in a regular expression? I used to think it was 9 ($1 ... $9), but haven't found anything in the perlre docs to confirm this. And in fact, the following code shows that there are at least 26.

#!/usr/local/bin/perl

use strict;
use warnings;

my $line = " a b c d e f g h i j k l m n o p q r s t u v w x y z ";

my $lp = "(\\w) ";
my $pat = "";
for (my $i=0; $i<26; $i++)
{
   $pat = $pat . $lp;
}

$line =~ /$pat/;
print "$1 $2 $3 $24 $25 $26\n";

Note that the question "How many captured groups are supported by pcre2 substitute function" refers only to the PCRE2 C library. I'm asking about Perl.

ysth On BEST ANSWER

https://perldoc.perl.org/perlre says:

There is no limit to the number of captured substrings that you may use.
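
As a small illustration of that statement, groups past the ninth are read the same way as the first nine. This is only a sketch and not taken from perlre: the sample string and the variable names are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sample data: 12 space-separated letters.
my $line = join ' ', 'a' .. 'l';

# Build a pattern with 12 capture groups, separated by single spaces.
my $pat = join ' ', ('(\w)') x 12;

# In list context, a successful match returns all captured substrings.
my @caps = $line =~ /\A$pat\z/;

print scalar(@caps), " captures\n";
print "tenth: $10, twelfth: ${12}\n";   # numbered variables work past $9
print "last via list: $caps[-1]\n";
```

Collecting the captures into an array is usually easier to work with than numbered variables once the group count gets large.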

Kjetil S. On

Why not just test it? Here is a regexp with 20 million captures, which ought to be enough for anybody. It makes me think memory is the limit here. This took 25 seconds on my old laptop with Perl v5.30:

my $n = 20_000_000;                 # 20 million
my $re = join"", map "(.)", 1..$n;  # create regexp with 20 million captures
my $str = "ABC" x $n;               # create a more than long enough string
$str =~ /$re/;                      # match & capture
print $19999987, "\n";              # print the "A" in capture var number 19999987
print ${^CAPTURE}[19999987-1],"\n"; # same
print "Length: ".@{^CAPTURE}."\n";  # prints 20000000, length of array
brian d foy On

You can just try it! Even if there is no built-in limit, there's probably a practical one.

Let's try it on my M1 Mac Mini with Perl v5.36.

Here's a little program that takes the number of captures I want, then builds a string long enough to match and a pattern with that many captures (check out the use of the v5.36 builtin::ceil):

#!perl

use v5.36;
use experimental qw(builtin);
use builtin qw(ceil);

my $n = shift;
say "N is $n";

my $alpha = join '', 'a' .. 'z';
my $multiple = ceil($n / 26);
my $text = $alpha x ($multiple + 1);

my $n_mod_26 = $n % 26;
my $expected_letter = substr $alpha, $n_mod_26 - 1, 1;

my $pattern_text = '(.)' x $n;
my $pattern = qr/$pattern_text/;

my $result = $text =~ $pattern;
say $result ? "Matched" : 'Did not match';

# Capture variables are named by number, so look up $N through a symbolic reference.
my $matched = do { no strict 'refs'; ${"$n"} };
print "Matched <$matched>; expected <$expected_letter>\n";

When I run this with increasing counts, the process is eventually killed:

brian@M1-Mini Desktop % for i in 1 3 7 50 500 5000 70000 900000 3000000 40000000 1234567890; do echo '----' && time perl test.pl $i; done
----
N is 1
Matched
Matched <a>; expected <a>
perl test.pl $i  0.02s user 0.01s system 67% cpu 0.047 total
----
N is 3
Matched
Matched <c>; expected <c>
perl test.pl $i  0.01s user 0.00s system 91% cpu 0.014 total
----
N is 7
Matched
Matched <g>; expected <g>
perl test.pl $i  0.01s user 0.00s system 92% cpu 0.011 total
----
N is 50
Matched
Matched <x>; expected <x>
perl test.pl $i  0.01s user 0.00s system 92% cpu 0.010 total
----
N is 500
Matched
Matched <f>; expected <f>
perl test.pl $i  0.01s user 0.00s system 92% cpu 0.008 total
----
N is 5000
Matched
Matched <h>; expected <h>
perl test.pl $i  0.01s user 0.00s system 93% cpu 0.008 total
----
N is 70000
Matched
Matched <h>; expected <h>
perl test.pl $i  0.02s user 0.00s system 97% cpu 0.022 total
----
N is 900000
Matched
Matched <j>; expected <j>
perl test.pl $i  0.20s user 0.02s system 97% cpu 0.229 total
----
N is 3000000
Matched
Matched <p>; expected <p>
perl test.pl $i  0.69s user 0.06s system 95% cpu 0.786 total
----
N is 40000000
Matched
Matched <n>; expected <n>
perl test.pl $i  9.32s user 1.08s system 91% cpu 11.402 total
----
N is 1234567890
zsh: killed     perl test.pl $i
perl test.pl $i  127.80s user 6.17s system 83% cpu 2:39.69 total

My machine gives up with 1,234,567,890 groups. That might have nothing to do with the number of groups; maybe something else in perl decided it was unhappy, or maybe the program went past some process resource limit. Your own machine may give up at a different point (or not give up at all). I have no idea what killed it, and I don't really care because even if I knew, I'm not going to do anything to fix that.
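
For readers who do want to know how a run ended, one option is to launch it from Perl and decode `$?`. This is only a sketch: the `run_and_report` helper is a hypothetical name, and it relies on the documented layout of `$?` (exit status in the high byte, signal number in the low seven bits).

```perl
use strict;
use warnings;

# Hypothetical helper: run a command (without a shell) and describe how it ended.
sub run_and_report {
    my @cmd = @_;
    system(@cmd);
    return "failed to start: $!" if $? == -1;
    my $signal = $? & 127;               # non-zero if a signal ended the process
    return "killed by signal $signal" if $signal;
    return "exited with status " . ($? >> 8);
}

# Example: a child perl that exits normally with status 3.
print run_and_report($^X, '-e', 'exit 3'), "\n";
```

Something like `run_and_report($^X, 'test.pl', 1234567890)` would then report the signal number instead of leaving it to the shell's message. It still would not say who sent the signal or why.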

But can I find the maximum number? On my machine it's somewhere around 389,000,000 captures. It's not a fixed number that I can consistently predict, and it probably depends on other, unrelated things going on at the same time.
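
One way such a search could be automated is a bisection between a count known to work and a count known to fail, with each probe run in its own perl process so that a killed probe does not take the search down with it. This is a sketch under assumptions, not necessarily how the number above was found: `probe_ok` and `find_limit` are hypothetical names, and bisection assumes the pass/fail boundary is stable, which it may well not be.

```perl
use strict;
use warnings;

# Run one probe in a separate perl process.
# True only if the child exits normally with status 0 (i.e. the match succeeded).
sub probe_ok {
    my ($n) = @_;
    my $code = q{my $n = shift; my $re = '(.)' x $n; exit( ('a' x $n) =~ /$re/ ? 0 : 1 )};
    my $status = system($^X, '-e', $code, $n);
    return $status == 0;
}

# Bisect between a count known to pass ($good) and one known to fail ($bad).
# $probe is a code reference that returns true when a count works.
sub find_limit {
    my ($good, $bad, $probe) = @_;
    while ($bad - $good > 1) {
        my $mid = $good + int(($bad - $good) / 2);
        if ($probe->($mid)) { $good = $mid } else { $bad = $mid }
    }
    return $good;
}

# Cheap demonstration with a stand-in predicate instead of real probes.
print find_limit(0, 1000, sub { $_[0] <= 389 }), "\n";
```

A real search would be `find_limit(40_000_000, 1_234_567_890, \&probe_ok)`, using the last passing and first failing counts from the run above. Expect it to take a long time, since it needs roughly 30 probes and the large ones run for minutes each.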