I have an array of strings of about 100,000 elements. I need to iterate through each element and replace some words with other words. This takes a few seconds in pure perl. I need to speed this up as much as I can. I'm testing using the following snippet:
use strict;
my $string = "This is some string. Its only purpose is for testing.";
for( my $i = 1; $i < 100000; $i++ ) {
$string =~ s/old1/new1/ig;
$string =~ s/old2/new2/ig;
$string =~ s/old3/new3/ig;
$string =~ s/old4/new4/ig;
$string =~ s/old5/new5/ig;
}
I know this doesn't actually replace anything in the test string, but it's for speed testing only.
I had my hopes set on Inline::C. I've never worked with Inline::C
before but after reading up on it a bit, I thought it was fairly simple to implement. But apparently, even calling a stub function that does nothing is a lot slower. Here's the snippet I tested with:
use strict;
use Benchmark qw ( timethese );
use Inline 'C';
timethese(
5,
{
"Pure Perl" => \&pure_perl,
"Inline C" => \&inline_c
}
);
sub pure_perl {
my $string = "This is some string. Its only purpose is for testing.";
for( my $i = 1; $i < 1000000; $i++ ) {
$string =~ s/old1/new1/ig;
$string =~ s/old2/new2/ig;
$string =~ s/old3/new3/ig;
$string =~ s/old4/new4/ig;
$string =~ s/old5/new5/ig;
}
}
sub inline_c {
my $string = "This is some string. Its only purpose is for testing.";
for( my $i = 1; $i < 1000000; $i++ ) {
$string = findreplace( $string, "old1", "new1" );
$string = findreplace( $string, "old2", "new2" );
$string = findreplace( $string, "old3", "new3" );
$string = findreplace( $string, "old4", "new4" );
$string = findreplace( $string, "old5", "new5" );
}
}
__DATA__
__C__
char *
findreplace( char *text, char *what, char *with ) {
return text;
}
on my Linux box, the result is:
Benchmark: timing 5 iterations of Inline C, Pure Perl...
Inline C: 6 wallclock secs ( 5.51 usr + 0.02 sys = 5.53 CPU) @ 0.90/s (n=5)
Pure Perl: 2 wallclock secs ( 2.51 usr + 0.00 sys = 2.51 CPU) @ 1.99/s (n=5)
Pure Perl is twice as fast as calling an empty C function. Not at all what I expected! Again, I've never worked with Inline::C before so maybe I am missing something here?
In the version using
Inline::C
, you kept everything that was in the original pure Perl script, and changed just one thing: Additionally, you've replaced Perl's highly optimizeds///
with a worse implementation. Invoking your dummy function actually involves work whereas none of thes///
invocations do much in this case. It is a priori impossible for theInline::C
version to run faster.On the C side, the function
is not a "do nothing" function. Calling it involves unpacking arguments. The string pointed to by
text
has to be copied to the return value. There is some overhead which you are paying for each invocation.Given that
s///
does no replacements, there is no copying involved in that. In addition, Perl'ss///
is highly optimized. Are you sure you can write a better find & replace that is faster to make up for the overhead of calling an external function?If you use the following implementation, you should get comparable speeds:
The one possibility of gaining speed is to exploit any special structure involved in the search pattern and replacements and write something to implement that.
On the Perl side, you should at least pre-compile the patterns.
Also, since your problem is embarrassingly parallel, you are better off looking into chopping up the work into as many chunks as you have cores to work with.
For example, take a look at the Perl entries in the regex-redux task in the Benchmarks Game:
Perl #4 (fork only): 14.13 seconds
and
Perl #3 (fork & threads): 14.47 seconds
versus
Perl #1: 34.01 seconds
That is, some primitive exploitation of parallelization possibilities results in a 60% speedup. That problem is not exactly comparable because the substitutions must be done sequentially, but still gives you an idea.
If you have eight cores, dole out the work to eight cores.
Also, consider the following script:
In this case, each invocation of
find_and_replace
takes about 2.3 seconds my laptop. The five replications run in about 30 seconds. The overhead is the combined cost of generating the 1,000,000 sentence data set and copying it four times.