I am having a perl script that has 2 arrays, 1 with keys and 1 with substring. I need to check if substring of 1 array have matches in the keys array. The amount of records is huge, something that can be counted in millions so I use Inline:C to speed up the search, however it is still taking hours to treat the records.
--Perl part
//%h contains {"AAAAA1" => 1, "BBBBBB" => 1, "BB1234" =>1, "C12345" => 1.... }
my @k=sort keys %h;
//@k contains ["AAAAA1", "BBBBBB", "BB1234", "C12345".... ]
my @nn;
//@n contains [ "AAAAA1999", "AAAAABBB134", "D123edae", "C12345CCSAER"]
// "AAAAA1" (from @k) can be found in "AAAAA1999" (in @n) = OK
foreach(@n) {
my $res=array_search(\@k,$_);
if($res) {
$y++;
} else {
$z++;
push @nn,$_;
}
}
--C part
int fastcmp ( char *p1, char *p2 ) {
while( *p1 ){
char *a = p1, *b = p2;
if (*b != *a) return 0;
++p1; ++b;
}
return 1;
}
int array_search(AV *a1, SV *s1){
STRLEN bytes1;
char *p1,*p2,*n;
long a1_size,i,c;
a1_size = av_len(a1);
p1 = SvPV(s1,bytes1);
for(i=start;i<=a1_size;++i){
SV** elem = av_fetch(a1, i, 0);
SV** elem_next = (i<a1_size-1)?av_fetch(a1, i+1, 0):elem;
p2 = SvPV_nolen (*elem);
n = SvPV_nolen (*elem_next);
if (p1[0] == p2[0]) {
if (fastcmp(p1,p2)>0) {
return i;
}
}
if ((p1[0] == p2[0]) && (p2[0] != n[0])) { return -1; }
}
return -1;
}
If somebody could help to optimize the search, that could be nice. Thanks.
Note: added comments to help what is inside each variables.
The implementation you have fails in many ways:
@a=chr(0xE9); utf8::upgrade($x=$a[0]); array_search(\@a, $x);
"abc"=~/(.*)/; array_search(["abc"], $1);
array_search(["a\0b"], "a\0c");
It also incorrectly assumes the strings are null-ternminated, which can lead to a SEGFAULT when they aren't.
Your approach scans
@k
for each element of@n
, but if you build a trie (as the following code does), it can be scanned once.For example, if there are 1,000 Ns and 1,000 Hs, your solution does up to 1,000,000 comparisons and mine does 1,000.
Note that 5.10+ is needed for the regex optimisation of alternations into a trie. Regexp::List can be used on older versions.
A proper C implementation will be a little faster because you can do a trie search using a function that does just that rather than using the regex engine.