Output function for Aho-Corasick algorithm

353 Views Asked by At

I have a problem with implementation of output function for Aho-Corasick algorithm. In general I don't quite understand how output function works. According to this paper in goto function I put appropriate pattern index to output, like output[currentState] = patternIndex; in fail function I merge existed output for state S with output for failed state S, like output[s] += output[fail[s]]; in search function I use such condition: if (output[state]) != 0) then we find our pattern. But such approach doesn't work and I've got nonsenses result. Maybe, pseudocode for output function means something else in that paper, I don't know.

Output function with bitwise mapping which I've got here also doesn't work correctly in most cases. Wrong patterns match with this condition if ((output[currentState] & (1 << j)) != 0) Besides, I don't quite understand why bitwise operations are using for output calculation.

I will be grateful for any help in clarification of implementation of the output function.

EDIT1: seems, that problem was in overflowing after bit shift operation: output[currentState] |= (1 << i);and out[currentState] & (1 << j) So far, tried to use BigInteger, but seems, that it causes a problem with performance.

EDIT2: tried BitSet, but performance still very poor. Maybe, there is a problem in implementation of algorithm, i don't know.

1

There are 1 best solutions below

0
Jason On

The 32 bits bitwise operation just like this below

pattern: he, she, his, hers

input: shethehishetheehershehehishe

The total state number is 9, and 2, 5, 7, 9 state is final state like below AC automation

Output function represented which a pattern you matched in current state. like :

out[2] = he, out[5] = she/he, out[7] = his, out[9] = hers

The code "out[currentState] |= (1 << i)" just like below 

out[2] = 0000 0000 0000 0000 0000 0010 <-- means number 1(he) pattern matched 

out[5] = 0000 0000 0000 0000 0000 0110 <-- means number 1(he) and number 2(she) matched

so use output[currentState] & (1 << j) to calculate which pattern is matched.

This operation just like one-hot encoding. And the operation problem is type of output function. If the output function type is 32 bits, and we can just search at most 31 patterns. Because over the 31 patterns, it will overflow by shifting.