Find one occurence of substring using suffix array

789 Views Asked by At

I'm trying to figure out how to binary search in suffix array for one occurence of pattern. Let's have a text: petertomasjohnerrnoerror. I try to find er.

SA is a suffix array of this text: 8,14,19,3,1,12,10,7,13,17,18,11,6,22,0,23,16,21,15,20,4,9,2,5

Now, I want to find any index of suffix array, which value pointing at one 'er'. So the output would be index in SA pointing at 3,14 or 19 so it would return 1,2 or 3

I'm trying to use a binary search but I can't figure out how.

def findOneOccurence(text,SA,p):
    high = len(text)-1           # The last index
    low = 0                      # the lowest index
    while True:
        check = (high-low)/2     # find a middle

        if p in text[SA[check]:SA[check]+len(p)]:
            return check
        else:
            if text[SA[check]:SA[check]+len(p)]<p:
                low = check
            else:
                high = check
        if high<=low:
            return None

This returns 11. But text[SA[11]:SA[11]+2] is 'oh' instad of 'er'. Where could be the problem?

This function would work on huge texts about millions of chars.

EDIT: I've found a mistake. Instead of if text[SA[check]:SA[check+len(p)]]<p: should be text[SA[check]:SA[check]+len(p)]<p: but it's still wrong. It returns None instead of 'er'

EDIT II: Another mistake: if high>=low changed to high<=low, now, it returns 2 which is good.

EDIT III: Now it works, but on some inputs it gets to the loop and never ends.

1

There are 1 best solutions below

0
On

Borrowing and editing https://hg.python.org/cpython/file/2.7/Lib/bisect.py

>>> text= 'petertomasjohnerrnoerror'
>>> SA = 8,14,19,3,1,12,10,7,13,17,18,11,6,22,0,23,16,21,15,20,4,9,2,5
>>> def bisect_left(a, x, text, lo=0, hi=None):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi)//2
        if text[a[mid]:] < x: lo = mid+1
        else: hi = mid
    if not text[a[lo]:].startswith(x): 
        # i suppose text[a[lo]:a[lo]+len(x)] == x could be a faster check
        raise IndexError('not found')
    return a[lo]

>>> bisect_left(SA, 'er', text)
14