I'm trying to figure out how to binary search in suffix array for one occurence of pattern.
Let's have a text: petertomasjohnerrnoerror.
I try to find er.
SA is a suffix array of this text: 8,14,19,3,1,12,10,7,13,17,18,11,6,22,0,23,16,21,15,20,4,9,2,5
Now, I want to find any index of suffix array, which value pointing at one 'er'. So the output would be index in SA pointing at 3,14 or 19 so it would return 1,2 or 3
I'm trying to use a binary search but I can't figure out how.
def findOneOccurence(text,SA,p):
high = len(text)-1 # The last index
low = 0 # the lowest index
while True:
check = (high-low)/2 # find a middle
if p in text[SA[check]:SA[check]+len(p)]:
return check
else:
if text[SA[check]:SA[check]+len(p)]<p:
low = check
else:
high = check
if high<=low:
return None
This returns 11. But text[SA[11]:SA[11]+2] is 'oh' instad of 'er'.
Where could be the problem?
This function would work on huge texts about millions of chars.
EDIT: I've found a mistake. Instead of if text[SA[check]:SA[check+len(p)]]<p: should be text[SA[check]:SA[check]+len(p)]<p: but it's still wrong. It returns None instead of 'er'
EDIT II: Another mistake: if high>=low changed to high<=low, now, it returns 2 which is good.
EDIT III: Now it works, but on some inputs it gets to the loop and never ends.
Borrowing and editing https://hg.python.org/cpython/file/2.7/Lib/bisect.py