I'm trying to figure out how to binary search in suffix array for one occurence of pattern.
Let's have a text: petertomasjohnerrnoerror
.
I try to find er
.
SA
is a suffix array of this text: 8,14,19,3,1,12,10,7,13,17,18,11,6,22,0,23,16,21,15,20,4,9,2,5
Now, I want to find any index of suffix array, which value pointing at one 'er'
. So the output would be index in SA pointing at 3,14 or 19
so it would return 1,2 or 3
I'm trying to use a binary search but I can't figure out how.
def findOneOccurence(text,SA,p):
high = len(text)-1 # The last index
low = 0 # the lowest index
while True:
check = (high-low)/2 # find a middle
if p in text[SA[check]:SA[check]+len(p)]:
return check
else:
if text[SA[check]:SA[check]+len(p)]<p:
low = check
else:
high = check
if high<=low:
return None
This returns 11
. But text[SA[11]:SA[11]+2]
is 'oh'
instad of 'er'
.
Where could be the problem?
This function would work on huge texts about millions of chars.
EDIT: I've found a mistake. Instead of if text[SA[check]:SA[check+len(p)]]<p:
should be text[SA[check]:SA[check]+len(p)]<p:
but it's still wrong. It returns None instead of 'er'
EDIT II: Another mistake: if high>=low changed to high<=low, now, it returns 2 which is good.
EDIT III: Now it works, but on some inputs it gets to the loop and never ends.
Borrowing and editing https://hg.python.org/cpython/file/2.7/Lib/bisect.py