I am using the bisect module to search for and insert sha256 hashes into a list.
I have about 8,000,000 items to search and add to, and they are stored in an sqlite database, I want to read them into the list so I can search them quicker.
The issue I have is inserting items into the list using bisect to find the right insertion point is quite slow. It takes about 700 seconds to complete all 8,000,000 items.
It only takes about 90 seconds to create an index in the sqlite database in ascending order, and then about 60 seconds to insert these into the list sequentially.
The trouble is when I do this the bisect search for some items fail, but if I sequentially search the item for the hash it is actually there.
So it appears as though the order provided by the database isnt quite the same as the order provided when using bisect to get the index position.
Any ideas why this would be? it would be really useful to be able to pre-sort the list before relying on bisect.
UPDATE.... Based on a comment, I should explain that I have a custom class that behaves like a list, that packs the hashes in a bytearray to save memory. Here is my Class
class Hashlist():
def __init__(self, hashLen):
self.__hashLen = hashLen
self.__hashlist = bytearray()
self.__num_items = 0
def __getitem__(self, index):
if index >= len(self) or index < 0:
print index
raise IndexError("hash index out of range")
return
return str(self.__hashlist[index*self.__hashLen:(index+1)*self.__hashLen])
def __setitem__(self, index, data):
if index > len(self) or index < 0:
raise IndexError("hash index out of range")
return
if index == len(self):
self.__hashlist.extend(data)
else:
self.__hashlist[index*self.__hashLen:(index+1)*self.__hashLen] = data
def insert(self, index, data):
oldlen = len(self.__hashlist)/self.__hashLen
if index > oldlen or index < 0:
raise IndexError("trying to insert past next element")
return
if index == oldlen:
self.__hashlist.extend(data)
else:
# move the data
if self.__hashLen == 1:
self.__hashlist.append(chr(0))
orig_data = str(self.__hashlist[(index):(len(self.__hashlist)-1)])
self.__hashlist[(index + 1)*self.__hashLen:(len(self.__hashlist))*self.__hashLen] = orig_data
#replace existing data
self.__hashlist[index*self.__hashLen:(index+1)*self.__hashLen] = data
else:
orig_data = str(self.__hashlist[(index*self.__hashLen):(len(self.__hashlist) -1)*self.__hashLen])
self.__hashlist[(index + 1)*self.__hashLen:(len(self.__hashlist))*self.__hashLen] = orig_data
#replace existing data
self.__hashlist[index*self.__hashLen:(index+1)*self.__hashLen] = data
Thanks
Dean
If they're stored in a SQL database, an index doesn't guarantee that results are returned in "sorted" order - you have to be explicit by using "order by".
Also, if you're doing that many inserts then I wouldn't use bisect, instead sort/merge.