I am a beginner data scientist trying to implement a fast duplicate search using the LSH implementation from datasketch. When I run my program on a large input (more than 250,000 documents), step 1 finishes fine, but then the program hangs on step 2. With a small input everything works correctly. Is there any way to fix this problem? Here is my code:
from datetime import datetime

from datasketch import MinHash, MinHashLSH
from sklearn.feature_extraction.text import CountVectorizer


def LSH(data, num_perm=128, threshold=0.5, check_const=0.9):
    # Vectorize the documents (only the collection size is used later).
    vec_unig = CountVectorizer(min_df=50, analyzer='word',
                               stop_words=['_dot_', '_comma_', '_voskl_'],
                               ngram_range=(1, 2))
    X = vec_unig.fit_transform([" ".join(i) for i in data])
    length = X.shape[0]
    array1 = []
    print("Collection:", length)
    print("Step 1:")
    print("Form MinHash")
    start = datetime.now()
    # Step 1: build one MinHash signature per document from its tokens.
    for i in range(len(data)):
        print(i)
        m = MinHash(num_perm=num_perm)
        for d in data[i]:
            m.update(d.encode('utf8'))
        array1.append(m)
    print(datetime.now() - start)
    print("Step 2")
    print("Form potential clusters")
    start = datetime.now()
    # Step 2: insert every signature into the LSH index; this is where
    # the program hangs once the collection is large enough.
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    for i in range(len(array1)):
        if (i % 100) == 0:
            print(i)
        lsh.insert(i, array1[i])
    print(datetime.now() - start)
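For reference, this is roughly how I call the function. The toy documents below are just placeholders (in my real run each element of data is a list of tokens, and there are more than 250,000 documents); with an input this small, both steps complete:

# Hypothetical toy input: each "document" is a list of tokens,
# repeated so that tokens clear the min_df=50 threshold above.
docs = [
    ["hello", "world", "_dot_"],
    ["hello", "there", "_dot_"],
    ["completely", "different", "text"],
] * 100

LSH(docs)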
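I also saw that datasketch offers an insertion_session for bulk loading. I am not sure whether it is relevant to the hang, but this is the kind of change to step 2 I was considering (a sketch only; I have not verified it behaves differently with the default in-memory backend):

# Sketch: bulk-insert the signatures instead of calling lsh.insert()
# one at a time; lsh and array1 are as in the function above.
with lsh.insertion_session() as session:
    for i, m in enumerate(array1):
        session.insert(i, m)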