Problem with LSH implementation from Datasketch (size of input data > 150000)


I am a beginner data scientist trying to write a fast duplicate search using the LSH implementation from datasketch. When I run my program on a large input (more than 250,000 documents), step 1 finishes fine, but then the program hangs on step 2. When I run the program on a small input, everything works. Is there any way to fix this problem?

from datetime import datetime

from datasketch import MinHash, MinHashLSH
from sklearn.feature_extraction.text import CountVectorizer

def LSH(data, num_perm = 128, threshold = 0.5, check_const = 0.9):
    # Note: the original stop_words list was ['_dot_', '_comma_''_voskl_'] --
    # the missing comma silently concatenates the last two strings.
    vec_unig = CountVectorizer(min_df=50, analyzer = 'word', stop_words = ['_dot_', '_comma_', '_voskl_'], ngram_range=(1,2))
    X = vec_unig.fit_transform([" ".join(i) for i in data])
    length = X.shape[0]
    array1 = []
    print("Collection:" ,length)
    print("Step 1:")
    print("Form Minhash")
    start = datetime.now()
    for i in range(len(data)):
        if ((i % 1000) == 0):  # printing every iteration slows this loop down considerably
            print(i)
        m = MinHash(num_perm = num_perm)
        for d in data[i]:
            m.update(d.encode('utf8'))
        array1.append(m)
    print(datetime.now()- start)
    print("Step 2")
    print("Form potential clusters")
    start = datetime.now()
    lsh = MinHashLSH(threshold = threshold, num_perm = num_perm)
    for i in range(len(array1)):
        if ((i % 100) == 0):
            print(i)
        lsh.insert(i, array1[i])
    print(datetime.now()- start)
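To see why step 2 can become a bottleneck at this scale, it helps to look at what `MinHashLSH.insert` does conceptually: each signature is sliced into bands, and the key is stored once per band in a hash table, so the index grows with both the corpus size and the number of bands. Below is a minimal, stdlib-only sketch of that MinHash + LSH banding idea. This is *not* the datasketch API or its actual implementation, just an illustration; all names and parameter choices here are my own.

```python
import hashlib
import random
from collections import defaultdict

NUM_PERM = 16        # datasketch defaults to 128; smaller here for readability
NUM_BANDS = 4        # the similarity threshold is roughly (1/NUM_BANDS) ** (1/ROWS)
ROWS = NUM_PERM // NUM_BANDS

P = (1 << 61) - 1    # a Mersenne prime used as the hash modulus
random.seed(42)
# One random affine hash h(x) = (a*x + b) mod P per "permutation"
PARAMS = [(random.randrange(1, P), random.randrange(P)) for _ in range(NUM_PERM)]

def _token_hash(token):
    # Deterministic 64-bit token hash (the built-in hash() is salted per process)
    return int.from_bytes(hashlib.md5(token.encode('utf8')).digest()[:8], 'big')

def minhash_signature(tokens):
    """One signature entry per permutation: the minimum hash over all tokens."""
    hashes = [_token_hash(t) for t in set(tokens)]
    return tuple(min((a * h + b) % P for h in hashes) for a, b in PARAMS)

# The LSH index: one hash table per band, keyed by that band's slice of the
# signature. Every inserted key is stored once per band, which is why memory
# grows with both the number of documents and the number of bands.
bands = [defaultdict(list) for _ in range(NUM_BANDS)]

def insert(key, sig):
    for i in range(NUM_BANDS):
        bands[i][sig[i * ROWS:(i + 1) * ROWS]].append(key)

def query(sig):
    candidates = set()
    for i in range(NUM_BANDS):
        candidates.update(bands[i].get(sig[i * ROWS:(i + 1) * ROWS], []))
    return candidates

docs = {
    "a":     "the quick brown fox jumps over the lazy dog".split(),
    "a_dup": "the quick brown fox jumps over the lazy dog".split(),
    "c":     "completely different words in this sentence here".split(),
}
sigs = {k: minhash_signature(v) for k, v in docs.items()}
for k, s in sigs.items():
    insert(k, s)

# Exact duplicates share every band, so they must land in the same buckets.
print(sorted(query(sigs["a"])))
```

Since all 250,000 `MinHash` objects are also kept in `array1` before insertion starts, memory pressure may be what makes step 2 appear to hang. If that is the case, building each `MinHash` and inserting it into the LSH index in a single loop (instead of accumulating them first) would reduce the peak footprint; datasketch also provides a `LeanMinHash` class intended to store signatures more compactly, which may be worth checking against your installed version.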