I'm using spacy_universal_sentence_encoder (https://spacy.io/universe/project/spacy-universal-sentence-encoder) for a plagiarism detection app.
After testing out a few other libraries, I found this model to be practical and more accurate for plagiarism detection (see here, here).
When I test it with 'simple' sentences, for example:
import spacy_universal_sentence_encoder
# Load one of the models: ['en_use_md', 'en_use_lg', 'xx_use_md', 'xx_use_lg']
nlp = spacy_universal_sentence_encoder.load_model('xx_use_lg')
doc1 = nlp("Toto va à l'école avec son nouveau sac.")
doc2 = nlp("Toto came to school today with a new bag.")
# Similarity of two documents
print("Similarity of two texts : ", doc1, "<->", doc2, doc1.similarity(doc2))
I get the following output:
2024-03-28 18:28:21.505655: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-28 18:28:38.693139: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Similarity of two texts : Toto va à l'école avec son nouveau sac. <-> Toto came to school today with a new bag. 0.8622567857770158
The thing is, in my plagiarism detection app I'm checking a submission (which can be a whole document) against a group of documents, so there is much more text involved than just a sentence or two:
import spacy_universal_sentence_encoder

def check_similarity(document_to_check_against: str, document_to_check: str) -> float:
    """Check similarity between two given texts using spaCy."""
    nlp = spacy_universal_sentence_encoder.load_model('xx_use_lg')
    doc1 = nlp(document_to_check_against)
    doc2 = nlp(document_to_check)
    similarity = doc1.similarity(doc2)
    return similarity
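For context, I call it against the group of documents roughly like this (the corpus and file names below are just placeholders; the real texts come from uploaded documents):

# Placeholder corpus; in the real app these are full uploaded documents
corpus = {
    "reference_1.txt": "...full text of a reference document...",
    "reference_2.txt": "...full text of another reference document...",
}

submission = "...full text of the submitted document..."

for name, reference_text in corpus.items():
    score = check_similarity(reference_text, submission)
    print(name, "->", score)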
That gives me the following warning, and then the server crashes: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 9059656832 exceeds 10% of free system memory
From what I've looked up, this relates to a batch size that can be reduced. Where could I reduce the batch size here? Is the "batch size" in this case simply the amount of text being processed at once?
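If that is what it means, would splitting the documents into smaller chunks and comparing those be the right approach? Something like the sketch below is what I have in mind (chunking on blank lines is purely an assumption on my part, and here the model is loaded once at module level rather than on every call):

import spacy_universal_sentence_encoder

# Load the model once instead of on every call
nlp = spacy_universal_sentence_encoder.load_model('xx_use_lg')

def check_similarity_chunked(document_to_check_against: str, document_to_check: str) -> float:
    """Compare two long texts chunk by chunk instead of encoding each whole document at once."""
    # Naive chunking on blank lines (paragraphs); just an assumption for this sketch
    chunks_a = [c for c in document_to_check_against.split("\n\n") if c.strip()]
    chunks_b = [c for c in document_to_check.split("\n\n") if c.strip()]
    if not chunks_a or not chunks_b:
        return 0.0
    docs_b = [nlp(c) for c in chunks_b]
    scores = []
    for chunk in chunks_a:
        doc_a = nlp(chunk)
        # keep the best match for this chunk against the other document's chunks
        scores.append(max(doc_a.similarity(doc_b) for doc_b in docs_b))
    return sum(scores) / len(scores)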
How can I fix this?