I'm using spacy_universal_sentence_encoder (https://spacy.io/universe/project/spacy-universal-sentence-encoder) for a plagiarism detection app.
After testing out a few other libraries, I found this model to be practical and more accurate for plagiarism detection (see here, here).
When I test it with 'simple' sentences, for example:
import spacy_universal_sentence_encoder
# Load one of the models: ['en_use_md', 'en_use_lg', 'xx_use_md', 'xx_use_lg']
nlp = spacy_universal_sentence_encoder.load_model('xx_use_lg')
doc1 = nlp("Toto va à l'école avec son nouveau sac.")
doc2 = nlp("Toto came to school today with a new bag.")
# Similarity of two documents
print("Similarity of two texts : ", doc1, "<->", doc2, doc1.similarity(doc2))
I get the following output:
2024-03-28 18:28:21.505655: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-28 18:28:38.693139: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Similarity of two texts : Toto va à l'école avec son nouveau sac. <-> Toto came to school today with a new bag. 0.8622567857770158
The thing is, in my plagiarism detection app I'm checking a submission (which can be a whole document) against a group of documents, so there is much more text involved than just a sentence or two:
import spacy_universal_sentence_encoder

def check_similarity(document_to_check_against: str, document_to_check: str) -> float:
    """Check similarity between two given texts using spaCy."""
    nlp = spacy_universal_sentence_encoder.load_model('xx_use_lg')
    doc1 = nlp(document_to_check_against)
    doc2 = nlp(document_to_check)
    similarity = doc1.similarity(doc2)
    return similarity
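For context, I call it against the group of documents roughly like this (the corpus and file names below are just placeholders; the real texts come from uploaded documents):

# Placeholder corpus; in the real app these are full uploaded documents
corpus = {
    "reference_1.txt": "...full text of a reference document...",
    "reference_2.txt": "...full text of another reference document...",
}

submission = "...full text of the submitted document..."

for name, reference_text in corpus.items():
    score = check_similarity(reference_text, submission)
    print(name, "->", score)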
That gives me the following warning, and then the server crashes: W external/local_tsl/tsl/framework/cpu_allocator_impl.cc:83] Allocation of 9059656832 exceeds 10% of free system memory
From what I've looked up, this relates to a batch size that can be reduced. Where could I reduce the batch size here? Is the "batch size" in this case simply the amount of text being processed at once?
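If that is what it means, would splitting the documents into smaller chunks and comparing those be the right approach? Something like the sketch below is what I have in mind (chunking on blank lines is purely an assumption on my part, and here the model is loaded once at module level rather than on every call):

import spacy_universal_sentence_encoder

# Load the model once instead of on every call
nlp = spacy_universal_sentence_encoder.load_model('xx_use_lg')

def check_similarity_chunked(document_to_check_against: str, document_to_check: str) -> float:
    """Compare two long texts chunk by chunk instead of encoding each whole document at once."""
    # Naive chunking on blank lines (paragraphs); just an assumption for this sketch
    chunks_a = [c for c in document_to_check_against.split("\n\n") if c.strip()]
    chunks_b = [c for c in document_to_check.split("\n\n") if c.strip()]
    if not chunks_a or not chunks_b:
        return 0.0
    docs_b = [nlp(c) for c in chunks_b]
    scores = []
    for chunk in chunks_a:
        doc_a = nlp(chunk)
        # keep the best match for this chunk against the other document's chunks
        scores.append(max(doc_a.similarity(doc_b) for doc_b in docs_b))
    return sum(scores) / len(scores)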
How can I fix this?