Imagine I have code like the following. I am using the encode function to create embeddings, and from these I then calculate a cosine similarity score; after all, the model I have selected is geared towards cosine similarity (as opposed to dot-product similarity).
My question is: do you always embed the entire string as it is, or would/could you clean the two strings before encoding them? Strip stopwords, maybe keep only nouns or entities. Is this a thing, or would the disjointed, non-grammatical strings that result hurt the embeddings?
from sentence_transformers import SentenceTransformer, util

model_name = 'sentence-transformers/multi-qa-mpnet-base-cos-v1'
model = SentenceTransformer(model_name)

phrase1 = 'Some arbitrarily long string from a book'
phrase2 = 'This is another arbitrarily long string from the same book'

emb1 = model.encode(phrase1)
emb2 = model.encode(phrase2)

# cosine similarity between the two embeddings (this is what util is imported for)
score = util.cos_sim(emb1, emb2)
The cosine similarity scores I get are not spread out very well: there isn't enough separation between good matches and bad matches.

Since you are using sentence embeddings, encoding the whole sentence makes more sense. The model was trained on natural, complete sentences, so feeding it stopword-stripped or noun-only fragments gives it input unlike what it saw during training and is unlikely to help.
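If you want to check empirically whether cleaning changes anything for your data, you can score a few pairs both raw and with stopwords stripped and compare. A minimal sketch, assuming the same model as in your question (the tiny stopword list and the strip_stopwords helper are purely illustrative; in practice you might use NLTK or spaCy):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-cos-v1')

# illustrative stopword list; swap in a proper list (NLTK, spaCy) for real use
STOPWORDS = {'a', 'an', 'the', 'is', 'this', 'from', 'some', 'same'}

def strip_stopwords(text):
    # drop tokens that appear in the stopword list, keep the rest in order
    return ' '.join(tok for tok in text.split() if tok.lower() not in STOPWORDS)

pairs = [
    ('Some arbitrarily long string from a book',
     'This is another arbitrarily long string from the same book'),
    # add known-good and known-bad pairs from your own data here
]

for p1, p2 in pairs:
    raw = util.cos_sim(model.encode(p1), model.encode(p2)).item()
    cleaned = util.cos_sim(model.encode(strip_stopwords(p1)),
                           model.encode(strip_stopwords(p2))).item()
    print(f'raw: {raw:.3f}   cleaned: {cleaned:.3f}')

Running this on a handful of your own known-good and known-bad pairs is the quickest way to see whether cleaning actually buys you any separation.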
An alternative approach to increase separation: if you have an idea of the categories most of your texts fall into, you can use a zero-shot classifier to score each text against every category, then compare texts by those category scores. You can keep refining the categories iteratively, somewhat like a semi-supervised approach.
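A rough sketch of that idea with the Hugging Face zero-shot pipeline; the model name and the candidate categories below are placeholders you would replace and refine for your own texts:

from transformers import pipeline

# zero-shot classifier; facebook/bart-large-mnli is a common choice for English
classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')

# placeholder categories -- refine these as you inspect the scores
categories = ['romance', 'adventure', 'politics', 'science']

texts = [
    'Some arbitrarily long string from a book',
    'This is another arbitrarily long string from the same book',
]

for text in texts:
    result = classifier(text, candidate_labels=categories)
    # 'labels' and 'scores' come back sorted by descending score
    print(text)
    for label, score in zip(result['labels'], result['scores']):
        print(f'  {label}: {score:.3f}')

You can then compare two texts by the similarity of their category-score vectors instead of (or in addition to) their raw sentence embeddings, which can give you more separation along the axes you actually care about.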