I am going to share an example below, and I am interested in whether anyone has insight into best practices for creating embeddings. I use OpenAI's "text-embedding-ada-002" model.
So I created embeddings for the following inputs:
"Dog"
"Cat"
"Monkey"
"Peanut butter"
Now I would expect the following to sit close together in the embedding space, since they are all animals:
"Dog"
"Cat"
"Monkey"
and if I created an embedding for another animal and ran a similarity search against my vector DB, in most cases the top results returned would be animals; and if I created an embedding for a food, the top result would be "Peanut butter."
However, I found that in some cases I did not get what I expected. For example, I created an embedding for the input "Tomato" and ran a similarity search. I expected the top result to be another food like "Peanut butter", but the top result was "Cat," and I am not sure why. Can someone help explain this, or advise on what best practices to follow when creating embeddings?
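
For reference, a lookup like the one described above can be sketched as follows. This is only a minimal sketch: it assumes cosine similarity as the metric (the vector DB's actual metric is not stated), and the 3-dimensional vectors are made-up placeholders, not real "text-embedding-ada-002" output (which has 1536 dimensions). The helper names `cosine_similarity` and `rank_by_similarity` are hypothetical. Printing every score, rather than only the top hit, shows how large the gap between "Cat" and "Peanut butter" actually is for a given query.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, stored):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [(label, cosine_similarity(query_vec, vec))
              for label, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up toy vectors standing in for stored embeddings (assumption:
# real embeddings would come from the embedding model, not be typed in).
stored = {
    "Dog": [0.9, 0.1, 0.2],
    "Cat": [0.8, 0.2, 0.3],
    "Monkey": [0.7, 0.3, 0.1],
    "Peanut butter": [0.1, 0.9, 0.4],
}

# Hypothetical stand-in for the embedding of the query "Tomato".
query = [0.2, 0.8, 0.5]

# Print the full ranking with scores, not just the top result.
for label, score in rank_by_similarity(query, stored):
    print(f"{label}: {score:.3f}")
```

With real embeddings, the same loop would show whether the scores for all four stored items are bunched closely together, in which case the top-1 result on its own says little.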