I am using LangChain to search a vector database of cell line entries that look like this:
ID: 253F1
AC: CVCL_B513
SY: NA
OX: NCBI_TaxID=9606; ! Homo sapiens (Human)
CA: Induced pluripotent stem cell
The text file contains easily tens of thousands of these entries, and I store them in a vector DB.
I find that a similarity search on, say, "Induced pluripotent stem cell" always returns relevant documents. However, if I search for 253F1 or CVCL_B513, it's about a coin flip whether the similarity search returns relevant documents.
The reason I need this form of search rather than a direct word match is that the input query sometimes uses a different syntax, e.g. 253-F1, 253.F1, or 253f1, and this kind of variation applies across thousands of entries.
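To make the variation concrete, here is a minimal sketch of the equivalence I have in mind. The helper name `normalize_id` is hypothetical (it is not part of LangChain or FAISS), and it assumes the variants differ only in letter case and punctuation, which may not cover every real query:

```python
import re


def normalize_id(raw: str) -> str:
    """Lowercase the identifier and drop every non-alphanumeric character."""
    return re.sub(r"[^0-9a-z]", "", raw.lower())


# The spellings from the example above should all collapse to one key.
variants = ["253F1", "253-F1", "253.F1", "253f1"]
normalized = {normalize_id(v) for v in variants}
print(normalized)  # {'253f1'}
```

All four spellings should count as the same entry, and that is the behavior I was hoping the similarity search would give me.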
Is there an alternative approach for these short queries, something that might give better results? I have tried using FAISS to create a vector DB and run a similarity search on it, but I fear that, due to the nature of the data, too many entries appear similar.