Let me describe my problem. There is an input string, and a table containing many thousands of strings. I am looking for best way to search for the most similar* strings to the input string. The search should return a list of ~10 suggested strings, sorted by degree of similarity. Strings also have numerical weights (popularity) associated with them in database, in another column, so the ones with higher weights should have higher chance of appearing in results, if possible.
What is the best library to achieve this? I am looking for something similar to Elasticsearch, I guess. I don't have much experience with these kinds of libraries, so I would need something easy to include in my project and preferably open-source. I am using Python (Flask and SQLAlchemy) and Postgresql, but could also use e.g. Node.js, if needed.
*I also want to clarify what kind of similarity I am looking for. Ideally, it would be semantic similarity, but lexical similarity is fine as well. I would be happy with anything that works okay, is easy to implement, and is as scalable and performant as possible.
Example input sentence:
- I don't like cangaroos.
Example suggestions from the database:
- Cangaroos are not my favorite.
- Cangaroos are evil.
- I once had a cangaroo. Never again.
These suggestions should appear first because 'cangaroo' is not a frequent word in my database, so any string with the word 'cangaroo' should have a high chance appearing in results. It is probably much harder to detect 'don't like', so that part is completely optional for me.
P.s. Could PostgreSQL's full text search do something like this?
Thank you.
PostgreSQL Full-text search cannot do what you're looking for. However, PostgreSQL trigram similarity can do it.
You first need to install the packages with 'trigram similarity' and 'btree_gist', by executing (once) in your database:
I assume you have one table that looks like this one:
This table needs a 'trigram index', to allow the PostgreSQL database to 'index by similarity'. This is accomplished by executing:
To find the answers you're looking for, you execute:
The result that you get is:
Hope this gives you what you want...