What indexer do I use to find the list in the collection that is most similar to my list?

143 Views Asked by JaseC At 04 June 2025 at 07:02

Let's say I have my list of ingredients: {'potato','rice','carrot','corn'}

and I want to return lists from a database that are most similar to mine:

{'beans','potato','oranges','lettuce'}, {'carrot','rice','corn','apple'}, {'onion','garlic','radish','eggs'}

My query would return this first: {'carrot','rice','corn','apple'}
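
To make "most similar" concrete, here is a minimal sketch that assumes similarity simply means the number of shared items. That definition and the function name overlap are assumptions for illustration, not something stated above. It loops over every list, so it only shows the desired ranking and is not meant as an approach for a million lists.

```python
def overlap(query, candidate):
    """Count the items the two lists have in common (assumed similarity measure)."""
    return len(set(query) & set(candidate))


query = {"potato", "rice", "carrot", "corn"}
candidates = [
    {"beans", "potato", "oranges", "lettuce"},
    {"carrot", "rice", "corn", "apple"},
    {"onion", "garlic", "radish", "eggs"},
]

# Sort so that the list sharing the most items with the query comes first.
ranked = sorted(candidates, key=lambda c: overlap(query, c), reverse=True)
print(ranked[0])
```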

I've used Solr, and have looked at CloudSearch, ElasticSearch, Algolia, Searchify and Swiftype. These engines only seem to let me put in one query string and then filter by other facets.

In a real scenario my search list will be about 200 items long and will be matching against about a million lists in my database.

What technology should I use to accomplish what I want to do?

Should I look beyond search indexers, towards database-style tools like MongoDB, MapReduce, or Hadoop? All I know are the names of these technologies, and I just need someone to point me towards the technology path I should be exploring for this.

With so much data I can't really loop through it; I need to query everything at once.


There is 1 best solution below

BlueM On 12 June 2015 at 10:21 BEST ANSWER

I wonder what keeps you from trying it with Solr, as Solr provides much of what you need. You can declare the field as type="string" multiValued="true" and save each list item as a value. Then, when querying, you specify each item of your search list as a search term for that field, and Solr will, by default, rank the closest matches first. If you need exact control over what counts as a match (e.g. at least 40% of the terms from the search list have to be in a matching list), you can use the mm ("minimum should match") EDisMax parameter, cf. Solr Wiki.
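
As a rough sketch of such a query, the snippet below only builds the request URL for Solr's select handler and sends nothing. The field name ingredients, the core name recipes, the localhost address and the helper build_solr_query are assumptions for illustration; defType, q, qf, mm, rows and fl are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_solr_query(items, field="ingredients", min_match="40%", rows=10):
    """Build the query string for an EDisMax search over one multi-valued field."""
    params = {
        "defType": "edismax",    # use the Extended DisMax query parser
        "q": " ".join(items),    # each list item becomes one search term
        "qf": field,             # hypothetical field holding the list items
        "mm": min_match,         # share of terms that must match
        "rows": rows,            # number of results to return
        "fl": "id,score",        # return only the document id and its score
    }
    return urlencode(params)


query_string = build_solr_query(["potato", "rice", "carrot", "corn"])
# Hypothetical host and core name; adjust to your own installation.
url = "http://localhost:8983/solr/recipes/select?" + query_string
print(url)
```

Items that contain spaces would need to be quoted as phrases before joining, which this sketch does not handle.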

Having said that, I must add that I’ve never searched for 200 query terms (do I understand correctly that the search list will contain about 200 items?) and do not know how well that performs. But I guess that setting up a test core and filling it with random lists using a script should not take more than a few hours, so it should be possible to evaluate the performance of this approach without investing too much time.
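
A script for generating such test data could look roughly like this. It is a sketch under assumptions: the field name ingredients, the vocabulary size of 5000, the 1000 generated lists and the helper random_lists are arbitrary choices for illustration; only the 200 items per list comes from the question. The resulting JSON array is in the shape that Solr's JSON update handler accepts, so it could be posted to the update endpoint of a test core.

```python
import json
import random


def random_lists(n_lists, vocabulary, items_per_list, seed=0):
    """Create documents that each hold a random list of distinct items."""
    rng = random.Random(seed)  # fixed seed so test runs are repeatable
    return [
        {"id": str(i), "ingredients": rng.sample(vocabulary, items_per_list)}
        for i in range(n_lists)
    ]


# Made-up vocabulary; real item names would come from your own data.
vocabulary = [f"item{n}" for n in range(5000)]
docs = random_lists(n_lists=1000, vocabulary=vocabulary, items_per_list=200)

# Serialize as a JSON array of documents for indexing.
payload = json.dumps(docs)
print(len(docs), "documents,", len(payload), "bytes of JSON")
```

Scaling n_lists up towards a million in batches would give a more realistic picture of query times.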
