Say I have the following two strings in my database:
(1) 'Levi Watkins Learning Center - Alabama State University'
(2) 'ETH Library'
My software receives free text inputs from a data source, and it should match those free texts to the pre-defined strings in the database (the ones above).
For example, if the software gets the string 'Alabama University', it should recognize that this is more similar to (1) than it is to (2).
At first, I thought of using a well-known string metric such as Damerau-Levenshtein or trigrams, but this leads to unwanted results, as you can see here:
http://fuzzy-string.com/Compare/Transform.aspx?r=ETH+Library&q=Alabama+University
Difference to (1): 37
Difference to (2): 14
(2) wins because it is much shorter than (1), even though (1) contains both words (Alabama and University) of the search string.
I also tried trigrams (using the JavaScript library fuzzySet), but I got similar results there.
Is there a string metric that would recognize the similarity of the search string to (1)?
Keyword Counting
You haven't really defined why you think option one is a "closer" match, at least not in any algorithmic sense. It seems like you're basing your expectations on the notion that option one has more matching keywords than option two, so why not just match based on the number of keywords in each string?
For example, using Ruby 2.0:
This will print:
which matches your expectations of the corpus. You might want to make additional passes on the results using other algorithms to refine the results or to break ties, but this should at least get you pointed in the right direction.
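As a rough illustration of the keyword-counting idea, here is a minimal sketch. The helper names `keywords`, `keyword_score`, and `rank` are made up for this example, and it assumes that matching should be case-insensitive and that punctuation such as the hyphen should be ignored; neither assumption comes from the question.

```ruby
# Candidate strings stored in the database (taken from the question).
CANDIDATES = [
  'Levi Watkins Learning Center - Alabama State University',
  'ETH Library'
].freeze

# Split a string into lowercase word tokens, dropping punctuation such as '-'.
# (Hypothetical helper; lowercasing is an assumption, not a requirement.)
def keywords(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# Count how many distinct query keywords also occur in the candidate.
# Array#& is set intersection, so repeated words are only counted once.
def keyword_score(query, candidate)
  (keywords(query) & keywords(candidate)).size
end

# Pair each candidate with its score and order them best match first.
def rank(query, candidates)
  candidates
    .map { |candidate| [candidate, keyword_score(query, candidate)] }
    .sort_by { |_, score| -score }
end

rank('Alabama University', CANDIDATES).each do |candidate, score|
  puts "#{score}: #{candidate}"
end
```

With the two strings from the question, this scores (1) at 2 and (2) at 0, so (1) ranks first regardless of its length. Candidates with equal scores come back in no guaranteed order, which is where a second pass with another metric could break ties.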