What is the best algorithms to compare Strings and put the similar together?

64 Views Asked by At

I'm trying to group redundancies in a dataset for some analysis. My primary tool for analysis are their titles.

I might have things like "blue bird" "big blue bird" "brown dog" "red dog", etc.

In this case, I want to group "blue bird" and "big blue bird" together but none of the other elements should be grouped.

I know about String Metrics but, in general, how effective are they on phrases as opposed to single words or noisy strings and which would be an effective solution for this problem?

2

There are 2 best solutions below

0
On

You could use the same logic that people usually put in programs to sort an array, fix a variable (in this case would be a string that we would use the first word) and compare it with the strings that you have, always looking for an equal word, if it is equal you should place in a separate vector or in a specific order.

However , doing so you would spend a lot of time and probably not the best way to go because it would go phrase by phrase, word by word, letter by letter. Otherwise it may seem helpful to separate the strings by the initial letter of the first word in large groups. This way, you spend less time in your search for repeated words, which would optimize the use of memory.

I found this paper from Carnegie Mellon University, it seems very interesting, it talks about this problem, you should take a better look: String Metric

0
On

String metrics don't care if your words contain empty spaces or not. Thus phrases are mostly just longer strings than words (in this regard), so string metrics work just as well if you are performing a fuzzy search (allthough you might want to search for every word individually).

Since you seem to be looking for exact matches though, i would recommend building a suffix tree from the concatenation of your titles. You can then search that tree for each of your title and build title-groups if you got more than one match. However you will need to decide what you want to do with combinations like

  • blue bird
  • big blue bird
  • small blue bird

Following the brown/red dog example, you would not want to group "big blue bird" with "small blue bird", but "blue bird" would be grouped with both of these.