I'm trying to group redundant entries in a dataset for some analysis. My primary tool for the analysis is their titles.
I might have things like "blue bird", "big blue bird", "brown dog", "red dog", etc.
In this case, I want to group "blue bird" and "big blue bird" together but none of the other elements should be grouped.
I know about string metrics but, in general, how effective are they on phrases, as opposed to single words or noisy strings, and which one would be an effective solution for this problem?
You could use the same logic that programs usually use to sort an array: fix one value (here a string, starting with its first word) and compare it with the other strings you have, always looking for an equal word. When a word matches, place the string in a separate vector or in a specific order.
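As a rough illustration of that comparison, here is a minimal Python sketch. It rests on assumptions that are mine, not part of the question: titles are split on whitespace, two titles count as a match when the Jaccard overlap of their word sets is at least 0.5, and each title is compared only with the first title of every existing group. The names `tokens`, `jaccard` and `group_titles` are made up for the example. A bare "one equal word" rule would also merge "brown dog" and "red dog", which is why a threshold is used here.

```python
def tokens(title):
    # Lowercase and split on whitespace; a real dataset may need more cleanup.
    return set(title.lower().split())


def jaccard(a, b):
    # Share of words the two sets have in common (0.0 when both are empty).
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_titles(titles, threshold=0.5):
    # Greedy grouping: the first title of each group acts as the fixed
    # reference string that later titles are compared against.
    groups = []
    for title in titles:
        words = tokens(title)
        for group in groups:
            if jaccard(words, tokens(group[0])) >= threshold:
                group.append(title)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([title])
    return groups


titles = ["blue bird", "big blue bird", "brown dog", "red dog"]
print(group_titles(titles))
# [['blue bird', 'big blue bird'], ['brown dog'], ['red dog']]
```

The threshold of 0.5 is only tuned to the four example titles; on real data it would need to be checked against titles you already know are duplicates.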
However, doing so would take a lot of time and is probably not the best way to go, because it proceeds phrase by phrase, word by word, letter by letter. Alternatively, it may help to first separate the strings into large groups by the initial letter of their first word. That way each string is compared only with the others in its group, so you spend less time searching for repeated words and need far fewer comparisons.
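The grouping by initial letter could look roughly like the sketch below. The function names (`word_set`, `candidate_pairs`, `similar_pairs`) and the 0.5 word-overlap threshold are assumptions chosen for the example, not a standard recipe. Note the trade-off: titles whose first words start with different letters (say "blue bird" and "the blue bird") land in different groups and are never compared.

```python
from collections import defaultdict
from itertools import combinations


def word_set(title):
    # Lowercase and split on whitespace.
    return set(title.lower().split())


def candidate_pairs(titles):
    # Bucket titles by the first letter of their first word, then only
    # pair up titles that share a bucket.
    buckets = defaultdict(list)
    for title in titles:
        stripped = title.strip()
        if stripped:  # skip empty titles
            buckets[stripped[0].lower()].append(title)
    for bucket in buckets.values():
        yield from combinations(bucket, 2)


def similar_pairs(titles, threshold=0.5):
    # Keep the candidate pairs whose word overlap (Jaccard) reaches the
    # threshold; the threshold value is an assumption for this example.
    result = []
    for a, b in candidate_pairs(titles):
        wa, wb = word_set(a), word_set(b)
        if len(wa & wb) / len(wa | wb) >= threshold:
            result.append((a, b))
    return result


titles = ["blue bird", "big blue bird", "brown dog", "red dog"]
print(len(list(candidate_pairs(titles))))  # 3 comparisons instead of 6
print(similar_pairs(titles))               # [('blue bird', 'big blue bird')]
```

With the four example titles, "red dog" sits alone in its bucket, so only the three titles starting with "b" are compared with each other.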
I found this paper from Carnegie Mellon University that talks about this problem. It seems very interesting and is worth a closer look: String Metric