High Level
I am trying to collapse common substrings in a list of sentences and present only the areas that they differ. So taking this:
Please don't kick any of the cats
Please do kick any of the cats
Please don't kick any of the dogs
Please do kick any of the dogs
Please don't kick any of the garden snakes
Please do pet any of the garden snakes
And returning this:
Please [don't|do] [kick|pet] any of the [cats|dogs|garden snakes]
More Details
- I have been looking at Longest Common Substring algorithms but that seems to only compare two strings.
- I am only interested in comparing the whole words in the string.
- Only want to evaluate the strings from left to right.
- The length of uncommon substrings will not be the same number of words ("cat" vs "garden snake")
I am looking for help on the algorithm. I believe this is a variant of the LCS problem, I think some kind of processing of a suffix tree. Pseudo code that might explain and implementation would be ideal.
Another Example
Please join thirteen of your friends at the Midnight Bash this Friday
Don't forget to join your friend John at the Midnight Bash tomorrow
Don't forget to join your friends John and Julie at the Midnight Bash tonight
turns into:
[Please|Don't forget to]
join
[thirteen of your friends|your friend John|your friends John and Julie]
at the Midnight Bash
[this Friday|tomorrow|tonight]
Maybe this approach
What about this approach ...
for an array of sentences
loop with the remaining sentence
find the "first common substring (FCS)"
split the sentences on the FCS
every unique phrase before the FCS is part of the set of uncommon phrases
trim the sentence by the first uncommon phrase
end loop
Interestingly, I had been thinking about creating something like yours long time before, until I realized that this is actually kind of AI. Too many factors to take into considerations: grammar, syntax, situations, errors, etc. But if your input are always so fixed like "Please [A1|A2|..] [B1|B2|..] any of the [C1|C2|..]", then maybe a simple Regex pattern will do: "^Please\s*(?(don't|do))\s*(?\w+)+\s*any of the\s*(?.)*$".