Generating a list of all the longest common substrings and a list of variations

351 Views Asked by At

High Level

I am trying to collapse common substrings in a list of sentences and present only the areas that they differ. So taking this:

Please don't kick any of the cats
Please do kick any of the cats
Please don't kick any of the dogs
Please do kick any of the dogs
Please don't kick any of the garden snakes
Please do pet any of the garden snakes

And returning this:

Please [don't|do] [kick|pet] any of the [cats|dogs|garden snakes]

More Details

  • I have been looking at Longest Common Substring algorithms but that seems to only compare two strings.
  • I am only interested in comparing the whole words in the string.
  • Only want to evaluate the strings from left to right.
  • The length of uncommon substrings will not be the same number of words ("cat" vs "garden snake")

I am looking for help on the algorithm. I believe this is a variant of the LCS problem, I think some kind of processing of a suffix tree. Pseudo code that might explain and implementation would be ideal.

Another Example

Please join thirteen of your friends at the Midnight Bash this Friday
Don't forget to join your friend John at the Midnight Bash tomorrow
Don't forget to join your friends John and Julie at the Midnight Bash tonight

turns into:

[Please|Don't forget to]
join
[thirteen of your friends|your friend John|your friends John and Julie]
at the Midnight Bash
[this Friday|tomorrow|tonight]

Maybe this approach

What about this approach ...

for an array of sentences
  loop with the remaining sentence
    find the "first common substring (FCS)"
    split the sentences on the FCS
    every unique phrase before the FCS is part of the set of uncommon phrases
    trim the sentence by the first uncommon phrase
  end loop
2

There are 2 best solutions below

0
On

Interestingly, I had been thinking about creating something like yours long time before, until I realized that this is actually kind of AI. Too many factors to take into considerations: grammar, syntax, situations, errors, etc. But if your input are always so fixed like "Please [A1|A2|..] [B1|B2|..] any of the [C1|C2|..]", then maybe a simple Regex pattern will do: "^Please\s*(?(don't|do))\s*(?\w+)+\s*any of the\s*(?.)*$".

0
On

Map every unique word to a single object. Then build a conditional probability table (see Markov chains) to enumerate counts of how many times a word follows each sequence.