Finding the First Common Substring of a set of strings

919 Views Asked by At

I am looking for an implementation of a First Common Substring

Mike is not your average guy. I think you are great.
Jim is not your friend. I think you are great.
Being different is not your fault. I think you are great.

Using a Longest Common Substring implementation (and ignoring punctuation), you would get "I think you are great", but I am looking for the first occurring common substring, in this example:

is not your

Perhaps an implementation that generates and ordered list of all common substrings that I can just take the first from.

Edit

The tokens being compared would be complete words. Looking for a greedy match of the first longest sequence of whole words. (Assuming a suffix tree was used in the approach, each node of the tree would be a word)

2

There are 2 best solutions below

0
On

Here's a little something that will do what you want. You would actually adjust to pre-build your list of strings, pass that in and it will find for you... in this example, the phrase will be based of the string with the shortest string as a baseline.

public void SomeOtherFunc()
{
    List<string> MyTest = new List<string>();
    MyTest.Add( "Mike is not your average guy. I think you are great." );
    MyTest.Add( "Jim is not your friend. I think you are great." );
    MyTest.Add( "Being different is not your fault. I think you are great." );

    string thePhrase = testPhrase( MyTest );
    MessageBox.Show( thePhrase );
}



public string testPhrase(List<string> test)
{

    // start with the first string and find the shortest.
    // if we can't find a short string in a long, we'll never find a long string in short
    // Ex "To testing a string that is longer than some other string" 
    // vs "Im testing a string that is short"
    // Work with the shortest string.
    string shortest = test[0];
    string lastGoodPhrase = "";
    string curTest;
    int firstMatch = 0;
    int lastMatch = 0;
    int allFound;
    foreach (string s in test)
        if (s.Length < shortest.Length)
            shortest = s;

    // Now, we need to break the shortest string into each "word"
    string[] words = shortest.Split( ' ' );

    // Now, start with the first word until it is found in ALL phrases
    for (int i = 0; i < words.Length; i++)
    {
        // to prevent finding "this" vs "is"
        lastGoodPhrase = " " + words[i] + " ";

        allFound = 0;
        foreach (string s in test)
        {
            // always force leading space for string
            if ((" "+s).Contains(lastGoodPhrase))
               allFound++;
            else
               // if not found in ANY string, its not found in all, get out
               break;
        }

        if (allFound == test.Count)
        {
            // we've identified the first matched field, get out for next phase test
            firstMatch = i;
            // also set the last common word to the same until we can test next...
            lastMatch = i;
            break;
        }
    }

    // if no match, get out
    if (firstMatch == 0)
        return "";

    // we DO have at least a first match, now keep looking into each subsequent
    // word UNTIL we no longer have a match.
    for( int i = 1; i < words.Length - firstMatch; i++ )
    {
        // From where the first entry was, build out the ENTIRE PHRASE
        // until the end of the original sting of words and keep building 1 word back
        curTest = " ";
        for (int j = firstMatch; j <= firstMatch + i; j++)
            curTest += words[j] + " ";

        // see if all this is found in ALL strings
        foreach (string s in test)
            // we know we STARTED with a valid found phrase.
            // as soon as a string NO LONGER MATCHES the new phrase,
            // return the last VALID phrase
            if (!(" " + s).Contains(curTest))
                return lastGoodPhrase;

        // if this is still a good phrase, set IT as the newest
        lastGoodPhrase = curTest;
    }

    return lastGoodPhrase;
}
0
On

There are quite a few steps to do this.

  1. Remove Punctuation
  2. Break down Sentences into list of Words
  3. Create string of all combinations of contiguous words (min:1, max:wordCount)
  4. Join the three lists on new list of string (subsentences)
  5. Sort Accordingly.

Code:

static void Main(string[] args)
{
    var sentence1 = "Mike is not your average guy. I think you are great.";
    var sentence2 = "Jim is not your friend. I think you are great.";
    var sentence3 = "Being different is not your fault. I think you are great.";

    //remove all punctuation 
    // http://stackoverflow.com/questions/421616
    sentence1 = new string(
      sentence1.Where(c => !char.IsPunctuation(c)).ToArray());
    sentence2 = new string(
      sentence2.Where(c => !char.IsPunctuation(c)).ToArray());
    sentence3 = new string(
      sentence3.Where(c => !char.IsPunctuation(c)).ToArray());

    //seperate into words
    var words1 = sentence1.Split(new char[] { ' ' },
      StringSplitOptions.RemoveEmptyEntries).ToList();
    var words2 = sentence2.Split(new char[] { ' ' },          
      StringSplitOptions.RemoveEmptyEntries).ToList();
    var words3 = sentence3.Split(new char[] { ' ' }, 
      StringSplitOptions.RemoveEmptyEntries).ToList();

    //create substring list
    var subSentence1 = CreateSubstrings(words1);
    var subSentence2 = CreateSubstrings(words2);
    var subSentence3 = CreateSubstrings(words3);

    //join then like a Sql Table
    var subSentences = subSentence1
        .Join(subSentence2,
            sub1 => sub1.Value,
            sub2 => sub2.Value,
            (sub1, sub2) => new { Sub1 = sub1, 
                                  Sub2 = sub2 })
        .Join(subSentence3,
            sub1 => sub1.Sub1.Value,
            sub2 => sub2.Value,
            (sub1, sub2) => new { Sub1 = sub1.Sub1, 
                                  Sub2 = sub1.Sub2, 
                                  Sub3 = sub2 })
        ;

    //Sorted by Lowest Index, then by Maximum Words
    subSentences = subSentences.OrderBy(s => s.Sub1.Rank)
      .ThenByDescending(s => s.Sub1.Length)
      .ToList();

    //Sort by Maximum Words, then Lowest Index
    /*subSentences = subSentences.OrderByDescending(s => s.Sub1.Length)
        .ThenBy(s => s.Sub1.Rank)
        .ToList();//*/

    foreach (var subSentence in subSentences)
    {
        Console.WriteLine(subSentence.Sub1.Length.ToString() + " " 
          + subSentence.Sub1.Value);
        Console.WriteLine(subSentence.Sub2.Length.ToString() + " " 
          + subSentence.Sub2.Value);
        Console.WriteLine(subSentence.Sub3.Length.ToString() + " " 
          + subSentence.Sub3.Value);
        Console.WriteLine("======================================");
    }

    Console.ReadKey();

}

//this could probably be done better -Erik
internal static List<SubSentence> CreateSubstrings(List<string> words)
{
    var result = new List<SubSentence>();
    for (int wordIndex = 0; wordIndex < words.Count; wordIndex++)
    {
        var sentence = new StringBuilder();
        int currentWord = wordIndex;
        while (currentWord < words.Count - 1)
        {
            sentence.Append(words.ElementAt(currentWord));
            result.Add(new SubSentence() { Rank = wordIndex, 
              Value = sentence.ToString(), 
              Length = currentWord - wordIndex + 1 });
            sentence.Append(' ');
            currentWord++;
        }
        sentence.Append(words.Last());
        result.Add(new SubSentence() { Rank = wordIndex, 
          Value = sentence.ToString(), 
          Length = words.Count - wordIndex });
    }
    return result;
}

internal class SubSentence
{
    public int Rank { get; set; }
    public string Value { get; set; }
    public int Length { get; set; }
}

Result:

3 is not your

3 is not your

3 is not your

======================================

2 is not

2 is not

2 is not

======================================

1 is

1 is

1 is

======================================

2 not your

2 not your

2 not your

======================================

1 not

1 not

1 not

======================================

1 your

1 your

1 your

======================================

5 I think you are great

5 I think you are great

5 I think you are great

======================================

4 I think you are

4 I think you are

4 I think you are

======================================

3 I think you

3 I think you

3 I think you

======================================

2 I think

2 I think

2 I think

======================================

1 I

1 I

1 I

======================================

4 think you are great

4 think you are great

4 think you are great

======================================

3 think you are

3 think you are

3 think you are

======================================

2 think you

2 think you

2 think you

======================================

1 think

1 think

1 think

======================================

3 you are great

3 you are great

3 you are great

======================================

2 you are

2 you are

2 you are

======================================

1 you

1 you

1 you

======================================

2 are great

2 are great

2 are great

======================================

1 are

1 are

1 are

======================================

1 great

1 great

1 great

======================================