What are good ways to find all the common subsequences of length k of two strings?
Example:
s1= AAGACC
s2= AGATAACCAGGAGCTGC
all common subsequences of length 5: AAGAC AAACC    AGACC AAGCC
What are good ways to find all the common subsequences of length k of two strings?
Example:
s1= AAGACC
s2= AGATAACCAGGAGCTGC
all common subsequences of length 5: AAGAC AAACC    AGACC AAGCC
 On
                        
                            
                        
                        
                            On
                            
                                                    
                    
                Here's a version of the algorithm that uses recursion, a stack of size k and includes two optimizations to skip over characters that have already been seen and to skip sub-subsequences that do not exist.  The strings are not unique (there can be duplicates), so run the output through uniq.
#include <stdio.h>
#include <string.h>
/* s1 is the full first string, s2 is the suffix of the second string
 * (starting after the subsequence at depth r),
 * pos is the positions of chars in s1, r is the recursion depth,
 * and k is the length of subsequences that we are trying to match
 */
void recur(char *s1, char *s2, size_t pos[], size_t r, size_t k)
{
    char seen[256] = {0};       /* have we seen a character in s1 before? */
    size_t p0 = (r == 0) ? 0 : pos[r-1] + 1;    /* start at 0 or pos[r-1]+1 */
    size_t p1 = strlen(s1) - k + r;             /* stop at end of string - k + r */
    size_t p;
    if (r == k)         /* we made it, print the matching string */
    {
        for (p=0; p<k; p++)
            putchar(s1[pos[p]]);
        putchar('\n');
        return;
    }
    for (p=p0; p<=p1; p++)      /* at this pos[], loop through chars to end of string */
    {
        char c = s1[p];         /* get the next char in s1 */
        if (seen[c])
            continue;           /* don't go any further if we've seen this char already */
        seen[c] = 1;
        pos[r] = p;
        char *s = strchr(s2, c);        /* is this char in s2 */
        if (s != NULL)
            recur(s1, s+1, pos, r+1, k);        /* recursively proceed with next char */
    }
}
int main()
{
    char *s1 = "AAGACC";
    char *s2 = "AGATAACCAGGAGCTGC";
    size_t k = 5;
    size_t pos[k];
    if (strlen(s1) < k || strlen(s2) < k)       /* make sure we have at least k chars in each string */
        return 1;       /* exit with error */
    recur(s1, s2, pos, 0, k);
    return 0;
}
The output is:
AAGAC
AAGCC
AAACC
AGACC
 On
                        
                            
                        
                        
                            On
                            
                                                    
                    
                Create a trie of all the subsequences of given length k from s1 and then go over s2 and check for each sequence of length k if it is in the trie.
One relatively straightforward way would be to reconstruct the sequences from the LCS matrix. Here is an O(n^2 * k + x * n) algorithm to do so, where x is the size of the output (i.e. number of common subsequences of length k). It's in C++, but it should be rather easy to translate to C:
There should also be an O(n * (k + x)) way to solve this, based on a slightly different DP approach: Let f(i, k) be the minimum index j such that lcs(i, j) >= k. We have the recurrence
We can also reconstruct the sequences of length k from the matrix f.