I am looking for some code to return all sequences with two mismatches from the original string, for the purpose of finding parts of a protein sequence that are similar to the sequence I input. For example, searching for LKLD in LELFLKEF should return LELF, LKEF, and LFLK. I have looked at various Python approaches to do this, but I can't seem to make any of them work.
Search for string allowing for one mismatch in any location of the string
A simple approach would be to roll through the sequence and calculate the Hamming distance for each alignment of the query 'LKLD' to the subject sequence 'LELFLKEF'. There is a sample implementation of the Hamming distance calculation in the linked Wikipedia article. Once you have that, your code only needs to slide a window of the query's length along the subject and keep every window whose distance is at most two.
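A minimal sketch of that rolling search, assuming "two mismatches" means at most two. The names `hamming_distance`, `find_near_matches`, and `max_mismatches` are illustrative choices, not taken from any library:

```python
def hamming_distance(s1, s2):
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance is undefined for unequal lengths")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))


def find_near_matches(query, subject, max_mismatches=2):
    """Yield (offset, window, distance) for each window within the limit."""
    width = len(query)
    # Every offset at which the query fits entirely inside the subject.
    for offset in range(len(subject) - width + 1):
        window = subject[offset:offset + width]
        distance = hamming_distance(query, window)
        if distance <= max_mismatches:  # assumed: "at most", not "exactly"
            yield offset, window, distance


for offset, window, distance in find_near_matches("LKLD", "LELFLKEF"):
    print(offset, window, distance)
```

If you want exactly two mismatches rather than up to two, change the comparison to `distance == max_mismatches`.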
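Output for the query LKLD against LELFLKEF: LELF at offset 0, LFLK at offset 2, and LKEF at offset 4, each at a Hamming distance of 2.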
Note that this is a naive solution with plenty of room for optimization, but it will work for simple cases and reasonably small strings. A condensed version can collect the matching windows into a list with a single comprehension.
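One way such a condensed version might look, again assuming a threshold of at most two mismatches (the variable names are illustrative):

```python
query, subject = "LKLD", "LELFLKEF"
max_mismatches = 2  # assumed threshold: at most two mismatches

matches = [
    subject[i:i + len(query)]
    for i in range(len(subject) - len(query) + 1)
    # Inline Hamming distance between the query and the current window.
    if sum(a != b for a, b in zip(query, subject[i:i + len(query)])) <= max_mismatches
]
print(matches)
```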
The `for` loop rolls through all possible ungapped alignments of your two strings, that is, every offset at which the query fits entirely inside the subject.
There are also many implementations for the problem of aligning biological sequences, so you might also want to explore some more involved techniques that handle things like gapped alignment and more complicated modeling of substitutions.