Returning sequences with two mismatches from original string


I am looking for code that returns all subsequences with two mismatches from an original string, in order to find parts of a protein sequence that are similar to a sequence I input. For example, searching for LKLD in LELFLKEF should return LELF, LKEF, and LFLK. I have looked at various Python approaches to do this, but I can't get any of them to work.

Search for string allowing for one mismatch in any location of the string

String regex two mismatches Python

Ambiguous substring with mismatches

avigil

A simple approach is to slide along the sequence and calculate the Hamming distance for each alignment of the query 'LKLD' against the subject sequence 'LELFLKEF'. There is a sample implementation of the Hamming distance calculation in the linked Wikipedia article. Once you have that, your code would do something like:

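# Hamming distance: number of positions at which two equal-length strings differ
d = lambda s1, s2: sum(e1 != e2 for e1, e2 in zip(s1, s2))

subject = 'LELFLKEF'
query = 'LKLD'
# slide a window of len(query) across the subject, one position at a time
for i in range(len(subject)-len(query)+1):
    aligned_subject = subject[i:i+len(query)]
    if d(aligned_subject, query) == 2:
        print(aligned_subject)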

Output:

LELF
LFLK
LKEF

Note that this is a naive solution with plenty of room for optimization: it compares every character of every window, so the work grows with len(subject) * len(query). It will still work for simple cases and reasonably small strings. A condensed version that produces a list:

s = 'LELFLKEF'
q = 'LKLD'
d = lambda s1, s2: sum(e1 != e2 for e1, e2 in zip(s1, s2))
matches = [s[i:i+len(q)] for i in range(len(s)-len(q)+1) if d(s[i:i+len(q)], q) == 2]
print(matches)  # ['LELF', 'LFLK', 'LKEF']
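One of the easier optimizations is to stop comparing a window as soon as it has too many mismatches. The following is a minimal sketch of that idea, not part of the original answer; the helper name `within_mismatches` and its `max_mismatches` parameter are made up for illustration. Note that it tests for "at most" two mismatches rather than exactly two, which happens to give the same result for this example but would also return exact and one-off matches on other inputs.

```python
def within_mismatches(window, query, max_mismatches):
    """Return True if window and query differ in at most max_mismatches positions."""
    if len(window) != len(query):
        return False
    mismatches = 0
    for a, b in zip(window, query):
        if a != b:
            mismatches += 1
            if mismatches > max_mismatches:
                return False  # give up early, no need to scan the rest
    return True


subject = 'LELFLKEF'
query = 'LKLD'
hits = [subject[i:i + len(query)]
        for i in range(len(subject) - len(query) + 1)
        if within_mismatches(subject[i:i + len(query)], query, 2)]
print(hits)
```

For short queries like this one the saving is negligible; it starts to matter when the query is long and most windows are poor matches.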

The for loop rolls through all possible ungapped alignments of your two strings:

0
LELFLKEF
||||
LKLD
 1
LELFLKEF
 ||||
 LKLD
  2
LELFLKEF
  ||||
  LKLD
   3
LELFLKEF
   ||||
   LKLD
    4
LELFLKEF
    ||||
    LKLD
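If the offsets shown in the diagram are useful to you (for example, to map hits back onto the protein), the same loop can report them alongside each window. This is a sketch under the same assumptions as the code above; `mismatch_windows` is a hypothetical name, and it keeps the "exactly two mismatches" test.

```python
def mismatch_windows(subject, query, mismatches=2):
    """Yield (offset, window) for every ungapped alignment of query against
    subject that has exactly `mismatches` differing positions."""
    width = len(query)
    for offset in range(len(subject) - width + 1):
        window = subject[offset:offset + width]
        distance = sum(a != b for a, b in zip(window, query))
        if distance == mismatches:
            yield offset, window


for offset, window in mismatch_windows('LELFLKEF', 'LKLD'):
    print(offset, window)
```

For the example strings this reports offsets 0, 2 and 4, which correspond to the first, third and fifth alignments drawn above.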

There are also many existing implementations for aligning biological sequences, so you might want to explore some more involved techniques that handle things like gapped alignment and more sophisticated modeling of substitutions.
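To give a feel for what "gapped" means, here is a small sketch of edit (Levenshtein) distance, which allows insertions and deletions as well as substitutions. It is only an illustration with unit costs for every operation; real sequence alignment tools typically use substitution matrices and gap penalties instead, and `edit_distance` is a name chosen for this example rather than anything from a library.

```python
def edit_distance(a, b):
    """Minimum number of single-character substitutions, insertions and
    deletions needed to turn string a into string b (unit costs)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution = previous[j - 1] + (ca != cb)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance('LKLD', 'LKD'))   # 1: a single deletion, which Hamming distance cannot express
print(edit_distance('LKLD', 'LELF'))  # 2: two substitutions, same as the Hamming distance
```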