I am somewhat puzzled by a strange behaviour in the difflib library. I try to find overlapping sequences in strings (actually Fasta sequences from a Rosalind task) to glue them together. The code adapted from here works well with a smaller length of the string (for the sake of the clarity, I construct here an example from a common substring a):
import difflib
def glue(seq1, seq2):
s = difflib.SequenceMatcher(None, seq1, seq2)
start1, start2, overlap = s.find_longest_match(0, len(seq1), 0, len(seq2))
if start1 + overlap == len(seq1) and start2 == 0:
return seq1 + seq2[overlap:]
#no overlap found, return -1
return -1
a = "ABCDEFG"
s1 = "XXX" + a
s2 = a + "YYY"
print(glue(s1, s2))
Output
XXXABCDEFGYYY
But when the string is longer, difflib doesn't find the match any more.
a = "AGGTGTGCCTGTGTCTATACATCGTACGCGGGAAGGTCCAAGTTAACATGGGGTACTGTAATGCACACGTACGCGGGAAGGTCCAAGTTAACTACGAAACGCGAGCCCATCTTTGCCGGTGTTAACTTGCTGTCAGGTGTTTGGCAAGGATCTTTGTTTGCCGGTGTTAACTTGCTGTCAGGTGTTTGGCCGGTGTTAACTTGCTGTCAGATGCGCGCCACGGCCAAATTCTAGGCACGCCAAATTCTAGGCACTTTAAGTGGTTCGATGATCCACGATGGTAAGCCAGCCGTACTTGC"
s1 = "XXX" + a
s2 = a + "YYY"
print(glue(s1, s2))
Output
-1
Why does this happen and how can you use difflib with longer strings?
I noticed that this behaviour starts, when the "ATCG" string is longer than 200. Looking for this number on the
difflibdocumentation page brought me to this passage:Since the sequence only contains
ATCG, they are of course more than 1% of a sequence of 200 characters and hence considered junk. Changing the code togets rid of this behaviour and allows for substring matching without restriction.
I thought I leave this here on SO, because I wasted hours combing through my code to find this problem.