Strange behaviour of Python difflib library for sequence matcher

Asked by Mr. T on 05 May 2018 at 18:46

I am somewhat puzzled by a strange behaviour in the difflib library. I am trying to find overlapping sequences in strings (actually FASTA sequences from a Rosalind task) to glue them together. The code, adapted from here, works well with shorter strings (for the sake of clarity, I construct an example here from a common substring a):

import difflib

def glue(seq1, seq2):
    s = difflib.SequenceMatcher(None, seq1, seq2)
    # longest common block: start in seq1, start in seq2, length
    start1, start2, overlap = s.find_longest_match(0, len(seq1), 0, len(seq2))
    # glue only if the block is a suffix of seq1 and a prefix of seq2
    if start1 + overlap == len(seq1) and start2 == 0:
        return seq1 + seq2[overlap:]
    # no overlap found, return -1
    return -1


a = "ABCDEFG"
s1 = "XXX" + a
s2 = a + "YYY"

print(glue(s1, s2))

Output

XXXABCDEFGYYY

But when the string is longer, difflib doesn't find the match any more.

a = "AGGTGTGCCTGTGTCTATACATCGTACGCGGGAAGGTCCAAGTTAACATGGGGTACTGTAATGCACACGTACGCGGGAAGGTCCAAGTTAACTACGAAACGCGAGCCCATCTTTGCCGGTGTTAACTTGCTGTCAGGTGTTTGGCAAGGATCTTTGTTTGCCGGTGTTAACTTGCTGTCAGGTGTTTGGCCGGTGTTAACTTGCTGTCAGATGCGCGCCACGGCCAAATTCTAGGCACGCCAAATTCTAGGCACTTTAAGTGGTTCGATGATCCACGATGGTAAGCCAGCCGTACTTGC"

s1 = "XXX" + a
s2 = a + "YYY"

print(glue(s1, s2))

Output

-1

Why does this happen and how can you use difflib with longer strings?

Best answer, by Mr. T on 05 May 2018 at 18:54

I noticed that this behaviour starts when the "ATCG" string is longer than 200 characters. Searching the difflib documentation page for this number brought me to this passage:

Automatic junk heuristic: SequenceMatcher supports a heuristic that automatically treats certain sequence items as junk. The heuristic counts how many times each individual item appears in the sequence. If an item’s duplicates (after the first one) account for more than 1% of the sequence and the sequence is at least 200 items long, this item is marked as “popular” and is treated as junk for the purpose of sequence matching. This heuristic can be turned off by setting the autojunk argument to False when creating the SequenceMatcher.
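The heuristic can be observed directly. The sketch below is only an illustration: `core` is a made-up 240-character test string, not the sequence from the question. It inspects `bpopular`, the attribute in which `SequenceMatcher` records the items of the second sequence that the heuristic marked as popular, and compares the longest match found with and without `autojunk`.

```python
import difflib

# Hypothetical test data: 240 characters over the four-letter DNA alphabet.
core = "ACGT" * 60
s1 = "XXX" + core
s2 = core + "YYY"

default = difflib.SequenceMatcher(None, s1, s2)                  # autojunk=True
no_junk = difflib.SequenceMatcher(None, s1, s2, autojunk=False)

# Items of the second sequence that the heuristic marked as popular.
print(sorted(default.bpopular))   # → ['A', 'C', 'G', 'T']
print(sorted(no_junk.bpopular))   # → []

# Length of the longest matching block in each case.
print(default.find_longest_match(0, len(s1), 0, len(s2)).size)   # → 0
print(no_junk.find_longest_match(0, len(s1), 0, len(s2)).size)   # → 240
```

With every character of the alphabet marked as popular, the default matcher has nothing left to anchor a match on, which is why the longest match collapses to length 0.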

Since the sequence contains only the four characters A, T, C and G, each of them of course accounts for more than 1% of a sequence of 200 or more characters and is hence treated as junk. Changing the code to

import difflib

def glue(seq1, seq2):
    s = difflib.SequenceMatcher(None, seq1, seq2, autojunk=False)
    start1, start2, overlap = s.find_longest_match(0, len(seq1), 0, len(seq2))
    if start1 + overlap == len(seq1) and start2 == 0:
        return seq1 + seq2[overlap:]
    return -1

gets rid of this behaviour and allows substring matching without this length restriction.
I thought I'd leave this here on SO because I wasted hours combing through my code to find this problem.
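As a quick end-to-end check, here is a small sketch that runs the corrected function on a longer input. The 500-character random sequence `core` and the fixed seed are assumptions made for illustration; they are not data from the question.

```python
import difflib
import random

def glue(seq1, seq2):
    # Same logic as the corrected function above, repeated so the snippet runs on its own.
    s = difflib.SequenceMatcher(None, seq1, seq2, autojunk=False)
    start1, start2, overlap = s.find_longest_match(0, len(seq1), 0, len(seq2))
    if start1 + overlap == len(seq1) and start2 == 0:
        return seq1 + seq2[overlap:]
    return -1

# Hypothetical test data: a random 500-character DNA-like string.
random.seed(0)
core = "".join(random.choice("ACGT") for _ in range(500))
s1 = "XXX" + core
s2 = core + "YYY"

print(glue(s1, s2) == "XXX" + core + "YYY")   # → True
```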
