Avoiding Regex OverflowError From Large IUPAC Ambiguous DNA Search

I am searching for scaffolds (tens of kb each) within a chromosome (tens of Mb) from the same assembly. Both contain IUPAC ambiguity codes. So far I have been using

from Bio.SeqUtils import nt_search
nt_search(chromosome, scaffold)

However, in one case the forward-strand search works, but the search with the scaffold's reverse complement raises a regular expression OverflowError.

from Bio.Seq import Seq
from Bio.SeqUtils import nt_search

def findCoordinates (self):
    ''' Performs scaffold string search within chromosomes. Returns scaffold coordinates within said chromosome. '''

    for chromosome in self.chromosomes.keys():
        for scaffold in self.scaffolds.keys():
            # search forward strand.
            nt_forward = nt_search(self.chromosomes[chromosome], self.scaffolds[scaffold])
            if len(nt_forward) > 1:
                startCoord = nt_forward[1] + 1
                endCoord = (startCoord + len(self.scaffolds[scaffold]))
                # save coordinates

            else: # search reverse strand
                scaffold_seq = Seq(self.scaffolds[scaffold])
                reverse_seq = scaffold_seq.reverse_complement()
                nt_reverse = nt_search(self.chromosomes[chromosome], str(reverse_seq))
                if  len(nt_reverse) > 1:
                    startCoord = nt_reverse[1] + 1
                    endCoord = (startCoord + len(self.scaffolds[scaffold]))
                    # save coordinates
                    self.scaffolds[scaffold] = str(scaffold_seq.reverse_complement())

and I get the following error:

Traceback (most recent call last):
  File "scaffoldPlacer.py", line 98, in <module>
    z.findCoordinates()
  File "scaffoldPlacer.py", line 60, in findCoordinates
    nt_reverse = nt_search(self.chromosomes[chromosome], str(reverse_seq))
  File "/usr/local/lib/python2.7/site-packages/Bio/SeqUtils/__init__.py", line 191, in nt_search
    m = re.search(pattern, s)
  File "/usr/local/lib/python2.7/re.py", line 142, in search
    return _compile(pattern, flags).search(string)
  File "/usr/local/lib/python2.7/re.py", line 243, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/local/lib/python2.7/sre_compile.py", line 523, in compile
    groupindex, indexgroup
OverflowError: regular expression code size limit exceeded

As mentioned above, this error only occurs when the reverse complement is searched; the forward-orientation search completes without errors.

Is there a way to avoid this error, or is there a better way to perform a DNA string search that accounts for IUPAC ambiguities?

Thank you

There is 1 best solution below

Following this (Python's Regular Expression Source String Length), it seems that very large patterns break re.compile. I'm on x64 Linux and cannot "break" re.compile even with re.compile("x"*5000000), while commenters on the linked question claim it breaks at 65536, which is in line with your tens-of-kb queries.

Can you try using another computer or OS?

Or maybe you can split each query in two (or find your maximum query size with the re.compile test above) and then check whether the coordinates of the partial matches are contiguous, adding a few lines to your code.
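A minimal sketch of that splitting idea, in case it helps. It does not use Biopython: the IUPAC table and the names to_regex and chunked_search are made up for this example, the chunk size of 1000 is an arbitrary guess that should stay well below the limit, and ambiguity codes are expanded on the scaffold side only. The first chunk is located with a lookahead so overlapping candidates are kept, and every following chunk must match exactly where the previous one ended.

```python
import re

# Hand-written IUPAC table for this sketch (assumption: codes are expanded
# in the scaffold only; the chromosome is matched character by character).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def to_regex(seq):
    """Plain bases stay literal; ambiguity codes become character classes."""
    return "".join(b if len(IUPAC[b]) == 1 else "[" + IUPAC[b] + "]" for b in seq)


def chunked_search(chromosome, scaffold, chunk_size=1000):
    """Return 0-based starts where all scaffold chunks match back to back."""
    if not scaffold:
        return []
    chunks = [scaffold[i:i + chunk_size]
              for i in range(0, len(scaffold), chunk_size)]
    patterns = [re.compile(to_regex(c)) for c in chunks]
    # Zero-width lookahead: finditer then reports overlapping candidates too.
    first = re.compile("(?=" + to_regex(chunks[0]) + ")")
    hits = []
    for m in first.finditer(chromosome):
        offset = m.start()
        for chunk, pat in zip(chunks, patterns):
            # Each chunk must match exactly at the end of the previous one.
            if pat.match(chromosome, offset) is None:
                break
            offset += len(chunk)
        else:
            hits.append(m.start())
    return hits
```

The returned positions are 0-based, so the + 1 used for startCoord in the question would still apply. Each compiled pattern covers only chunk_size bases, so none of them should come near the 65535 limit.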

Edit: I've found a system where the error reproduces. Near the beginning of python2.7/sre_compile.py you'll find these lines (Python 2.7.0):

if _sre.CODESIZE == 2:
    MAXCODE = 65535
else:
    MAXCODE = 0xFFFFFFFFL

where _sre is a built-in module (implemented in C). If your Python was compiled with a _sre.c that has CODESIZE == 2, the regex code size is limited to 65535, and you have to upgrade your Python interpreter.