I am searching for scaffolds (tens of kb) within a chromosome (tens of Mb) from the same assembly. Both contain IUPAC ambiguity codes. So far I have been using:
from Bio.SeqUtils import *
nt_search(chromosome, scaffold)
However, there is one case where the forward strand search works fine, but searching for the reverse complement raises a regular expression overflow error. My code:
from Bio.Seq import Seq
from Bio.SeqUtils import *

def findCoordinates(self):
    ''' Performs scaffold string search within chromosomes. Returns scaffold coordinates within said chromosome. '''
    for chromosome in self.chromosomes.keys():
        for scaffold in self.scaffolds.keys():
            # search forward strand
            nt_forward = nt_search(self.chromosomes[chromosome], self.scaffolds[scaffold])
            if len(nt_forward) > 1:
                startCoord = nt_forward[1] + 1
                endCoord = (startCoord + len(self.scaffolds[scaffold]))
                # save coordinates
            else:  # search reverse strand
                scaffold_seq = Seq(self.scaffolds[scaffold])
                reverse_seq = scaffold_seq.reverse_complement()
                nt_reverse = nt_search(self.chromosomes[chromosome], str(reverse_seq))
                if len(nt_reverse) > 1:
                    startCoord = nt_reverse[1] + 1
                    endCoord = (startCoord + len(self.scaffolds[scaffold]))
                    # save coordinates
                    self.scaffolds[scaffold] = str(scaffold_seq.reverse_complement())
and I get the following error:
Traceback (most recent call last):
File "scaffoldPlacer.py", line 98, in <module>
z.findCoordinates()
File "scaffoldPlacer.py", line 60, in findCoordinates
nt_reverse = nt_search(self.chromosomes[chromosome], str(reverse_seq))
File "/usr/local/lib/python2.7/site-packages/Bio/SeqUtils/__init__.py", line 191, in nt_search
m = re.search(pattern, s)
File "/usr/local/lib/python2.7/re.py", line 142, in search
return _compile(pattern, flags).search(string)
File "/usr/local/lib/python2.7/re.py", line 243, in _compile
p = sre_compile.compile(pattern, flags)
File "/usr/local/lib/python2.7/sre_compile.py", line 523, in compile
groupindex, indexgroup
OverflowError: regular expression code size limit exceeded
As I mentioned before, this error only occurs when the reverse complement is searched; the forward orientation search of the same scaffold completes without errors.
Is there a way to avoid this error, or is there a better way to perform a DNA string search that takes IUPAC ambiguities into account?
Thank you
Following this (Python's Regular Expression Source String Length), it seems that huge patterns break re.compile. I'm on x64 Linux, and I'm unable to "break" re.compile even with re.compile("x"*5000000), while commenters on the linked question claim it breaks at 65536, which is in line with your queries of tens of kb. Can you try using another computer or OS?
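To check where a given interpreter stands, something like the following might help. It is only a rough probe: the helper name can_compile and the use of "[AG]" as a stand-in for an expanded ambiguity code are my own choices, not anything from Biopython.

```python
import re

def can_compile(n_codes):
    """Return True if a pattern made of n_codes character classes compiles.

    Hypothetical helper: "[AG]" stands in for one expanded IUPAC
    ambiguity code, so the pattern resembles an ambiguity-heavy query.
    """
    pattern = "[AG]" * n_codes
    try:
        re.compile(pattern)
    except OverflowError:
        # The error from the traceback in the question:
        # "regular expression code size limit exceeded".
        return False
    return True

if __name__ == "__main__":
    # Try a few sizes around the lengths mentioned in the question.
    for size in (1000, 10000, 50000):
        print("%d codes: %s" % (size, can_compile(size)))
```

On an affected interpreter the larger sizes should come back False; on an unaffected one every size should report True.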
Or maybe you can split the queries in two (or check your maximum query size with the code above) and then check whether the coordinates of the matches are contiguous, adding a few lines to your code.
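A minimal sketch of that splitting idea, using only the standard library so each compiled pattern stays small. The names IUPAC, chunk_to_regex and chunked_search, and the default chunk_size of 1000, are assumptions for illustration. Only the ambiguity codes in the query are expanded; the target is matched literally, and positions are 0-based.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def chunk_to_regex(chunk):
    """Compile one chunk of the query, one character class per ambiguity code."""
    parts = []
    for nt in chunk:
        bases = IUPAC[nt]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return re.compile("".join(parts))

def chunked_search(target, query, chunk_size=1000):
    """Return 0-based starts in target where all query chunks match back to back."""
    if not query:
        return []
    chunks = [query[i:i + chunk_size] for i in range(0, len(query), chunk_size)]
    regexes = [chunk_to_regex(c) for c in chunks]
    hits = []
    pos = 0
    while True:
        # Locate the next candidate using the first chunk only.
        m = regexes[0].search(target, pos)
        if m is None:
            break
        start = m.start()
        offset = start + len(chunks[0])
        matched = True
        # Every later chunk must match exactly where the previous one ended.
        for chunk, rx in zip(chunks[1:], regexes[1:]):
            if rx.match(target, offset) is None:
                matched = False
                break
            offset += len(chunk)
        if matched:
            hits.append(start)
        pos = start + 1  # step by one so overlapping hits are not missed
    return hits
```

For the reverse strand, you would pass str(reverse_seq) as the query, as in your code. Because each chunk matches exactly len(chunk) characters, anchoring the next chunk at the running offset is what enforces contiguity.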
Edit: I've found a system where the error reproduces. Near the beginning of python2.7/sre_compile.py (Python 2.7.0) you'll find a check on the CODESIZE of _sre, where _sre is a builtin (a C file). If your Python version was compiled with an _sre.c that has CODESIZE == 2, which limits the size of the regex to 65535, you have to upgrade your Python interpreter.
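To see which case applies without opening the source, you can ask the interpreter directly. This is a small sketch: _sre is a private CPython module, so treat it as an implementation detail, and the two function names are just illustrative.

```python
import _sre  # CPython's private C regex engine; not a public API

def regex_code_width():
    """Return the width in bytes of one compiled-regex code unit."""
    return _sre.CODESIZE

def has_16bit_regex_limit():
    """True if this build uses 2-byte codes, i.e. the 65535 limit applies."""
    return _sre.CODESIZE == 2

if __name__ == "__main__":
    print("CODESIZE = %d" % regex_code_width())
    if has_16bit_regex_limit():
        print("Regex code size is capped at 65535 on this build.")
```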