I have following task: Starting with 30 character long pattern sequence (it is actually DNA sequence, lest call it P30) I need to find in a text file all lines starting (^agacatacag... )with a exact P30, then with 29 last characters of the 30, 28 and up to 10 characters. All what I need to do is simply remove the first character of the pattern and go on searching. For the simplicity, I currently require exact matching, but allowing 1 mismatch for longer (20-30 character long patterns) would be better.
My current, rather slow solution is to create a shell file with one truncated pattern per line, and grep[1] it. Which means that I am reading huge, few GB text file 20x and this can take a day+.
I can switch to python, create a list/tuple with all required patterns and then read the file just once, looping instead 20x per each sequence, speeding things using pypy.
- Question 1: is there any regex which will be faster than such loop?
- Question 2: does it make sens to speed it up by switching to faster, compiled language? (I am trying to comprehend Dlang)
[1] Because it is DNA sequence and the input to be searched is in FASTQ format I am using fqgrep: https://github.com/indraniel/fqgrep with tre library: https://github.com/laurikari/tre/
edit_1 example of a changing (shortening pattern). Just first few steps/shorter patterns shown:
^abcde
^bcde
^cde
Or if you prefer it as DNA:
^GATACCA
^ATACCA
^TACCA
edit_2 Simple grep does not really cut it. I need to post-process each 4-lines of FASTQ format from which only line #2 matches. If I am not to use fqgrep, then I have to:
read 4 lines of the input
- check if line #2 (sequence) starts with any of the 20 patterns( P30-P10)
- if I got the match, I need to cut out the first N characters of the lines #2 and #4, where N stands for the length of the matching pattern
- print out on output/write to file lines #1-$4 in no match do nothing
For the in-house solution I can try use GNU parallel splitting the input file in say 4M of lies chunks and speeding things up that way. But if I want to make it usable by others each new software I am asking end-users to install ads an extra level of complication.
** edit 3 ** Simple example of thee regexes and matching lines from Vyctor:
starting P30 regex
^agacatacagagacatacagagacatacag
matching sequence:
^agacatacagagacatacagagacatacagGAGGACCA
P29:
^gacatacagagacatacagagacatacag
matching sequence:
^gacatacagagacatacagagacatacagGACCACCA
P28:
^acatacagagacatacagagacatacag
matching sequence:
^acatacagagacatacagagacatacagGATTACCA
I remove the characters/DNA bases from the left (or 5-prime end in DNA speak), because this is the way these sequences are degraded by real enzymes. The regex sequence by itself is not interesting once it is being found. The desired output is the read sequence after the regex. In the above examples it is in UPERCASE, which then can be mapped in the next step to the genome. It should be stressed that apart from this toy example I am getting way longer, a priori unknown and varied sequences after the regex pattern. In the real world, I do not have to deal with upper/lower case characters for DNA (everything is uppercase), but I am likely to encounter Ns (= unknown DNA base) in the sequences I am searching for the patterns. These can be ignored in the first approximation but for a more sensitive version of the algorithm probably should be dealt at as simple mismatches. In an ideal scenario one would not count simple mismatches at a given position but calculate more complex penalties taking into account DNA sequence quality values stored in line #4 of each 4 line long sequence record stored in FASTQ format: http://en.wikipedia.org/wiki/FASTQ_format#Quality
But this is way more complex, and so far the method "take only reads with perfect match to regex" was good enough and made the subsequent steps way easier to analyze.
You can generate a regex programmatically to look like below.
Its just a progressive alternation of line start or next character.
What this will give you is the ability to do a single pass search.
All you have to do is get the string length of the match to tell you
where you are.
Note - use Multi-line mode.
For example, matches these:
But not these:
Update
This is a graphic illustration that a single regex can replace all 30.
There is really no need for 30 separate regex's, all you need is 1 constant regex.
In this example the cluster group is replaced with capture groups so you can see what it is doing.
Input, 6 lines:
Output: