How to output only unique gene IDs?


I am working on a project using the following script, which I am editing in nano:

from Bio import SeqIO
import sys
import re

fasta_file = sys.argv[1]
for myfile in SeqIO.parse(fasta_file, "fasta"):
    if len(myfile) > 250:
        gene_id = myfile.id
        mylist = re.match(r"H149xcV_\w+_\w+_\w+", gene_id)
        print(">" + mylist.group(0))

and it's giving me the following output:

    >H149xcV_Fge342_r3_h2_d1
    >H149xcV_bTr423_r3_h2_d1
    >H149xcV_kN893_r3_h2_d1
    >H149xcV_DNp021_r3_h2_d1
    >H149xcV_JEP3324_r3_h2_d1
    >H149xcV_JEP3324_r3_h2_d1
    >H149xcV_JEP3324_r3_h2_d1
    >H149xcV_JEP3324_r3_h2_d1
    >H149xcV_SRt424234_r3_h2_d1
    >H149xcV_SRt424234_r3_h2_d1
    >H149xcV_SRt424234_r3_h2_d1
    >H149xcV_SRt424234_r3_h2_d1

How can I change my script so that it gives me the IDs without the last field, and only those that are UNIQUE:

>H149xcV_Fge342_r3_h2
>H149xcV_bTr423_r3_h2
>H149xcV_kN893_r3_h2
>H149xcV_DNp021_r3_h2
>H149xcV_JEP3324_r3_h2
>H149xcV_SRt424234_r3_h2

There are 3 best solutions below

6
On BEST ANSWER

You could use a capturing group and print that group instead of the whole match.

To prevent unnecessary backtracking, you can exclude the underscore from the word characters using a negated character class [^\W_]+

(H149xcV_[^\W_]+_[^\W_]+_[^\W_]+)_[^\W_]+

m = re.match(r"(H149xcV_[^\W_]+_[^\W_]+_[^\W_]+)_[^\W_]+", gene_id)
print(">" + m.group(1))
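To see why excluding the underscore matters, here is a small sketch comparing the two styles of pattern on one ID from the question. It assumes the goal is to keep the prefix plus the first three fields, as in the desired output; the variable names are only illustrative.

```python
import re

gene_id = "H149xcV_JEP3324_r3_h2_d1"  # sample ID taken from the question

# \w also matches "_", so the whole match runs to the end of the ID.
greedy = re.match(r"H149xcV_\w+_\w+_\w+", gene_id)

# [^\W_] means "word character except underscore", so each field stops at "_".
# Assumption: the first three fields are wanted and the fourth is dropped.
strict = re.match(r"(H149xcV_[^\W_]+_[^\W_]+_[^\W_]+)_[^\W_]+", gene_id)

whole_match = greedy.group(0)  # includes the trailing _d1
captured = strict.group(1)     # stops before the last field

print(whole_match)
print(captured)
```

With the negated class, every field boundary is fixed by the pattern itself, so the engine never has to try several ways of splitting the ID.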
2
On

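You can be explicit with character classes. \w matches [a-zA-Z0-9_], which includes the underscore, so each \w+ can run across several fields, and chaining multiple \w+ does not limit how many fields are matched.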

H149xcV_[a-zA-Z0-9]+_[a-zA-Z0-9]+_[a-zA-Z0-9]+


Try using a regex cheat sheet when developing a regex; it helps a lot.

A slightly more compact way:

(H149xcV(_[a-zA-Z0-9]+){3})

(                   start of group 1
H149xcV             match the literal text
(                   start of group 2 (the repeated sub-group)
_                   match an underscore
[a-zA-Z0-9]         a letter or a digit
+                   one or more occurrences
)                   end of group 2
{3}                 repeat group 2 exactly 3 times
)                   end of group 1
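As a quick check of the repeated-group pattern, the sketch below applies it to two IDs from the question. The list and variable names are hypothetical. It also shows why the range has to be written A-Z: in ASCII, the range A-z spans the characters between "Z" and "a", which include the underscore.

```python
import re

# Prefix followed by exactly three "_field" repetitions.
pattern = re.compile(r"(H149xcV(_[a-zA-Z0-9]+){3})")

# Sample IDs taken from the question.
ids = ["H149xcV_Fge342_r3_h2_d1", "H149xcV_kN893_r3_h2_d1"]

# group(1) is the prefix plus three fields; the fourth field is left out.
trimmed = [pattern.match(gene_id).group(1) for gene_id in ids]
print(trimmed)

# Pitfall: a class written as [A-z] also accepts "_".
underscore_in_wide_range = re.fullmatch(r"[A-z]", "_") is not None
underscore_in_exact_range = re.fullmatch(r"[a-zA-Z]", "_") is not None
print(underscore_in_wide_range, underscore_in_exact_range)
```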


1
On

If you're only interested in part of a regex match, use groups to single that part out:

from Bio import SeqIO
import sys
import re 

fasta_file = (sys.argv[1])
for myfile in SeqIO.parse(fasta_file, "fasta"):
    if len(myfile) > 250:
        gene_id = myfile.id
        # use a name other than "list" to avoid shadowing the built-in
        m = re.match(r"(H149xcV_\w+_\w+)_\w+", gene_id)
        print(">" + m.group(1))

That should get you the output you need.

You also asked about ensuring that there are no duplicates in the output. To do that, you need to keep a record of what you have already written, which means all the IDs end up in memory. If you're doing that anyway, you may as well build the list in memory and write it once it is complete. This assumes your dataset isn't so large that it wouldn't fit in memory.

A solution would look like:

from Bio import SeqIO
import sys
import re 

fasta_file = (sys.argv[1])
# by collecting results in a set, they are guaranteed to be unique
result = set()
for myfile in SeqIO.parse(fasta_file, "fasta"):
    if len(myfile) > 250:
        gene_id = myfile.id
        m = re.match(r"(H149xcV_\w+_\w+)_\w+", gene_id)
        # skip IDs that don't match the pattern or were already printed
        if m and m.group(1) not in result:
            print(">" + m.group(1))
            result.add(m.group(1))

Another approach would be to build result first and print it once it is complete. That has the disadvantage that the output is no longer in the same order as the original, because a set does not preserve insertion order, although it would be a little faster (since you no longer have to check if m.group(1) not in result for every record).
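A minimal sketch of the collect-then-print variant follows. To keep it self-contained it runs over a plain list of the IDs from the question instead of a SeqIO.parse loop, so the length filter is left out. It also swaps in dict.fromkeys, which is not part of the answer above, as one way to drop duplicates while keeping first-seen order (dicts preserve insertion order in Python 3.7+).

```python
import re

# Sample IDs copied from the question; in the real script these would come
# from record.id inside the SeqIO.parse loop.
gene_ids = [
    "H149xcV_Fge342_r3_h2_d1",
    "H149xcV_bTr423_r3_h2_d1",
    "H149xcV_kN893_r3_h2_d1",
    "H149xcV_DNp021_r3_h2_d1",
    "H149xcV_JEP3324_r3_h2_d1",
    "H149xcV_JEP3324_r3_h2_d1",
    "H149xcV_JEP3324_r3_h2_d1",
    "H149xcV_JEP3324_r3_h2_d1",
    "H149xcV_SRt424234_r3_h2_d1",
    "H149xcV_SRt424234_r3_h2_d1",
    "H149xcV_SRt424234_r3_h2_d1",
    "H149xcV_SRt424234_r3_h2_d1",
]

pattern = re.compile(r"(H149xcV_\w+_\w+)_\w+")

# First pass: collect every trimmed ID, duplicates included.
trimmed = []
for gene_id in gene_ids:
    m = pattern.match(gene_id)
    if m:  # ignore IDs that don't match the expected shape
        trimmed.append(m.group(1))

# A set removes duplicates but gives no ordering guarantee.
unique_unordered = set(trimmed)

# dict.fromkeys removes duplicates and keeps first-seen order.
unique_ordered = list(dict.fromkeys(trimmed))

for gene_id in unique_ordered:
    print(">" + gene_id)
```

If the order of the output does not matter, printing from the set directly is enough; the dict-based version is only worth it when the original order should survive.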