Parsing fasta with specific names in the header


I have a txt file containing multiple FASTA sequences, and I would like to parse the sequences together with their gene names. Can you please help me select sequences with specific names in the header? Thank you.

Original data in the txt file.

lcl|NC_045512.2_gene_6 [gene=ORF6] [locus_tag=GU280_gp06] [db_xref=GeneID:43740572] [location=27202..27387] [gbkey=Gene] ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAG

Expected data after parsing in python

ORF6 ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAG

I tried the following code:

from Bio import SeqIO
for record in SeqIO.parse("mytext.txt", 'fasta'):
    print(record.name)
    print(record.seq)

The results I obtained look like this:

lcl|NC_045512.2_gene_6 ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAG

There are 2 best solutions below

I tried this with a Python regular expression.

The pattern captures the gene name and the sequence as two groups. Here it is run on a file containing two sequences, one record per line.

import re

# Group 1 is the gene name, group 2 is everything after the [gbkey=...] field.
patt = re.compile(
    r".+?\[gene=(.+?)\]\s\[locus_tag=.+?\]\s\[db_xref=GeneID:.+?\]\s"
    r"\[location=.+?\]\s\[gbkey=.+?\]\s(.+)",
    re.DOTALL,  # lets the last group include the trailing newline
)

with open('seq', "r") as f:
    for line in f:
        x = patt.search(line)
        if x:  # skip lines that do not follow the header layout
            print(x.groups())

The output will be:

('ORF6', 'ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAG\n')
('ORF6', 'ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAG\n')
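The pattern above depends on every bracketed field being present and in the same order. If the headers vary, a looser approach may hold up better: search only for the [gene=...] field and take the last whitespace-separated token as the sequence. This is a minimal sketch under that assumption (one record per line, sequence last); `split_record` and `GENE_FIELD` are hypothetical names used only for illustration.

```python
import re

# Matches only the [gene=...] field, wherever it appears in the line.
GENE_FIELD = re.compile(r"\[gene=([^\]]+)\]")


def split_record(line):
    """Return (gene_name, sequence) for one record line, or None if no gene field."""
    match = GENE_FIELD.search(line)
    if match is None:
        return None
    # Assumption: the sequence is the last whitespace-separated token on the line.
    sequence = line.split()[-1]
    return match.group(1), sequence


line = (
    "lcl|NC_045512.2_gene_6 [gene=ORF6] [locus_tag=GU280_gp06] "
    "[db_xref=GeneID:43740572] [location=27202..27387] [gbkey=Gene] "
    "ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAG"
)
print(split_record(line))
```

Because only one field is matched, extra or reordered fields in the header should not break the extraction, although a line without a sequence token would still need separate handling.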

I am still a little unsure about your question, since I have not studied biology.

This answer only aims to turn source_text into expected_text.

from io import StringIO
import re

from Bio import SeqIO  # pip install biopython  # https://biopython.org/wiki/Download

source_text = """\
>lcl|NC_045512.2_gene_6 [gene=ORF6] [locus_tag=GU280_gp06] [db_xref=GeneID:43740572] [location=27202..27387] [gbkey=Gene]
ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAG
"""

expected_text = """\
ORF6
ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAG
"""

# Capture the value of the [gene=...] field; \w matches [a-zA-Z0-9_]
regex = re.compile(r"\[gene=(\w*)\]")
result = ''
for record in SeqIO.parse(StringIO(source_text), 'fasta'):
    # record.name only holds the ID; record.description holds the full header line
    gene_name = regex.search(record.description).group(1)  # ORF6
    sequence = str(record.seq)  # convert the Seq object to a plain string
    print(gene_name)
    print(sequence)
    result += gene_name + '\n' + sequence + '\n'

if result == expected_text:
    print('ok')

The output will be:

ORF6
ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAG
ok
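Since the question asks about selecting sequences with specific names, the same idea can be extended to a filter. The sketch below uses only the standard library so that it runs without Biopython; `read_fasta` and `select_genes` are hypothetical helpers, and the second record in `example` is made up purely to show a record being skipped. With Biopython, the same check could be applied to `record.description` inside the `SeqIO.parse` loop.

```python
import re
from io import StringIO

GENE_FIELD = re.compile(r"\[gene=([^\]]+)\]")


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def select_genes(handle, wanted):
    """Yield (gene_name, sequence) for records whose gene name is in `wanted`."""
    for header, sequence in read_fasta(handle):
        match = GENE_FIELD.search(header)
        if match and match.group(1) in wanted:
            yield match.group(1), sequence


# The second record is invented for illustration only.
example = """\
>lcl|NC_045512.2_gene_6 [gene=ORF6] [locus_tag=GU280_gp06] [db_xref=GeneID:43740572] [location=27202..27387] [gbkey=Gene]
ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAG
>hypothetical_record [gene=GENE_X]
ACGT
ACGT
"""

for gene, seq in select_genes(StringIO(example), {"ORF6"}):
    print(gene)
    print(seq)
```

Passing a set such as {"ORF6", "GENE_X"} as `wanted` would keep both records; headers without a [gene=...] field are skipped.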

REFERENCE

The following is a reference for people who are not familiar with biopython.