Parsing fasta with specific names in the header

413 Views Asked by elegans At 09 April 2020 at 09:11

I have a txt file containing multiple FASTA sequences, and I would like to parse the sequences together with their gene names. Can you please help me select the sequences that have specific names in the header? Thank you.

Original data in the txt file.

>lcl|NC_045512.2_gene_6 [gene=ORF6] [locus_tag=GU280_gp06] [db_xref=GeneID:43740572] [location=27202..27387] [gbkey=Gene]
ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAG

Expected data after parsing in python

ORF6
ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAG

I used the following code:

from Bio import SeqIO
for record in SeqIO.parse("mytext.txt", 'fasta'):
    print(record.name)
    print(record.seq)

Obtained results were like this.

lcl|NC_045512.2_gene_6
ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAG


There are 2 best solutions below

Lakshmi Ram On 09 April 2020 at 12:13

Here is an attempt using Python regular expressions.

The pattern captures the gene name and the sequence as two groups. I ran it on two sequences.

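import re

# Group 1 is the gene name, group 2 is the sequence.
# Each line of 'seq' must hold the header and its sequence together.
patt = (r".+?\[gene=(.+?)]\s\[locus_tag=.+?]\s\[db_xref=GeneID:.+?]\s"
        r"\[location=.+?]\s\[gbkey=.+?]\s(.+)")

with open('seq', "r") as f:
    lines = f.readlines()
print(lines)

for line in lines:
    x = re.search(patt, line, re.DOTALL | re.IGNORECASE)
    if x:  # skip lines that do not match the pattern
        # On Python 3 the groups print without the u'' prefix.
        print(x.groups())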

The output will be:

group1=(u'ORF6', u'ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAG\n')
group2=(u'ORF6', u'ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAG\n')
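
The pattern above depends on the bracketed fields appearing in one exact order. As a possible alternative, the sketch below pulls every [key=value] tag out of a header with re.findall and stores them in a dict, so the gene name can be looked up whatever the field order is. The name parse_header_tags is hypothetical, and the sketch assumes that tag values never contain a closing bracket.

```python
import re

# Matches one "[key=value]" tag; assumes values contain no "]" character.
TAG_PATTERN = re.compile(r"\[(\w+)=([^\]]*)\]")


def parse_header_tags(header):
    """Return the bracketed key=value tags of a FASTA header as a dict.

    parse_header_tags is a hypothetical helper name, not a Biopython function.
    """
    return dict(TAG_PATTERN.findall(header))


header = (
    "lcl|NC_045512.2_gene_6 [gene=ORF6] [locus_tag=GU280_gp06] "
    "[db_xref=GeneID:43740572] [location=27202..27387] [gbkey=Gene]"
)
tags = parse_header_tags(header)
print(tags["gene"])      # ORF6
print(tags["location"])  # 27202..27387
```

Because the tags end up in a dict, the same helper also gives access to locus_tag, db_xref and the other fields without writing a new pattern for each.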

Carson On 17 April 2020 at 03:33

I am still a bit confused about your question, since I have not studied biology.

This answer only aims to turn source_text into expected_text.

from io import StringIO
from Bio import SeqIO  # pip install biopython  # https://biopython.org/wiki/Download
import re

source_text = """\
>lcl|NC_045512.2_gene_6 [gene=ORF6] [locus_tag=GU280_gp06][db_xref=GeneID:43740572] [location=27202..27387] [gbkey=Gene]
ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAG
"""

expected_text = """\
ORF6
ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAG
"""

regex = re.compile(r"\[gene=(\w*)\]")  # \w: [a-zA-Z0-9_]; group 1 is the gene name
result = ''
for record in SeqIO.parse(StringIO(source_text), 'fasta'):
    # print(record.name)
    # record.description holds the whole header line, including the [gene=...] tag
    gene_name = regex.search(record.description).group(1)  # ORF6
    print(gene_name)
    print(record.seq)
    result += gene_name + '\n' + str(record.seq) + '\n'  # str() keeps result a plain string

if result == expected_text:
    print('ok')

ORF6
ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAG
ok
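
To keep only the sequences whose header names a particular gene, which is what the question title asks for, something like the following might work. It is a minimal sketch in plain Python (no Biopython), assuming that every record starts with a '>' header line containing a [gene=NAME] tag. The function select_by_gene, the wanted set, and the second record with its placeholder gene name and sequence are illustrative and do not come from the original post.

```python
import re

GENE_PATTERN = re.compile(r"\[gene=(\w+)\]")


def select_by_gene(fasta_text, wanted):
    """Yield (gene_name, sequence) for records whose [gene=...] tag is in wanted.

    Hypothetical helper: assumes standard FASTA, where a header line starts
    with '>' and the sequence may span several following lines.
    """
    gene, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if it was selected.
            if gene is not None:
                yield gene, "".join(chunks)
            match = GENE_PATTERN.search(line)
            name = match.group(1) if match else None
            gene = name if name in wanted else None
            chunks = []
        elif line and gene is not None:
            chunks.append(line)
    if gene is not None:  # flush the last record
        yield gene, "".join(chunks)


# Illustrative input: the ORF6 record comes from the question (split over two
# lines here); the second record uses a placeholder gene name and sequence.
fasta_text = """\
>lcl|NC_045512.2_gene_6 [gene=ORF6] [locus_tag=GU280_gp06] [gbkey=Gene]
ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAG
AGATATTACTAATTATTATGAGGACTTTTAAAG
>lcl|placeholder_record [gene=GENE_X] [gbkey=Gene]
ATGAAACCC
"""

selected = list(select_by_gene(fasta_text, {"ORF6"}))
for gene, seq in selected:
    print(gene)
    print(seq)
```

Passing a set such as {"ORF6", "GENE_X"} would select both records. With Biopython installed, the same filter could instead be applied to record.description inside the SeqIO.parse loop.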

REFERENCE

The following is a reference for people who are not familiar with biopython.

What is SeqIO.parse
What is fasta

more test data
