I am working with DNA sequence data in FASTA format and need to create two lists: one containing the organisms' names and one containing their sequences. I came across the post "Add multiple sequences from a FASTA file to a list in python", but the solution doesn't work properly for me (and I cannot comment yet).
A FASTA file is a plain-text file in the following format: one line starting with ">" that gives the organism's name, followed by one or more lines of sequence data. A file can contain multiple organisms, each organised in its own block:
>Organism1
ACTGATGACTGATCGTACGT
ATCGATCGTAGCTACGATCG
ATCATGCTATTGTG
>Organism2
TACTGTAGCTAGTCGTAGCT
ATGACGATCGTACGTCGTAC
TAGCTGACTG
...
The code I wrote with the help of the post above is:
data_file = open("multitest.fas","r")
data_tmp = []
a=[] #list for organisms name
b=[] #list for sequence data
for line in data_file:
    line = line.rstrip()
    line = line.strip("\n").strip("\r")
    for i in line:
        if line[0] == ">":
            a.append(line[1:])
            if data_tmp:
                b.append("".join(data_tmp))
                data_tmp=[]
            break
        else:
            line=line.upper()
    if all([k==k.upper() for k in line]):
        data_tmp.append(line)
print a
print b
The code works fine, except that the sequence of the last organism is not appended to the list b. The cause seems obvious: sequence data is only added when a ">" is encountered. How can I make sure that the last sequence is added as well? And why did nobody else have the same problem with the code in the post linked above? Thanks for any advice!
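A common way to handle the missing last record is to flush the buffer once more after the loop ends, since no further ">" line follows the final block. The sketch below is one possible approach, not the code from the linked post: the function name `read_fasta` and the variable names are made up for illustration, and it takes any iterable of lines (an open file object should work the same way) so that it can be run without a file on disk.

```python
def read_fasta(lines):
    """Collect FASTA headers and joined sequences from an iterable of lines."""
    names = []   # organism names (header text without the leading ">")
    seqs = []    # one concatenated sequence per organism
    chunks = []  # sequence lines collected for the current organism

    for line in lines:
        line = line.strip()
        if not line:
            continue                  # skip blank lines
        if line.startswith(">"):
            if names:                 # a new header closes the previous record
                seqs.append("".join(chunks))
            names.append(line[1:])
            chunks = []
        elif names:                   # ignore sequence data before any header
            chunks.append(line.upper())

    if names:                         # flush the last record: no ">" follows it
        seqs.append("".join(chunks))
    return names, seqs


# Sample data mirroring the example in the question
sample = """>Organism1
ACTGATGACTGATCGTACGT
ATCGATCGTAGCTACGATCG
ATCATGCTATTGTG
>Organism2
TACTGTAGCTAGTCGTAGCT
ATGACGATCGTACGTCGTAC
TAGCTGACTG
"""
names, seqs = read_fasta(sample.splitlines())
print(names)
print(seqs)
```

Checking `line.startswith(">")` once per line also replaces the inner character loop, and the final `if names:` block is the step the original loop lacks.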
I've done it with Regex. Hope you find it helpful.
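The regex code itself is not shown here, so the following is only a sketch of how such an approach might look, not necessarily what was originally posted. It assumes the whole file fits in memory as one string; the pattern `RECORD` and the function name `parse_fasta_text` are made up for illustration. Each match covers one header line and everything up to the next ">" or the end of the input, so the last record is captured like any other.

```python
import re

# One record: ">" plus the header line, then everything up to the next ">" or end of input
RECORD = re.compile(r">([^\n]*)\n([^>]*)")


def parse_fasta_text(text):
    """Return (names, sequences) parsed from FASTA-formatted text."""
    names, seqs = [], []
    for header, body in RECORD.findall(text):
        names.append(header.strip())                # strip() also drops a trailing "\r"
        seqs.append("".join(body.split()).upper())  # remove line breaks, join the pieces
    return names, seqs


# Sample data mirroring the example in the question
fasta_text = (
    ">Organism1\n"
    "ACTGATGACTGATCGTACGT\n"
    "ATCGATCGTAGCTACGATCG\n"
    "ATCATGCTATTGTG\n"
    ">Organism2\n"
    "TACTGTAGCTAGTCGTAGCT\n"
    "ATGACGATCGTACGTCGTAC\n"
    "TAGCTGACTG\n"
)
a, b = parse_fasta_text(fasta_text)
print(a)
print(b)
```

For a real file, the text could be read with `open("multitest.fas").read()` and passed to the function. For very large files a line-by-line parser would use less memory.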