I was given a FASTA-formatted file (like those from this site: http://www.uniprot.org/proteomes/) that contains the protein coding sequences of a certain bacterium. I have been asked to give the full count and the relative percentage of each single-letter amino acid code in the file, and return the results like:
L: 139002 (10.7%)
A: 123885 (9.6%)
G: 95475 (7.4%)
V: 91683 (7.1%)
I: 77836 (6.0%)
What I have so far:
#!/usr/bin/python
ecoli = open("/home/file_pathway").read()
counts = dict()
for line in ecoli:
    words = line.split()
    for word in words:
        if word in ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"]:
            if word not in counts:
                counts[word] = 1
            else:
                counts[word] += 1
for key in counts:
    print key, counts[key]
I believe this retrieves every instance of those capital letters in the file, not just the ones inside the amino acid sequences. How can I limit it to just the coding sequence? I am also having trouble working out how to calculate each single-letter code's share of the total.
The only lines that don't contain what you want are the header lines, which start with ">", so just ignore those. You could also use a collections.Counter dict, as the remaining lines only contain what you are interested in.
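Putting both suggestions together, here is a minimal sketch rather than a definitive implementation. The names count_residues, format_report and AMINO_ACIDS are made up for illustration, the sample records are invented, and the percentages are assumed to be taken over the 20 standard letters only, so any other character in a sequence line is ignored.

```python
from collections import Counter

# The 20 one-letter codes from the question's list.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def count_residues(lines):
    """Count amino acid letters in FASTA lines, skipping '>' header lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # header or blank line: no sequence data here
        # Assumption: letters outside the 20 standard codes are ignored.
        counts.update(ch for ch in line if ch in AMINO_ACIDS)
    return counts


def format_report(counts):
    """Return 'X: count (pct%)' rows, most frequent residue first."""
    total = sum(counts.values())
    return ["%s: %d (%.1f%%)" % (aa, n, 100.0 * n / total)
            for aa, n in counts.most_common()]


# Hypothetical in-memory records standing in for the real file.
sample = [">sp|P00001|EXAMPLE made-up header", "MKLLA", "GLLA"]
for row in format_report(count_residues(sample)):
    print(row)
```

For the real file, pass the open file object instead of the sample list, for example count_residues(open("/home/file_pathway")), since iterating over a file yields one line at a time. Calling .read() first, as in the question, yields single characters instead, which is why the header letters end up in the counts.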