Human genome is made of 24 different chromosomes (actually 23 pairs= 46 chromosomes). this chromosomes are called 1
, 2
, 3
, ..., 22
, X
and Y
. Each chromosome is a very long string of 'G'
, 'C'
, 'A'
and 'T'
characters (for example chromosome 1 is made of almost 24 milion characters). I have each chromosome in a file (strand of chromosome 1 is in 1.fa
file).
*.fa
file is called fasta file which is an standard file for DNA strands information. this file has a structure like this:
>gi|568815591|ref|NC_000007.14| Homo sapiens chromosome 7, GRCh38 Primary Assembly
CATTGCACTCCAGCCTGGGCAAAAACAGCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAAAAAT
TAGCCAGGCATGGTGAAGTTGCAGTGAGCTGAGACTGCACCATTGCACTCCAGCCTGGGTAGCA
As you see this kind of files contain a first line that give us some information about the source of GCAT string.
I wrote this code to count the GC content (ratio of number of G+C characters to all characters):
homo_sapiens_chromosomes_List=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, "x", "y"]
for i in homo_sapiens_chromosomes_List:
i=str (i)
file_open= open (i+".fa", "r") #Opening each file
file_read= file_open.read() #Reading file
file_read=file_read.upper() #Uppercase characters
G=float(file_read.count("G")) #Count G in file
C=float(file_read.count("C")) #Count C in file
A=float(file_read.count("A")) #Count A in file
T=float(file_read.count("T")) #Count A in file
print "There are %d Gs, %d Cs, %d As and %d Ts, in the DNA strand, of chromosome number %s." % (G, C, A, T, i)
print "GC content of this chromosome is:", (G+C)*100/(A+T+G+C), "percent" #Prints GC Content
Now I have some questions:
How can I make this code more efficient (faster, shorter or...)
When I try to count GC content, the first line of fasta file which is not a part of DNA strand, is also counted. I wrote this function to delete this line before counting the GC content (this code comes after this line:
file_read=file_read.upper()
):
Code:
def Fasta_Clean(): #a function to delete the first line of fasta file
global file_read
if file_read.isalnum()==False:
file_read=file_read[1:]
Fasta_Clean()
Fasta_Clean()
But this code returned: "RuntimeError: maximum recursion depth exceeded in cmp"
, so I wrote this one:
def Fasta_Clean(): #a function to delete the first line of fasta file
global file_read
fas=file_read[0:number]
if fas.isalnum()==False:
file_read=file_read[1:]
Fasta_Clean()
Fasta_Clean()
Now, when variable fas
is more than fas=file_read[0:90]
, I see "RuntimeError: maximum recursion depth exceeded in cmp"
again. How can I solve this problem?
- If the fasta file is made of more than one strand and has an structure like this (in this example file is made of three different strands):
Example:
>gi|568815591|ref|NC_000007.14| Homo sapiens chromosome 7, GRCh38 Primary Assembly
GCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAAAAATCATTGCACTCCAGCCTGGGCAAAAACA
CACCATTGCACTCCAGCCTGGGTAGCATAGCCAGGCATGGTGAAGTTGCAGTGAGCTGAGACTG
>gi|568815864|ref|NC_000009.14| Homo sapiens chromosome 8, GRCh38 Primary Assembly
CATTGCACTCCAGCCTGGGCAAAAACAGCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAAAAAT
TAGCCAGGCATGGTGAAGTTGCAGTGAGCTGAGACTGCACCATTGCACTCCAGCCTGGGTAGCA
>gi|568815325|ref|NC_000009.14| Homo sapiens chromosome 9, GRCh38 Primary Assembly
CTGGGCAAAAACAGCGAAACTCCGTCTCAAAAAAAAAAAAAAGAAACATTGCACTCCAGCAAAT
GTGAGCTGAGACTGCACCATTGCTAGCCAGGCATGGTGAAGTTGCAACTCCAGCCTGGGTAGCA
In this conditions how can I count GC content of each strand separately?
Here this should be just about all the code you need.
If you are using windows or mac you may need to change the
\n\n