Why does my for loop in Python never stop running?

124 Views Asked by At

I am creating a code that takes accession numbers from an existing excel file and ultimately running it through BLAST using Entrez. I am very new to Python and pretty much just learning as a I write. Here is what my code looks like right now, it currently will run and run but never complete and never return any errors. Any help would be greatly appreciated!

import pandas
from Bio import SeqIO
from Bio import Entrez
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML 

genesexcel=pandas.read_excel(r'C:\Users\bmn45\OneDrive - Drexel University\Monoterpenoid Indole Alkaloids\MIA_genes.xlsb', sheet_name='Data S16')  
#import the excel into variable

acc_num=genesexcel['Accession number'].values
#store values from first column of excel
print(acc_num)

for number in acc_num:

    handle = Entrez.esearch(db='protein', term=number, retmax="20")
    #searching the Genbank database using the accession number

    Entrez_result = Entrez.read(handle)
  
    ID_List = Entrez_result['IdList']
    #storing the ID number(s) from the Entrez search of the accession number

    handle2 = Entrez.efetch(db='protein', id=ID_List, rettype='gb')
    #uses efecth to get the full sequence record/file using the searched ID number
    #currently, the ID value is hardcoded, database is protein and the return file type (rettype) is set as a genbank file(gb)

    seq_record = SeqIO.read(handle2, "genbank")
    #storing the full sequence record as a genbank file
    seq_ID=(seq_record.name) 
    #storing the name of the sequence record (the full sequence ID)

    result_handle = NCBIWWW.qblast("tblastn", "nt", seq_ID, entrez_query="txid4056[ORGN]")
    #searching the nucleotide database ("nt") from protein sequence using tblastn, refined search related to Apocynacae (taxID4056)

    with open("my_blast_result.xml", "w") as out_handle:
        out_handle.write(result_handle.read())
    result_handle.close()
    #saving the blast results into a file

    result_handle = open("my_blast_result.xml")
    #loading saved BLAST results

    blast_records = NCBIXML.parse(result_handle)
    #parsing (analying) through the returned hits

    print("Alignments for sequence", next(blast_records).query)
    for alignment in blast_records:
        print("Accession number:", alignment.accession)
        print("Sequence:", alignment.title)
        print("Length:", alignment.length)
        print()
  
    print(result_handle.read())

I have tried running the code with a hard-coded accession number and ID (without looping through) and it worked just fine so I know my problem lies with how my loop is set up.

2

There are 2 best solutions below

0
elvis bernard On

I am currently experiencing the same kind of issue: my blast searches take on average 3 minutes to return a result but sometimes, after 12 hours, it is still searching. restarting the exact same blast ( or doing it on the website) may just take a few minutes. My advice would be :

  • to put some timestamp at each step so you know if your program is still moving forward.
  • save your result each time your have done with your result so when blast send you some error, you don't have to restart everything.
0
Umar On

It seems like the problem arises from repeatedly overwriting and reading the same BLAST results file within the loop. To fix this, move the file handling outside the loop to avoid constant overwrites and multiple reads. After the loop, use result_handle.read() just once to prevent reading the file repeatedly.

Here's an updated version of your code:

import pandas as pd

# Creating a dummy dataframe
#data = {'Accession number': ['ABC123', 'DEF456', 'GHI789']}
#genesexcel = pd.DataFrame(data)

# Printing the dummy dataframe
print(genesexcel)

import pandas
from Bio import SeqIO
from Bio import Entrez
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML 

genesexcel = pandas.read_excel(r'C:\Users\bmn45\OneDrive - Drexel University\Monoterpenoid Indole Alkaloids\MIA_genes.xlsb', sheet_name='Data S16')  
acc_num = genesexcel['Accession number'].values
print(acc_num)

# Open the result file outside the loop
with open("my_blast_result.xml", "w") as out_handle:
    for number in acc_num:
        handle = Entrez.esearch(db='protein', term=number, retmax="20")
        Entrez_result = Entrez.read(handle)
        ID_List = Entrez_result['IdList']
        handle2 = Entrez.efetch(db='protein', id=ID_List, rettype='gb')
        seq_record = SeqIO.read(handle2, "genbank")
        seq_ID = seq_record.name 

        result_handle = NCBIWWW.qblast("tblastn", "nt", seq_ID, entrez_query="txid4056[ORGN]")

        # Save each BLAST result in the same file
        out_handle.write(result_handle.read())

# Read the file and parse it after the loop
with open("my_blast_result.xml") as result_handle:
    blast_records = NCBIXML.parse(result_handle)
    print("Alignments for sequence", next(blast_records).query)
    for alignment in blast_records:
        print("Accession number:", alignment.accession)
        print("Sequence:", alignment.title)
        print("Length:", alignment.length)
        print()