How to generate a PSSM matrix using PSI BLAST from BioPython

433 Views Asked by At

Is there any to generate PSSM matrix from PSI BLAST using the python package BioPython? Indeed, I have 8000 sequences in .fasta file. Every sequence length is also long?

I am using this below code:

for fasta in files:
alignment = AlignIO.read(fasta, "fasta")
summary_align = AlignInfo.SummaryInfo(alignment)
consensus = summary_align.dumb_consensus()
my_pssm = summary_align.pos_specific_score_matrix(consensus, chars_to_ignore = ['N', '-'])
file_pssm = fasta+"pssm"
with open(file_pssm) as f:
     f.write(my_pssm)

Is there any better way to do it? The matrix which is shown consisting of 0 and 1 only. I need actual PSSM scoring values (which are in normalized form)

1

There are 1 best solutions below

1
Omar El Atyqy On BEST ANSWER

I used the following code to extract PSSM matrices from a fasta file:

from Bio.Blast.Applications import NcbipsiblastCommandline


# Set the paths to the db
db = "path/to/your/db"
query = "path/to/the/fasta/file"
output_dir = "output/dir/"

# Set the parameters for the psiblast command
psiblast_cline = NcbipsiblastCommandline(cmd="psiblast", query=query, db=db, num_iterations=3, evalue=0.01, num_threads=16, out_ascii_pssm=output_dir, save_each_pssm=True)

# Run psiblast
print("Running psiblast...")
stdout, stderr = psiblast_cline()

# Check the output for any error messages
if stderr:
    print("Error running psiblast:", stderr)
else:
    print("PSSM files generated successfully.")

This should generate a bunch of files in the output directory, each containing the PSSM matrix of the corresponding sequence + some other information. You camn use the code below to extract the sequence and the PSSM matrix from and reutrn it as numpy array:

def pssm_file_to_nparray(pssm_path):
    with open(pssm_path, 'r') as file:
        lines = file.readlines()

        # Skip the header lines
        lines = lines[3:-6]

        # Extract the PSSM values
        pssm_values = []
        pssm_seq = []
        for line in lines:
            line_values = line.strip().split()[2:22]
            seq_values = line.strip().split()[1]
            pssm_values.append([int(value) for value in line_values])
            pssm_seq.append(seq_values)
        
        return "".join(pssm_seq), np.array(pssm_values)

Do know that you need to have the BLAST+ command line tool installed and its bin path set to be able to run the code. Do also note that the PSI-BLAST tool generates sometimes two pssm files for some sequence, so you would need to keep track of sequences that've you already extracted.