How to extract the first hit elements from an XML NCBI BLAST file?

8.1k Views Asked by At

Im trying to extract only the first hit from an NCBI xml BLAST file. next I would like to get only the first HSP. at the final stage I would like to get these based on best score. to make things clear here a sample of the xml file:

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastx</BlastOutput_program>
  <BlastOutput_version>blastx 2.2.22 [Sep-27-2009]</BlastOutput_version>
  <BlastOutput_reference>~Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, ~Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), ~&quot;Gapped BLAST and PSI-BLAST: a new generation of protein database search~programs&quot;,  Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>/Applications/blast/db/viral1.protein.faa</BlastOutput_db>
  <BlastOutput_query-ID>lcl|1_0</BlastOutput_query-ID>
  <BlastOutput_query-def>DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev</BlastOutput_query-def>
  <BlastOutput_query-len>1024</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_matrix>BLOSUM62</Parameters_matrix>
      <Parameters_expect>1e-05</Parameters_expect>
      <Parameters_gap-open>11</Parameters_gap-open>
      <Parameters_gap-extend>1</Parameters_gap-extend>
      <Parameters_filter>F</Parameters_filter>
    </Parameters>
  </BlastOutput_param>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_iter-num>1</Iteration_iter-num>
      <Iteration_query-ID>lcl|1_0</Iteration_query-ID>
      <Iteration_query-def>DSAD-090629_plate11A01a.g1 CHROMAT_FILE: DSAD-090629_plate11A01a.g1 PHD_FILE: DSAD-090629_plate11A01a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A01a DIRECTION: rev</Iteration_query-def>
      <Iteration_query-len>1024</Iteration_query-len>
      <Iteration_stat>
        <Statistics>
          <Statistics_db-num>68007</Statistics_db-num>
          <Statistics_db-len>19518578</Statistics_db-len>
          <Statistics_hsp-len>0</Statistics_hsp-len>
          <Statistics_eff-space>0</Statistics_eff-space>
          <Statistics_kappa>0.041</Statistics_kappa>
          <Statistics_lambda>0.267</Statistics_lambda>
          <Statistics_entropy>0.14</Statistics_entropy>
        </Statistics>
      </Iteration_stat>
      <Iteration_message>No hits found</Iteration_message>
    </Iteration>
    <Iteration>
<Iteration>
      <Iteration_iter-num>6</Iteration_iter-num>
      <Iteration_query-ID>lcl|6_0</Iteration_query-ID>
      <Iteration_query-def>DSAD-090629_plate11A05a.g1 CHROMAT_FILE: DSAD-090629_plate11A05a.g1 PHD_FILE: DSAD-090629_plate11A05a.g1.phd.1 CHEM: term DYE: big TIME: Thu Sep 17 15:33:59 2009 TEMPLATE: DSAD-090629_plate11A05a DIRECTION: rev</Iteration_query-def>
      <Iteration_query-len>1068</Iteration_query-len>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gnl|BL_ORD_ID|23609</Hit_id>
          <Hit_def>gi|38707884|ref|NP_945016.1| Putative ribose-phosphate pyrophosphokinase [Enterobacteria phage Felix 01]</Hit_def>
          <Hit_accession>23609</Hit_accession>
          <Hit_len>293</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>49.2914</Hsp_bit-score>
              <Hsp_score>116</Hsp_score>
              <Hsp_evalue>5.15408e-06</Hsp_evalue>
              <Hsp_query-from>580</Hsp_query-from>
              <Hsp_query-to>792</Hsp_query-to>
              <Hsp_hit-from>202</Hsp_hit-from>
              <Hsp_hit-to>273</Hsp_hit-to>
              <Hsp_query-frame>-1</Hsp_query-frame>
              <Hsp_identity>26</Hsp_identity>
              <Hsp_positive>45</Hsp_positive>
              <Hsp_gaps>2</Hsp_gaps>
              <Hsp_align-len>73</Hsp_align-len>
              <Hsp_qseq>MHIIGDVE--GRTCILVDDMVDTAGTLCHAAKALKERGAAKVYAYCTHPVLSGRAIENIENSVLDELVVTNTI</Hsp_qseq>
              <Hsp_hseq>MRILDDVDLTDKTVMILDDICDGGRTFVEAAKHLREAGAKRVELYVTHGIFS-KDVENLLDNGIDHIYTTNSL</Hsp_hseq>
              <Hsp_midline>M I+ DV+   +T +++DD+ D   T   AAK L+E GA +V  Y TH + S + +EN+ ++ +D +  TN++</Hsp_midline>
            </Hsp>
          </Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>2</Hit_num>
          <Hit_id>gnl|BL_ORD_ID|2466</Hit_id>
          <Hit_def>gi|51557505|ref|YP_068339.1| large tegument protein [Suid herpesvirus 1]</Hit_def>
          <Hit_accession>2466</Hit_accession>
          <Hit_len>3084</Hit_len>
          <Hit_hsps>
            <Hsp>
              <Hsp_num>1</Hsp_num>
              <Hsp_bit-score>48.9062</Hsp_bit-score>
              <Hsp_score>115</Hsp_score>
              <Hsp_evalue>6.70494e-06</Hsp_evalue>
              <Hsp_query-from>369</Hsp_query-from>
              <Hsp_query-to>875</Hsp_query-to>
              <Hsp_hit-from>2312</Hsp_hit-from>
              <Hsp_hit-to>2465</Hsp_hit-to>
              <Hsp_query-frame>-2</Hsp_query-frame>
              <Hsp_identity>52</Hsp_identity>
              <Hsp_positive>70</Hsp_positive>
              <Hsp_gaps>4</Hsp_gaps>
              <Hsp_align-len>173</Hsp_align-len>
          <Hsp_qseq>APESQEPGASTWRSSTSVVKKGQPSQK*CTSSVTSKAVPASWSTTWSTLPAPCATPPKR*KSAAPPRSTPTAPTRCCPAAPSRTSRIPSWTSWWSPTPSRCPLRRSPARVFASSTSPR-SSPKRSAASATKNRSAP---CSAKRNWPDHTAPPRAGLFALPPEAGRKPQGGLV</Hsp_qseq>
          <Hsp_hseq>APPAQKPPAQPATAAATTAPKATPQTQPPTRAQTQTAPPPPSAAT-----AAAQVPPQ------PPSSQPAAKPRGAPPAPPAPP--PPSAQTTLPRPAAPPAPPPPS---AQTTLPRPAPPPPSAPAATPTPPAPGPAPSAKKSDGDRIVEPKAG---APPDVRDAKFGGKV</Hsp_hseq>
          <Hsp_midline>AP +Q+P A    ++ +   K  P  +  T + T  A P   + T     A    PP+      PP S P A  R  P AP      P       P P+  P    P+   A +T PR + P  SA +AT    AP    SAK++  D    P+AG    PP+      GG V</Hsp_midline>
        </Hsp>
      </Hit_hsps>
    </Hit>
  </Iteration_hits>
  <Iteration_stat>
    <Statistics>
      <Statistics_db-num>68007</Statistics_db-num>
      <Statistics_db-len>19518578</Statistics_db-len>
      <Statistics_hsp-len>0</Statistics_hsp-len>
      <Statistics_eff-space>0</Statistics_eff-space>
      <Statistics_kappa>0.041</Statistics_kappa>
      <Statistics_lambda>0.267</Statistics_lambda>
      <Statistics_entropy>0.14</Statistics_entropy>
    </Statistics>
  </Iteration_stat>
</Iteration>

basically each query search creates an Iteration element. each iteration can have multiple hit which in turn can have multiple HSPs. I would like to get only the first hit and it's first HSP from each iteration. if the BLAST found no hits I would like to ignore the iteration. I worked up this simple code:

#!/usr/bin/env python
from elementtree.ElementTree import parse
from elementtree import ElementTree as ET
file = open("/Applications/blast/blanes_viral_nr_results.xml", "r")
save_file = open("/Applications/blast/Blast_parse_ET.txt", 'w')
tree = parse(file)
elem = tree.getroot()
print elem
Per_ID = ()

save_file.write('>%s\t%s\t%s\t%s\t%s\t%s\t\n\n\n\n' % ("It_Num\t", "It_ID\t", "Hit_Def\t", "Num\t", "ID\t", "ACC\t"))
iteration = tree.findall('BlastOutput_iterations/Iteration')
for iteration in iteration:
   for hit in iteration.findall('Iteration_hits/Hit'):
  It_Num = iteration.findtext('Iteration_iter-num')
  It_ID = iteration.findtext('Iteration_query-def')
  Hit_Def = hit.findtext('Hit_def')
  Num =  hit.findtext('Hit_num')
  ID = hit.findtext('Hit_id')
  DEF =  hit.findtext('Hit_def')
  ACC = hit.findtext('Hit_accession')
  save_file.write('>%s\t%s\t%s\t%s\t%s\t%s\t' % (It_Num, It_ID[12:26], Hit_Def[1:10], Num, ID, ACC,))
  for hsp in hit.findall('Hit_hsps'):
        HSPN = hsp.findtext('Hsp/Hsp_num')
        identities = hsp.findtext('Hsp/Hsp_identity')
        #print 'id: ', identities.rjust(4),
        length = hsp.findtext('Hsp/Hsp_align-len')
        #print 'len:', length.rjust(4),
        Per_ID = int(identities) * 100.0 / int(length)
        #print hsp.findtext('Hsp/Hsp_qseq')[:50]
        #print hsp.findtext('Hsp/Hsp_midline')[:50]
        #print hsp.findtext('Hsp/Hsp_hseq')[:50]
        save_file.write('%s\t%s\t%s\%st\n' % ('***', '%', HSPN, Per_ID))
  save_file.write('n\n' % ())

any help would be greatly appriciated!

3

There are 3 best solutions below

2
On BEST ANSWER

While building your own parser may be "fun" there is already a package out there that can parse BLAST xml files ... it can even do the intermediate calling of a local BLAST instance for you if you so desire.

The main site is here: http://biopython.org/wiki/Biopython

and the XML BLAST parser is here: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc82

Something like:

from Bio.Blast import NCBIXML
with open('xml/results/file') as handle:
    all_records = NCBIXML.parse(handle)
    first_record = all_records.next()

Should work. I generally love the BioPython parsers and writers but I do not like the class-structure organization. So I generally just use the parser and extract the info I need into my own structure. YMMV

Hope that helps.

0
On

If I understand your basic requirement then you want to get the top hit/HSP of the query protein/nucleotide sequence.Why dont you install the standalone blast on your system with formatted nr/nt database. Type the options

blastall -p {blast programme blastp for protein,blastn for nucleotide} -d {database} -i {input query} -v 1{for top hit} -b 1{alignment of the top hit with query} -m 7{xml blast output} -o example.xml

open the xml output file in MS excel where you can see a tabulated form of the blastoutput with top single hit to each query sequence

0
On

I'll second what JudoWill suggested - work smarter, not harder with the BioPython parser. This should get you a little further:

from Bio.Blast import NCBIXML
blast = NCBIXML.parse(open('results.xml','rU'))
for record in blast:
    if record.alignments:
        # to print the "best" matches e-score
        print record.alignments[0].hsps[0].expect
        # to print the "best" matches bit-score
        print record.alignments[0].hsps[0].score
        break

This will stop after the first Query (returning the first and best match). Since I suspect you may want results for other queries within the same file, just remove the break from the last line.