Extracting viral host from Genbank record or Entrez query


I would like to be able to see the viral host organism for a number of GenBank records. I have tried downloading full GenBank files and reading them with Bio.SeqIO.read(), and I have also tried querying the database through Entrez.efetch. This is an example using only one ID:

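$ pip install biopython

from Bio import Entrez, SeqIO

# NCBI asks callers to identify themselves; this address is a placeholder.
Entrez.email = 'you@example.com'

accession = 'CY238774.1'  # avoid shadowing the built-in id()
handle = Entrez.efetch(db='nucleotide', id=accession, rettype='gb', retmode='text')
record = SeqIO.read(handle, 'gb')
handle.close()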

When I look up this record on NCBI through the web browser, I can see that the record says 'host=Homo sapiens'. This text is also present in the downloaded .gb file. However, I cannot find this information anywhere in the SeqRecord object created above. It appears this information is being lost when the SeqRecord is created. I have checked all the class attributes.

Is there a way to extract this information from the SeqRecord?


There is 1 best solution below

J_H On

When I look up this id record on NCBI ...

Let's review that record:

$ curl -i 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=CY238774.1&rettype=gb&retmode=text&tool=biopython'
...
content-type: text/plain
...
FEATURES             Location/Qualifiers
     source          1..1002
                     /organism="Influenza A virus (A/Washington/27/2017(H3N2))"
                     /mol_type="viral cRNA"
                     /strain="A/Washington/27/2017"
                     /serotype="H3N2"
                     /host="Homo sapiens"
...

So we're looking for Features --> Source --> Host.

Now let's switch to the API. It turns out that the very first feature that came back has type 'source'.

>>> from pprint import pp
>>> 
>>> record.features[0]
SeqFeature(SimpleLocation(ExactPosition(0), ExactPosition(1002), strand=1), type='source', qualifiers=...)
>>> 
>>> record.features[0].type
'source'
>>> 
>>> pp(record.features[0].qualifiers)
{'organism': ['Influenza A virus (A/Washington/27/2017(H3N2))'],
 'mol_type': ['viral cRNA'],
 'strain': ['A/Washington/27/2017'],
 'serotype': ['H3N2'],
 'host': ['Homo sapiens'],
 'db_xref': ['taxon:1984973'],
 'segment': ['7'],
 'country': ['USA: Washington'],
 'collection_date': ['01-Mar-2017'],
 'note': ['passage details: S2 (2017-03-26)']}
>>> 
>>> record.features[0].qualifiers['host'][0]
'Homo sapiens'

Ta da! It was lurking within.
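
Since the question mentions a number of records, here is a minimal sketch of how the lookup could be wrapped up. The names get_host and hosts_by_id are my own and are not part of Biopython. The sketch assumes the host lives on the feature whose type is 'source', and it does not rely on that feature being first. Not every record carries a /host qualifier, so it returns None in that case rather than raising.

```python
def get_host(record):
    """Return the /host qualifier of a record's source feature, or None.

    `record` is expected to look like a Bio.SeqRecord.SeqRecord: it needs a
    `features` list whose items have `type` and `qualifiers` attributes.
    (get_host is a hypothetical helper name, not a Biopython function.)
    """
    for feature in record.features:
        if feature.type == "source":
            # Qualifier values are lists of strings; /host is optional.
            hosts = feature.qualifiers.get("host")
            if hosts:
                return hosts[0]
    return None


def hosts_by_id(records):
    """Map each record's id to its host (None when no /host is annotated)."""
    return {record.id: get_host(record) for record in records}
```

With the record from the question, get_host(record) should give the same 'Homo sapiens' value shown above. For several accessions, one option is to pass a comma-separated string of IDs to Entrez.efetch, iterate over the result with SeqIO.parse(handle, 'gb'), and hand that iterator to hosts_by_id.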