Differences between Biopython NCBI/Entrez esummary output and R package output

611 Views Asked by At

I am new to accessing Entrez through Biopython and a couple of R packages (rentrez and reutil). When accessing the 'nuccore' database with esummary, the output fields returned by Biopython are different than that returned by the R packages.

Python:

handle = Entrez.esearch(db='nuccore', term='183844[GPRJ]', retmax=75000)
record = Entrez.read(handle)
id_list = record["IdList"]
search_results = Entrez.read(Entrez.epost("nuccore", id=",".join(id_list), restart=1, retmax=10000))
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
handle1 = Entrez.esummary(db="nuccore", query_key=query_key, WebEnv=webenv)
record1 = Entrez.read(handle1)

The fields returned by Biopython are:

['AccessionVersion','Caption','Comment','CreateDate','Extra','Flags','Gi','Id', 'Item','Length','ReplacedBy','Status','TaxId','Title','UpdateDate']

R (reutil package):

trak <- esearch('183844[GPRJ]', "nuccore", usehistory=TRUE, retmax = 70000)
query_key <- 1
web_env <- "NCID_1_224566406_130.14.18.34_9001_1496371219_1582367639_0MetA0_S_MegaStore_F_1"
esum <- esummary(db="nuccore", querykey = query_key, webenv = web_env, retstart = 1, retmax = 10000)
gtrkr <- content(esum, "parsed")

While the fields returned by R packages reutil and rentrez are: esummary result with 31 items:

['uid', 'caption', 'title', 'extra', 'gi', 'createdate', 'updatedate', 'flags', 'taxid', 'slen', 'biomol', 'moltype', 'topology', 'sourcedb', 'segsetsize', 'projectid', 'genome', 'subtype', 'subname', 'assemblygi', 'assemblyacc', 'tech', 'completeness', 'geneticcode', 'strand', 'organism', 'strain', 'biosample', 'statistics', 'properties', 'oslt']

Thanks in advance.

2

There are 2 best solutions below

3
On

Coming to this late, but as a past contributor to biopython and maintainer of rentrez I feel I need to explain what is going on here.

Biopython is accessing "version 1.0" esummary records by default, and the R packages are fetching "version 2.0" records. There is a brief discussion about the differences between these records in the rentrez help page:

The NCBI offer two distinct formats for summary documents. Version 1.0 is a relatively limited summary of a database record based on a shared Document Type Definition. Version 1.0 summaries are only available as XML and are not available for some newer databases Version 2.0 summaries generally contain more information about a given record, but each database has its own distinct format. 2.0 summaries are available for records in all databases and as JSON and XML files. As of version 0.4, rentrez fetches version 2.0 summaries by default and uses JSON as the exchange format (as JSON object can be more easily converted into native R types). Existing scripts which relied on the structure and naming of the "Version 1.0" summary files can be updated by setting the new ‘version’ argument to "1.0".

And just to demonstrate changing this argument reproduces the results from Biopython.

> eg_gene <- entrez_search(db="nuccore", term='183844[GPRJ]', retmax=1)
> entrez_summary(db="nuccore", id=eg_gene$ids, version="1.0")
esummary result with 13 items:
 [1] Caption          Title            Extra            Gi              
 [5] CreateDate       UpdateDate       Flags            TaxId           
 [9] Length           Status           ReplacedBy       Comment         
[13] AccessionVersion
> entrez_summary(db="nuccore", id=eg_gene$ids)
esummary result with 31 items:
 [1] uid          caption      title        extra        gi          
 [6] createdate   updatedate   flags        taxid        slen        
[11] biomol       moltype      topology     sourcedb     segsetsize  
[16] projectid    genome       subtype      subname      assemblygi  
[21] assemblyacc  tech         completeness geneticcode  strand      
[26] organism     strain       biosample    statistics   properties  
[31] oslt 

Edit -- getting version 2.0 records with Biopython

handle = Entrez.esearch(db="nuccore", term="183844[GPRJ]", retmax=1)
record = Entrez.read(handle) 
handle_two = Entrez.esummary(db="nuccore", id=record["IdList"][0], version="2.0")
Entrez.read(handle_two, validate=False)

.

{'DocumentSummarySet': ListElement([ListElement(['NPMJ00000000', 'Salmonella enterica subsp. enterica serovar Johannesburg strain CFSAN059880, whole genome shotgun sequencing project', 'gi|1235597280|gb|NPMJ00000000.1|NPMJ01000000', '1235597280', '2017/08/22', '2017/08/22', '0', '913076', '48', 'genomic', 'dna', 'linear', 'insd', '0', '186035', '', 'strain|serovar|host|sub_species|country|isolation_source|collection_date|collected_by', 'CFSAN059880|Johannesburg|Bos taurus|enterica|Nigeria|cattle stool|2012|University of Ibadan', '0', '', 'wgs', '', '11', '', 'Salmonella enterica subsp. enterica serovar Johannesburg', 'CFSAN059880', [StringElement('', attributes={'count': '1', 'type': 'all'}), StringElement('', attributes={'count': '3500', 'type': 'blob_size'}), StringElement('', attributes={'count': '1', 'type': 'org'}), StringElement('', attributes={'count': '2', 'type': 'pub'}), StringElement('', attributes={'count': '1', 'subtype': 'unpublished', 'type': 'pub'}), StringElement('', attributes={'count': '1', 'source': 'all', 'type': 'all'}), StringElement('', attributes={'count': '3500', 'source': 'all', 'type': 'blob_size'}), StringElement('', attributes={'count': '1', 'source': 'all', 'type': 'org'}), StringElement('', attributes={'count': '2', 'source': 'all', 'type': 'pub'})], StringElement('1', attributes={'master': '1', 'na': '1'}), StringElement('NPMJ00000000.1', attributes={'indexed': 'yes'}), 'NPMJ00000000.1'], attributes={'uid': '1235597280'})], attributes={'status': 'OK'})}
0
On

To explain the Biopython example:

from Bio import Entrez
handle = Entrez.esearch(db='nuccore', term='183844[GPRJ]', retmax=75000)
record = Entrez.read(handle)
id_list = record["IdList"]
search_results = Entrez.read(Entrez.epost("nuccore", id=",".join(id_list), restart=1, retmax=10000))
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
handle1 = Entrez.esummary(db="nuccore", query_key=query_key, WebEnv=webenv)
record1 = Entrez.read(handle1)

Now this should confirm their are 1000 entries (matching retmax), and each has 15 fields:

print(len(record1))
for entry in record1:
    assert len(entry) == 15
print(record1[0])

That should give:

1000
{'Item': [], 'Id': '1102582672', 'Caption': 'MEKF00000000', 'Title': 'Salmonella enterica subsp. enterica serovar Sandiego strain CFSAN039537, whole genome shotgun sequencing project', 'Extra': 'gi|1102582672|gb|MEKF00000000.1|MEKF01000000[1102582672]', 'Gi': 1102582672, 'CreateDate': '2016/11/14', 'UpdateDate': '2017/07/11', 'Flags': 0, 'TaxId': 0, 'Length': 93, 'Status': 'live', 'ReplacedBy': '', 'Comment': '  ', 'AccessionVersion': 'MEKF00000000.1'}

As an aside, I'm not sure what the 'Item' empty list is from.

Let's check the actual raw XML for the first record using retmax=1

from Bio import Entrez
handle = Entrez.esearch(db='nuccore', term='183844[GPRJ]', retmax=1)
record = Entrez.read(handle)
id_list = record["IdList"]
search_results = Entrez.read(Entrez.epost("nuccore", id=",".join(id_list), restart=1, retmax=10000))
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
handle1 = Entrez.esummary(db="nuccore", query_key=query_key, WebEnv=webenv)
print(handle1.read())

This gives:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD esummary v1 20041029//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20041029/esummary-v1.dtd">
<eSummaryResult>
<DocSum>
    <Id>1102582672</Id>
    <Item Name="Caption" Type="String">MEKF00000000</Item>
    <Item Name="Title" Type="String">Salmonella enterica subsp. enterica serovar Sandiego strain CFSAN039537, whole genome shotgun sequencing project</Item>
    <Item Name="Extra" Type="String">gi|1102582672|gb|MEKF00000000.1|MEKF01000000[1102582672]</Item>
    <Item Name="Gi" Type="Integer">1102582672</Item>
    <Item Name="CreateDate" Type="String">2016/11/14</Item>
    <Item Name="UpdateDate" Type="String">2017/07/11</Item>
    <Item Name="Flags" Type="Integer">0</Item>
    <Item Name="TaxId" Type="Integer">0</Item>
    <Item Name="Length" Type="Integer">93</Item>
    <Item Name="Status" Type="String">live</Item>
    <Item Name="ReplacedBy" Type="String"></Item>
    <Item Name="Comment" Type="String"><![CDATA[  ]]></Item>
    <Item Name="AccessionVersion" Type="String">MEKF00000000.1</Item>
</DocSum>

</eSummaryResult>

i.e. The exact same fields Biopython's Entrez parser is giving you as keys (plus the Id and that Item empty list which puzzled me above).

Are you sure you are comparing like with like here?

Could you give a specific example accession where your R solution has extra data?