Downloading DNA sequence data in R using entrez_fetch: cannot retrieve query


I'm trying to download DNA sequence data from NCBI using entrez_fetch. With the following code, I perform a search for the IDs of the sequences I need with entrez_search, and then I attempt to download the sequence data in FASTA format:

library(rentrez)
#Search for sequence ids
search <- entrez_search(db = "biosample", 
                        term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
                        retmax = 9999, use_history = T)

search$ids
length(search$ids)
search$web_history

#Download sequence data
ecoli_fasta <- entrez_fetch(db = "nuccore",
                            web_history = search$web_history,
                            rettype = "fasta")

When I do this, I get the following error:

Error: HTTP failure: 400
Cannot+retrieve+query+from+history

I don't understand what this means and Googling hasn't led me to an answer.

I tried using a different package (ape) and the function read.GenBank to download the sequences as an alternative, but this method only managed to download about 1000 of the 12000 sequences I needed. I would like to use entrez_fetch if possible. Does anyone have any insight for me?


There is 1 solution below


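This may be a starting point. Your search ran against biosample, but the fetch asks nuccore to use that web history. A history entry refers to a result set in the database that was searched, which is most likely why the fetch fails with "Cannot retrieve query from history". Searching nuccore directly avoids the mismatch.

Also be aware that queries to sequence databases can return massive amounts of data, so limit how many records you request per call with retmax.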

Build search web history

library(rentrez)

search <- entrez_search(db="nuccore", 
                        term="Escherichia coli[Organism]", 
                        use_history = TRUE)
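
The search above drops the BioSample location filter from the question. If that filter is needed, one possible route is to search biosample first and then carry the hits over to nuccore with entrez_link. The following is only a sketch: the helper name biosample_to_nuccore_history is hypothetical, and it assumes that NCBI exposes a link set named biosample_nuccore (entrez_db_links("biosample") lists what is actually available).

```r
# Hypothetical helper: search BioSample, then link the hits to nuccore.
# Calls are namespace-qualified, so library(rentrez) is not required here.
biosample_to_nuccore_history <- function(term) {
  # Store the BioSample hits on NCBI's history server
  bs <- rentrez::entrez_search(db = "biosample", term = term,
                               use_history = TRUE)
  # Ask elink to post the linked nuccore records to the history server too
  links <- rentrez::entrez_link(dbfrom = "biosample", db = "nuccore",
                                web_history = bs$web_history,
                                cmd = "neighbor_history")
  # Assumption: the link set is named "biosample_nuccore"
  links$web_histories$biosample_nuccore
}

# Usage (needs network access, not run here):
# wh <- biosample_to_nuccore_history(
#   "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]")
# entrez_fetch(db = "nuccore", web_history = wh, rettype = "fasta", retmax = 10)
```

The returned object can be passed as web_history to entrez_fetch with db = "nuccore", because it now points at nuccore records rather than BioSample records.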

Use web history to fetch data

cat(entrez_fetch(db="nuccore", 
  web_history=search$web_history, rettype="fasta",  retstart=24, retmax=100))
>pdb|7QQ3|I Chain I, 23S ribosomal RNA
NGTTAAGCGACTAAGCGTACACGGTGGATGCCCTGGCAGTCAGAGGCGATGAAGGACGTGCTAATCTGCG
ATAAGCGTCGGTAAGGTGATATGAACCGTTATAACCGGCGATTTCCGAATGGGGAAACCCAGTGTGTTTC
GACACACTATCATTAACTGAATCCATAGGTTAATGAGGCGAACCGGGGGAACTGAAACATCTAAGTACCC
CGAGGAAAAGAAATCAACCGAGATTCCCCCAGTAGCGGCGAGCGAACGGGGAGCAGCCCAGAGCCTGAAT
CAGTGTGTGTGTTAGTGGAAGCGTCTGGAAAGGCGCGCGATACAGGGTGACAGCCCCGTACACAAAAATG
CACATGCTGTGAGCTCGATGAGTAGGGCGGGACACGTGGTATCCTGTCTGAATATGGGGGGACCATCCTC
CAAGGCTAAATACTCCTGACTGACCGATAGTGAACCAGTACCGTGAGGGAAAGGCGAAAAGAACCCCGGC
...

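Use a loop to cycle through the sequences in batches, e.g.:

# retstart is zero-based, so start at 0 to include the first record
for(i in seq(0, 200, 100)){
  # each pass prints records i to i + 99 (300 records in total)
  cat(entrez_fetch(db="nuccore", 
    web_history=search$web_history, rettype="fasta",  retstart=i, retmax=100))
}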
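
To keep the sequences rather than print them, the same batching idea can write each chunk to a file. This is a minimal sketch, not a tested pipeline: batch_starts, fetch_all_fasta and the file name ecoli.fasta are hypothetical names introduced here, and the batch size of 100 is arbitrary.

```r
# Offsets for paging through `total` records, `batch_size` at a time.
# retstart is zero-based, so the first batch starts at 0.
batch_starts <- function(total, batch_size = 100) {
  if (total <= 0) return(numeric(0))
  seq(0, total - 1, by = batch_size)
}

# Fetch `total` records behind a web history and append them to `path`.
# Calls are namespace-qualified, so library(rentrez) is not required here.
fetch_all_fasta <- function(web_history, total, path, batch_size = 100) {
  for (start in batch_starts(total, batch_size)) {
    chunk <- rentrez::entrez_fetch(db = "nuccore",
                                   web_history = web_history,
                                   rettype = "fasta",
                                   retstart = start,
                                   retmax = batch_size)
    # Append so that earlier batches are not overwritten
    cat(chunk, file = path, append = TRUE)
  }
  invisible(path)
}

# Usage (needs network access, not run here):
# fetch_all_fasta(search$web_history, total = 300, path = "ecoli.fasta")
```

The total is passed explicitly so you decide how much to download. search$count reports how many records matched, but for a broad term such as this one that number is far too large to fetch in full.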