I am querying PubMed with a long list of PMIDs using R. Because entrez_fetch can only handle a certain number at a time, I have broken my ~2000 PMIDs into one list of several vectors (each about 500 in length). When I query PubMed, I extract information from the XML returned for each publication. What I would like to have in the end is something like this:
Original.PMID Publication.type
26956987 Journal.article
26956987 Meta.analysis
26956987 Multicenter.study
26402000 Journal.article
25404043 Journal.article
25404043 Meta.analysis
Each publication has a unique PMID, but there may be several publication types associated with each PMID (as seen above). I can get the PMID from the XML file, and I can get the publication types of each PMID. What I have problems with is repeating the PMID x number of times so that each PMID is paired with each of its publication types. I am able to do this if my data is not in a list with multiple sublists (e.g., if I have 14 batches, each as its own data frame) by getting the number of child nodes from the parent PublicationTypeList node. But I can't figure out how to do this within a list.
My code so far is this:
library(rvest)
library(tidyverse)
library(stringr)
library(regexr)
library(rentrez)
library(XML)
pubmed.fwd<-my.data.frame
into.batches<-function(x,n) split(x,cut(seq_along(x),n,labels=FALSE))
batches<-into.batches(pubmed.fwd$PMID, 14)
headings<-lapply(1:14, function(x) {paste0("Batch",x)})
names(batches)<-headings
fwd<-sapply(batches, function(x) entrez_fetch(db="pubmed", id=x, rettype="xml", parsed=TRUE))
trial1<-lapply(fwd, function(x)
list(pub.type = xpathSApply(x, "//PublicationTypeList/PublicationType", xmlValue),
or.pmid = xpathSApply(x, "//ArticleId[@IdType='pubmed']", xmlValue)))
trial1 is what I am having problems with. It gives me a list where, within each batch, I have a vector for pub.type and a vector for or.pmid, but they are different lengths.
I am trying to figure out how many child publication types there are for each publication, so I can repeat the PMID that many times. I am currently using the following code, which does not do what I want:
trial1<-lapply(fwd, function(x)
list(childnodes = xpathSApply(xmlRoot(x), "count(.//PublicationTypeList/PublicationType)", xmlChildren)))
Unfortunately, this just tells me the total number of children nodes for each batch, not for each publication (or pmid).
I would split the XML results into separate Article nodes and apply xpath functions to get pmids and pubtypes.
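A minimal sketch of that idea, using the XML package on a small hand-written sample that only mimics the shape of PubMed efetch output (it is not a real download). The helper name pubtypes_long is made up for illustration, and the sketch assumes every PubmedArticle has exactly one MedlineCitation/PMID:

```r
library(XML)

# Hand-written sample shaped like PubMed efetch XML (illustrative only).
xml_text <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>26956987</PMID>
      <Article>
        <PublicationTypeList>
          <PublicationType>Journal Article</PublicationType>
          <PublicationType>Meta-Analysis</PublicationType>
          <PublicationType>Multicenter Study</PublicationType>
        </PublicationTypeList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>26402000</PMID>
      <Article>
        <PublicationTypeList>
          <PublicationType>Journal Article</PublicationType>
        </PublicationTypeList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'
doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: one row per (PMID, publication type) pair.
pubtypes_long <- function(doc) {
  # Split the document into one node per article.
  articles <- getNodeSet(doc, "//PubmedArticle")
  rows <- lapply(articles, function(a) {
    # Paths start with "." so they are evaluated relative to this article.
    pmid  <- xpathSApply(a, "./MedlineCitation/PMID", xmlValue)
    types <- xpathSApply(a, ".//PublicationTypeList/PublicationType", xmlValue)
    # Keep articles that have no publication type as a single NA row.
    if (length(types) == 0) types <- NA_character_
    # data.frame() recycles the single PMID to the length of types.
    data.frame(Original.PMID = pmid, Publication.type = types,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

result <- pubtypes_long(doc)
print(result)
```

For a list of parsed batches such as fwd from the question, something like do.call(rbind, lapply(fwd, pubtypes_long)) should stack the per-batch results into one data frame.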
Also, NCBI says to use the HTTP POST method if using more than 200 UIDs. rentrez does not support POSTing, but you can run that with a few lines of code.
First, you need a vector with 1000s of PubMed IDs (6171 from the microbial genome table).
Post the ids to NCBI using httr POST.
Parse the results following the code in entrez_post to get the web history.
Finally, fetch up to 10K records (or loop through using the retstart option if you have more than 10K).
These only take a second to run on my laptop.
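The POST, parse, and fetch steps could look roughly like the sketch below. The function names parse_epost, post_ids, and fetch_all are made up for illustration. It assumes the ePost reply is an ePostResult element with QueryKey and WebEnv children, and that entrez_fetch accepts a plain list with those two fields and class "web_history" (which is how entrez_post builds its result). The network functions are only defined here, not called; just the parser is exercised, on a hand-written reply.

```r
library(XML)

# Hypothetical helper: pull QueryKey and WebEnv out of an ePost XML reply
# and wrap them in a list shaped like a rentrez web_history object.
parse_epost <- function(xml_text) {
  doc <- xmlParse(xml_text, asText = TRUE)
  wh <- list(
    QueryKey = xpathSApply(doc, "/ePostResult/QueryKey", xmlValue),
    WebEnv   = xpathSApply(doc, "/ePostResult/WebEnv", xmlValue)
  )
  class(wh) <- c("web_history", "list")
  wh
}

# Hypothetical helper: POST a vector of ids to NCBI ePost (needs network).
post_ids <- function(ids, db = "pubmed") {
  resp <- httr::POST(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi",
    body = list(db = db, id = paste(ids, collapse = ",")),
    encode = "form"
  )
  httr::stop_for_status(resp)
  parse_epost(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Hypothetical helper: fetch `total` posted records in chunks of `step`
# by moving retstart forward (needs network).
fetch_all <- function(wh, total, step = 10000) {
  lapply(seq(0, total - 1, by = step), function(start) {
    rentrez::entrez_fetch(db = "pubmed", web_history = wh, rettype = "xml",
                          retstart = start, retmax = step, parsed = TRUE)
  })
}

# Offline check of the parser with a hand-written reply (placeholder values).
sample_reply <- "<ePostResult><QueryKey>1</QueryKey><WebEnv>EXAMPLE_WEBENV</WebEnv></ePostResult>"
wh <- parse_epost(sample_reply)
str(wh)
```

With a real id vector, usage would be along the lines of wh <- post_ids(ids) followed by fetch_all(wh, length(ids)), which returns one parsed XML document per chunk.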
And one last comment. If you want one row per article, I usually create a function that combines multiple tags into a delimited list and adds NAs for missing nodes.
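Such a function might look like the sketch below. The name collapse_tag, the "; " separator, and the sample XML (with placeholder PMIDs 1 and 2) are all made up for illustration; the second article deliberately has no publication types so the NA branch is visible.

```r
library(XML)

# Hypothetical helper: join all values matched by `path` under `node`
# into one delimited string, or return NA when nothing matches.
collapse_tag <- function(node, path, sep = "; ") {
  vals <- xpathSApply(node, path, xmlValue)
  if (length(vals) == 0) NA_character_ else paste(vals, collapse = sep)
}

# Hand-written sample; the second article has no PublicationTypeList.
xml_text <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <PublicationTypeList>
          <PublicationType>Journal Article</PublicationType>
          <PublicationType>Meta-Analysis</PublicationType>
        </PublicationTypeList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>2</PMID>
      <Article></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'
doc <- xmlParse(xml_text, asText = TRUE)
articles <- getNodeSet(doc, "//PubmedArticle")

# One row per article; multi-valued tags become a single delimited cell.
wide <- data.frame(
  pmid    = sapply(articles, collapse_tag, path = "./MedlineCitation/PMID"),
  pubtype = sapply(articles, collapse_tag, path = ".//PublicationType"),
  stringsAsFactors = FALSE
)
print(wide)
```

Because every call returns exactly one value per article, the columns always line up, which avoids the mismatched vector lengths described in the question.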