Calculating number of xmlchildren under each parent node for a list in R


I am querying PubMed with a long list of PMIDs using R. Because entrez_fetch can only handle a limited number of IDs at a time, I have split my ~2000 PMIDs into one list of several vectors (each about 500 in length). When I query PubMed, I extract information from the XML returned for each publication. What I would like to have in the end is something like this:

    Original.PMID     Publication.type
    26956987          Journal.article
    26956987          Meta.analysis
    26956987          Multicenter.study
    26402000          Journal.article
    25404043          Journal.article
    25404043          Meta.analysis

Each publication has a unique PMID, but there may be several publication types associated with each PMID (as seen above). I can extract the PMID from the XML file, and I can get the publication types for each PMID. What I have problems with is repeating each PMID once per publication type, so that every publication type is paired with its PMID. I am able to do this if my data is not in a list with multiple sublists (e.g., if I have 14 batches, each as its own data frame) by getting the number of child nodes of the parent PublicationTypeList node. But I can't figure out how to do this within a list.

My code so far is this:

library(rvest)
library(tidyverse)
library(stringr)
library(regexr)
library(rentrez)
library(XML)

pubmed.fwd<-my.data.frame   # data frame with a PMID column

into.batches<-function(x,n) split(x,cut(seq_along(x),n,labels=FALSE))
batches<-into.batches(pubmed.fwd$PMID, 14)
headings<-lapply(1:14, function(x) {paste0("Batch",x)})
names(batches)<-headings
fwd<-sapply(batches, function(x) entrez_fetch(db="pubmed", id=x, rettype="xml", parsed=TRUE))
trial1<-lapply(fwd, function(x) 
  list(pub.type = xpathSApply(x, "//PublicationTypeList/PublicationType", xmlValue),
  or.pmid = xpathSApply(x, "//ArticleId[@IdType='pubmed']", xmlValue)))

trial1 is what I am having problems with. This gives me a list where within each Batch, I have a vector for pub.type and a vector for or.pmid but they're different lengths.

I am trying to figure out how many child publication types there are for each publication, so I can repeat the PMID that many times. I am currently using the following code, which does not do what I want:

trial1<-lapply(fwd, function(x) 
  list(childnodes = xpathSApply(xmlRoot(x), "count(.//PublicationTypeList/PublicationType)", xmlChildren)))

Unfortunately, this just tells me the total number of children nodes for each batch, not for each publication (or pmid).


There are 2 best solutions below

Chris S. On BEST ANSWER

I would split the XML results into separate Article nodes and apply xpath functions to get pmids and pubtypes.

library(rentrez)
library(XML)

pmids <- c(11677608, 22328765, 11337471)
res <- entrez_fetch(db="pubmed", rettype="xml", id = pmids)
doc <- xmlParse(res)
x <-  getNodeSet(doc, "//PubmedArticle")
x1 <- sapply(x, xpathSApply, ".//ArticleId[@IdType='pubmed']", xmlValue)
x2 <- sapply(x, xpathSApply, ".//PublicationType", xmlValue)
data.frame( pmid= rep(x1, sapply(x2, length) ), pubtype = unlist(x2) )
      pmid                          pubtype
1 11677608                  Journal Article
2 11677608 Research Support, Non-U.S. Gov't
3 22328765                  Journal Article
4 22328765 Research Support, Non-U.S. Gov't
5 11337471                  Journal Article
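
The same split-then-repeat idea can be tried offline and applied to a list of batches like the one in the question. This is only a minimal sketch: `pubtypes_long` and `fwd_demo` are hypothetical names, the two tiny XML documents and their PMIDs (111, 222, 333) are made up, and the PMID is read from `./MedlineCitation/PMID` rather than `ArticleId`, which may matter if a record also lists `ArticleId` elements for its cited references.

```r
library(XML)

# Two hand-written "batches" standing in for parsed entrez_fetch() results
batch1 <- xmlParse('<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>111</PMID><Article>
    <PublicationTypeList>
      <PublicationType>Journal Article</PublicationType>
      <PublicationType>Meta-Analysis</PublicationType>
    </PublicationTypeList></Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>222</PMID><Article>
    <PublicationTypeList>
      <PublicationType>Journal Article</PublicationType>
    </PublicationTypeList></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>', asText = TRUE)

batch2 <- xmlParse('<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>333</PMID><Article>
    <PublicationTypeList>
      <PublicationType>Multicenter Study</PublicationType>
    </PublicationTypeList></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>', asText = TRUE)

# Hypothetical helper: one row per (PMID, publication type) pair in a document
pubtypes_long <- function(doc) {
  articles <- getNodeSet(doc, "//PubmedArticle")
  # Exactly one PMID per article (assumes every article has one)
  pmid <- vapply(articles, function(a)
    xpathSApply(a, "./MedlineCitation/PMID", xmlValue)[1], character(1))
  # Zero or more publication types per article
  types <- lapply(articles, function(a)
    as.character(xpathSApply(a, ".//PublicationType", xmlValue)))
  # Repeat each PMID once per publication type it has
  data.frame(pmid = rep(pmid, times = lengths(types)),
             pubtype = unlist(types),
             stringsAsFactors = FALSE)
}

# Apply per batch, then stack the per-batch data frames
fwd_demo <- list(Batch1 = batch1, Batch2 = batch2)
result <- do.call(rbind, lapply(fwd_demo, pubtypes_long))
rownames(result) <- NULL
```

With the real data, `lapply(fwd, pubtypes_long)` would play the role of `trial1` in the question, provided each element of `fwd` is a parsed XML document.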

Also, NCBI says to use the HTTP POST method when sending more than 200 UIDs. rentrez does not support POSTing, but you can do that yourself with a few lines of code.

First, you need a vector with 1000s of Pubmed IDs (6171 from the microbial genome table)

library(readr)
x <- read_tsv( "ftp://ftp.ncbi.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt", 
                na = "-", quote = "")
ids <- unique( x$`Pubmed ID` )
ids <- ids[ids < 1e9 & !is.na(ids)]

Post the ids to NCBI using httr POST.

uri = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?"
response <- httr::POST(uri, body= list(id = paste(ids, collapse=","), db = "pubmed"))

Parse the results following the code in entrez_post to get the web history.

 doc  <-   xmlParse( httr::content(response, as="text", encoding="UTF-8") )
 result <- xpathApply(doc, "/ePostResult/*", xmlValue)
 names(result) <- c("QueryKey", "WebEnv")
 class(result) <- c("web_history", "list")

Finally, fetch up to 10K records (or loop through using the retstart option if you have more than 10K)

res <- entrez_fetch(db="pubmed", rettype="xml", web_history=result)
doc <- xmlParse(res)
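
For more than 10K records, the retstart loop mentioned above might look roughly like the sketch below. The helper names `chunk_starts` and `fetch_in_chunks` are hypothetical, and it assumes that `entrez_fetch` forwards `retstart` and `retmax` to the E-utilities API through its `...` argument and that you know the total number of posted IDs (e.g. `length(ids)`). It needs the rentrez and XML packages installed.

```r
# Hypothetical helper: 0-based offsets for paging through `total` records
chunk_starts <- function(total, chunk_size = 10000) {
  seq(0, total - 1, by = chunk_size)
}

# Hypothetical helper: fetch everything behind a web_history object, one
# chunk at a time, returning a list of parsed XML documents
fetch_in_chunks <- function(web_history, total, chunk_size = 10000) {
  lapply(chunk_starts(total, chunk_size), function(start) {
    res <- rentrez::entrez_fetch(db = "pubmed", rettype = "xml",
                                 web_history = web_history,
                                 retstart = start, retmax = chunk_size)
    XML::xmlParse(res)
  })
}

# Usage (requires network access, so not run here):
# docs <- fetch_in_chunks(result, total = length(ids))
```

Each element of the returned list can then be processed like `doc`, and the per-chunk data frames combined with `do.call(rbind, ...)`.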

These only take a second to run on my laptop.

articles <- getNodeSet(doc, "//PubmedArticle")
x1 <- sapply(articles, xpathSApply, ".//ArticleId[@IdType='pubmed']", xmlValue)
x2 <- sapply(articles, xpathSApply, ".//PublicationType", xmlValue)

# tibble() replaces the deprecated data_frame()
tibble::tibble( pmid= rep(x1, sapply(x2, length) ), pubtype = unlist(x2) )
# A tibble: 9,885 × 2
       pmid                                  pubtype
      <chr>                                    <chr>
 1 11677608                          Journal Article
 2 11677608         Research Support, Non-U.S. Gov't
 3 12950922                          Journal Article
 4 12950922         Research Support, Non-U.S. Gov't
 5 22328765                          Journal Article
 ...

And one last comment. If you want one row per article, I usually create a function that combines multiple tags into a delimited list and adds NAs for missing nodes.

xpath2 <-function(x, ...){
    y <- xpathSApply(x, ...)
    ifelse(length(y) == 0, NA,  paste(y, collapse="; "))
}

# tibble() replaces the deprecated data_frame()
tibble::tibble( pmid = sapply(articles, xpath2, ".//ArticleId[@IdType='pubmed']", xmlValue),
                journal = sapply(articles, xpath2, ".//Journal/Title", xmlValue),
                pubtypes = sapply(articles, xpath2, ".//PublicationType", xmlValue))

# A tibble: 6,172 × 3
      pmid                 journal                                          pubtypes
     <chr>                   <chr>                                             <chr>
1 11677608                  Nature Journal Article; Research Support, Non-U.S. Gov't
2 12950922  Molecular microbiology Journal Article; Research Support, Non-U.S. Gov't
3 22328765 Journal of bacteriology Journal Article; Research Support, Non-U.S. Gov't
4 11337471         Genome research                                   Journal Article
...
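
To see the NA handling in isolation, here is a small offline sketch of the same one-row-per-article idea. `xpath_collapse` is a hypothetical variant of `xpath2` that uses a plain `if`/`else` (the test is a single value, so `ifelse` is not needed) and returns `NA_character_` so the column type stays character. The XML and PMIDs are made up, and the second article deliberately has no journal title.

```r
library(XML)

# Hypothetical variant of xpath2: collapse all matches, NA if none
xpath_collapse <- function(node, path, sep = "; ") {
  vals <- as.character(xpathSApply(node, path, xmlValue))
  if (length(vals) == 0) NA_character_ else paste(vals, collapse = sep)
}

demo_doc <- xmlParse('<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>111</PMID><Article>
    <Journal><Title>Nature</Title></Journal>
    <PublicationTypeList>
      <PublicationType>Journal Article</PublicationType>
      <PublicationType>Review</PublicationType>
    </PublicationTypeList></Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>222</PMID><Article>
    <PublicationTypeList>
      <PublicationType>Journal Article</PublicationType>
    </PublicationTypeList></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>', asText = TRUE)

arts <- getNodeSet(demo_doc, "//PubmedArticle")

# One row per article; missing nodes become NA instead of breaking the frame
one_row <- data.frame(
  pmid     = vapply(arts, xpath_collapse, character(1), path = "./MedlineCitation/PMID"),
  journal  = vapply(arts, xpath_collapse, character(1), path = ".//Journal/Title"),
  pubtypes = vapply(arts, xpath_collapse, character(1), path = ".//PublicationType"),
  stringsAsFactors = FALSE
)
```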
Parfait On

Since ArticleId is likely unique for each article, while an article may have more than one PublicationType, consider building data frames iteratively instead of separate vectors.

Specifically, index each PubmedArticle node of the XML doc by position, [i], since it is the shared ancestor of the ID and the publication types, then use XPath to reach the needed descendants. The code below creates a list of data frames of the same length as fwd:

trial1 <- lapply(fwd, function(doc) {
  # RETRIEVE NUMBER OF ARTICLES PER EACH XML
  num_of_articles <- length(xpathSApply(doc, "//PubmedArticle"))

  # LOOP THROUGH EACH ARTICLE AND BIND XML VALUES TO DATAFRAME
  dfList <- lapply(seq_len(num_of_articles), function(i)   # seq_len() is safe when there are 0 articles
    data.frame(
     Original.PMID = xpathSApply(doc, paste0("//PubmedArticle[",i,"]/descendant::ArticleId[@IdType='pubmed']"), xmlValue),
     Publication.type = xpathSApply(doc, paste0("//PubmedArticle[",i,"]/descendant::PublicationTypeList/PublicationType"), xmlValue)
  ))

  # ROW BIND ALL DFS INTO ONE
  df <- do.call(rbind, dfList)
})

For a final master data frame across all batches, run do.call(rbind, ...) again outside the loop:

finaldf <- do.call(rbind, trial1)