Sapply Stopping after the First Instance when Parsing XML


(Updated for completeness; thanks to hrbrmstr for pointing it out.)

I'm trying to extract some data from PubMed, and I've been following the example from here (relevant diagram here). A redacted version of my data looks like this:

<PubmedArticleSet>
   <PubmedArticle>
      <MedlineCitation Owner="NLM" Status="MEDLINE">
         <PMID Version="1">11841882</PMID>
         <Article PubModel="Print">
            <PublicationTypeList>
               <PublicationType UI="D002363">Case Reports</PublicationType>
               <PublicationType UI="D016428">Journal Article</PublicationType>
            </PublicationTypeList>
         </Article>
         <MeshHeadingList>
            <MeshHeading>
               <DescriptorName MajorTopicYN="N" UI="D016887">Cardiopulmonary Resuscitation</DescriptorName>
            </MeshHeading>
            <MeshHeading>
               <DescriptorName MajorTopicYN="N" UI="D006323">Heart Arrest</DescriptorName>
               <QualifierName MajorTopicYN="Y" UI="Q000188">drug therapy</QualifierName>
               <QualifierName MajorTopicYN="N" UI="Q000401">mortality</QualifierName>
               <QualifierName MajorTopicYN="N" UI="Q000628">therapy</QualifierName>
            </MeshHeading>
         </MeshHeadingList>
      </MedlineCitation>       
   </PubmedArticle>

   <PubmedArticle>
      <MedlineCitation Owner="NLM" Status="MEDLINE">
         <PMID Version="1">11841881</PMID>
         <Article PubModel="Print">
            <PublicationTypeList>
               <PublicationType UI="D016428">Journal Article</PublicationType>
            </PublicationTypeList>
         </Article>
         <MeshHeadingList>
            <MeshHeading>
               <DescriptorName MajorTopicYN="N" UI="D000368">Aged</DescriptorName>
            </MeshHeading>
            <MeshHeading>
               <DescriptorName MajorTopicYN="N" UI="D016887">Cardiopulmonary Resuscitation</DescriptorName>
            </MeshHeading>
         </MeshHeadingList>
      </MedlineCitation>
   </PubmedArticle>
</PubmedArticleSet>

So far, I've been able to extract the PublicationTypes cleanly using the following code (please first run the setup lines at the top of the code at the end of this post):

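# Build one data frame per MedlineCitation: the PMID paired with each PublicationType
utilAtype <- function(x){
        # The first child of MedlineCitation is the PMID element
        PMID <- xmlValue(x[[1]][[1]])
        PublicationType <- sapply(xmlChildren(x[["Article"]][["PublicationTypeList"]], omitNodeTypes = "XMLInternalTextNode"), xmlValue)
        data.frame(PMID = PMID, PublicationType = PublicationType, stringsAsFactors = FALSE)
}

# Apply to every MedlineCitation node, then stack the per-article data frames
PMIDAType <- xpathApply(hdisease, '//MedlineCitation', utilAtype)
PMIDAType <- do.call(rbind, PMIDAType)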

PMID     PublicationType
11841882 Case Reports
11841882 Journal Article
11841881 Journal Article

However, a similar approach on the MeshHeadings returns only the first heading for each article; sapply appears to skip the remaining subnodes, as shown below:

PMID     MHead
11841882 Cardiopulmonary Resuscitation
-Other entries for 11841882 missing-
11841881 Aged

I would appreciate it if anyone could explain why this happens. The way it's done in the sample suggests that this approach should work without issues. Please see the code below for reference.

require("XML")
xmlfile=xmlParse("file.xml", useInternalNodes = TRUE)
hdisease = xmlRoot(xmlfile)

utilMesh <- function(x){
        PMID <- xmlValue(x[[1]][[1]])
        MHead <- ifelse(is.null(x[["MeshHeadingList"]]), NA, 
                sapply(xmlChildren(x[["MeshHeadingList"]], omitNodeTypes = "XMLInternalTextNode"), function(z) xmlValue(z[["DescriptorName"]])))
        data.frame(PMID = PMID, MHead=MHead, stringsAsFactors = FALSE)
    }

PMIDMesh <- xpathApply(hdisease, '//MedlineCitation', utilMesh)
PMIDMesh<-do.call(rbind, PMIDMesh)

c<-nrow(PMIDMesh)
row.names(PMIDMesh) <- 1:c
nrow(table(PMIDMesh))

write.csv(PMIDMesh,"Mesh1.csv")

Best answer

I would use XPath instead, for example:

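library(XML)      # xmlParse, getNodeSet, xpathSApply, xmlValue
library(rentrez)  # entrez_fetch

# Fetch both records from PubMed as XML text and parse them
x <- entrez_fetch(db = "pubmed", id = c(11841882, 11841881), rettype = "xml")
doc <- xmlParse(x)
pubs <- getNodeSet(doc, "//PubmedArticle")

# One data frame per article; the single PMID is recycled across all of its MeSH descriptors
y <- lapply(pubs, function(x) data.frame(
     pmid = xpathSApply(x, ".//MedlineCitation/PMID", xmlValue),
     mesh = xpathSApply(x, ".//MeshHeading/DescriptorName", xmlValue),
     stringsAsFactors = FALSE))

# Stack the per-article data frames into one
do.call("rbind", y)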

       pmid                          mesh
1  11841882 Cardiopulmonary Resuscitation
2  11841882              Child, Preschool
3  11841882                        Female
4  11841882                  Heart Arrest
5  11841882                        Humans
6  11841882                        Infant
7  11841882                          Male
8  11841882         Retrospective Studies
9  11841882                  Time Factors
10 11841882        Vasoconstrictor Agents
11 11841882                  Vasopressins
12 11841881                          Aged
13 11841881 Cardiopulmonary Resuscitation
14 11841881         Electric Countershock
15 11841881               Family Practice
...
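
As a side note on why the original utilMesh kept only one heading per article: the likely cause is ifelse() rather than sapply(). ifelse() returns a result the same length as its test, and is.null(...) is a single TRUE or FALSE, so only the first element of the sapply() result is kept. The base R sketch below illustrates this with made-up stand-ins (the names headings and has_list are hypothetical, not from the question's code) and is not a drop-in replacement for utilMesh.

```r
# Hypothetical stand-in for what sapply() returns for one article
headings <- c("Cardiopulmonary Resuscitation", "Heart Arrest", "Humans")

# Hypothetical stand-in for !is.null(x[["MeshHeadingList"]])
has_list <- TRUE

# ifelse() is vectorised over its test: a length-1 test gives a length-1 result,
# so only the first heading is kept
truncated <- ifelse(!has_list, NA, headings)

# A plain if/else evaluates one branch and returns it whole
full <- if (!has_list) NA else headings

print(truncated)
print(length(full))
```

Under that reading, replacing ifelse(...) with if (...) NA else ... inside utilMesh should return every DescriptorName, although the XPath approach above avoids the problem altogether.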