I have to parse an XML-TEI document similar to this:
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader> ... </teiHeader>
<text>
<body>
<head rend="Body A">DOCUMENT_TITLE</head>
<div rend="entry">
<head rend="time">TIME_INFORMATION</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="content">
<hi rend="italic"> CONTENT </hi>
</p>
</div>
I would like to extract the TIME_INFORMATION string. Can someone give some advice, please?
I can easily get the CONTENT with the command
doc <- htmlParse(paste0(mypath,xml_path))
content<-unlist(xpathSApply(doc,'//p[@rend="content"]//hi[@rend="italic"]',xmlValue))
But how can I extract only the TIME_INFORMATION? I tried
doc <- htmlParse(paste0(mypath,xml_path))
content<-unlist(xpathSApply(doc,'//div//head[@rend="time"]',xmlValue))
but I obtain an empty list. The only way I have found is to get the whole body of each "div" entry and parse the dates with regular expressions, but I would like to avoid this.
Anyway, while doing this I noticed that when I run:
div <- getNodeSet(doc2, '//div')
I obtain a structure slightly different from the original XML: it seems that the head tag has disappeared. For instance, div[1] is:
> div[1]
[[1]]
<div rend="entry">
London, June 19th, 1854.
<p rend="Body A">Oxford, University Archive </p>
<p rend="content">
<hi rend="italic"> CONTENT </hi></p>
</div>
1) The XML is malformed (missing ending tags). Also, since a namespace is used (see the second line of the input, the line beginning with TEI), we must refer to it in the XPath expression.
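A minimal sketch of this first approach, assuming the xml2 package is installed. The variable tei_text is hypothetical: it holds a copy of the question's sample with the missing closing tags added so that it parses. xml2 assigns the prefix d1 to the default namespace, and that prefix is what the XPath expression refers to.

```r
library(xml2)

# Hypothetical input: the question's sample with the closing tags added
# (</body>, </text>, </TEI>) so that it is well-formed.
tei_text <- '<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader> ... </teiHeader>
<text>
<body>
<head rend="Body A">DOCUMENT_TITLE</head>
<div rend="entry">
<head rend="time">TIME_INFORMATION</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="content">
<hi rend="italic"> CONTENT </hi>
</p>
</div>
</body>
</text>
</TEI>'

doc <- read_xml(tei_text)

# xml_ns() lists the namespaces; the default one is exposed as "d1".
ns <- xml_ns(doc)

# Select every <head rend="time"> element, qualified with the d1 prefix.
time_nodes <- xml_find_all(doc, "//d1:head[@rend='time']", ns)
time_info <- xml_text(time_nodes)
print(time_info)
```

The same prefix would be needed for any other element in the expression, for example //d1:div/d1:head[@rend='time'].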
The Lines variable is taken from the question without modification and is also shown in the Note at the end.
2) Another approach is to use xml_ns_strip to strip out the namespaces so that we can avoid dealing with them, using doc from above.
Note
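For the namespace-stripping approach, a self-contained sketch could look like the following. It assumes the xml2 package, and tei_small is a hypothetical, shortened and well-formed stand-in for the question's input, not the original data.

```r
library(xml2)

# Hypothetical, shortened and well-formed stand-in for the question's input.
tei_small <- '<TEI xmlns="http://www.tei-c.org/ns/1.0">
<text>
<body>
<head rend="Body A">DOCUMENT_TITLE</head>
<div rend="entry">
<head rend="time">TIME_INFORMATION</head>
<p rend="Body A"> INFORMATION A</p>
</div>
</body>
</text>
</TEI>'

doc_stripped <- read_xml(tei_small)

# xml_ns_strip() removes the default namespace in place,
# so plain element names work in XPath afterwards.
xml_ns_strip(doc_stripped)

time_stripped <- xml_text(xml_find_all(doc_stripped, "//head[@rend='time']"))
print(time_stripped)
```

Because the document is modified in place, this suits cases where the namespace information is not needed later on.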
Update
Have improved solution.