
XML (TEI document) parsing in R: how can I extract only the head?


I have to parse an XML-TEI document similar to this:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader> ... </teiHeader>
<text>
 <body>
  <head rend="Body A">DOCUMENT_TITLE</head>
    <div rend="entry">
    <head rend="time">TIME_INFORMATION</head>
    <p rend="Body A"> INFORMATION A</p>
    <p rend="content">
        <hi rend="italic"> CONTENT </hi>
            </p>
    </div>

I would like to extract the TIME_INFORMATION string. Can someone give some advice please?

I can easily get the CONTENT with the commands

doc <- htmlParse(paste0(mypath,xml_path))
content<-unlist(xpathSApply(doc,'//p[@rend="content"]//hi[@rend="italic"]',xmlValue))

But how can I extract only the TIME_INFORMATION? I tried

doc <- htmlParse(paste0(mypath,xml_path))
content<-unlist(xpathSApply(doc,'//div//head[@rend="time"]',xmlValue))

but I obtain an empty list. The only workaround I have found is to get the whole body of each entry "div" and parse the dates with regular expressions, which I would like to avoid.

While doing this, I noticed that when I run:

div <- getNodeSet(doc2, '//div')

I obtain a structure slightly different from the original XML: the <head> tag seems to have disappeared. For instance, div[1] is:

> div[1]
[[1]]
<div rend="entry">
        London, June 19th, 1854.
        <p rend="Body A">Oxford, University Archive </p>
        <p rend="content">
            <hi rend="italic"> CONTENT </hi></p>
    </div> 

G. Grothendieck

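1) The XML is malformed: the closing tags for body, text and TEI are missing, so we append them before parsing. Also, because the document declares a default namespace (the xmlns attribute on the TEI element in the second line of the input), the XPath expression must refer to it; xml_ns shows that xml2 gives it the prefix d1. Lines is taken from the question without modification and is also shown in the Note at the end.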

library(xml2)
doc <- Lines |> 
  paste("</body></text></TEI>") |>
  read_xml()

xml_ns(doc)
## d1 <-> http://www.tei-c.org/ns/1.0

xml_find_first(doc, 'normalize-space(//d1:head[@rend="time"])')
## [1] "TIME_INFORMATION"
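
A real file will usually hold many entries, so here is a minimal sketch of collecting every time heading at once. It assumes a well-formed document; the sample string tei, the object names and the prefix t are made up for illustration and are not from the question. Passing a named vector as the ns argument binds the TEI namespace to a prefix of our own choosing instead of relying on the generated d1.

```r
library(xml2)

# Hypothetical well-formed sample with two entries (not the asker's file)
tei <- '<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<text>
 <body>
  <head rend="Body A">DOCUMENT_TITLE</head>
  <div rend="entry">
    <head rend="time">TIME_A</head>
    <p rend="content"><hi rend="italic"> CONTENT A </hi></p>
  </div>
  <div rend="entry">
    <head rend="time">TIME_B</head>
    <p rend="content"><hi rend="italic"> CONTENT B </hi></p>
  </div>
 </body>
</text>
</TEI>'

doc_all <- read_xml(tei)

# Bind the TEI namespace URI to the prefix "t" (the prefix name is arbitrary)
ns <- c(t = "http://www.tei-c.org/ns/1.0")

# One <head rend="time"> node per entry, then the trimmed text of each node
time_nodes <- xml_find_all(doc_all, '//t:div/t:head[@rend="time"]', ns)
times <- xml_text(time_nodes, trim = TRUE)
times
## [1] "TIME_A" "TIME_B"
```

Restricting the path to t:div/t:head keeps the document title, which is a head directly under body, out of the result.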

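2) Another approach is to use xml_ns_strip to remove the namespaces so that we do not have to deal with them. Note that xml_ns_strip modifies doc in place, so the d1: prefix from 1) no longer applies afterwards. Using doc from above: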

doc |>
  xml_ns_strip() |>
  xml_find_first('normalize-space(//head[@rend="time"])')
## [1] "TIME_INFORMATION"
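
For anyone who wants to stay with the XML package used in the question, the following is a sketch under stated assumptions rather than a tested fix for the original file. htmlParse runs the HTML parser, which most likely treats head as the HTML document head and discards one found inside the body; that would explain both the empty result and the missing tag in div[1]. Parsing with xmlParse and passing the namespace explicitly avoids this. The sample string tei_txt, the object names and the prefix t are hypothetical, and the input must be well-formed.

```r
library(XML)

# Hypothetical well-formed sample with two entries (not the asker's file)
tei_txt <- '<TEI xmlns="http://www.tei-c.org/ns/1.0">
<text>
 <body>
  <head rend="Body A">DOCUMENT_TITLE</head>
  <div rend="entry">
    <head rend="time">TIME_A</head>
    <p rend="content"><hi rend="italic"> CONTENT A </hi></p>
  </div>
  <div rend="entry">
    <head rend="time">TIME_B</head>
    <p rend="content"><hi rend="italic"> CONTENT B </hi></p>
  </div>
 </body>
</text>
</TEI>'

# Use the XML parser, not htmlParse, so that <head> is kept as an ordinary element
doc_xml <- xmlParse(tei_txt, asText = TRUE)

# Bind the TEI namespace URI to the prefix "t" (the prefix name is arbitrary)
ns_xml <- c(t = "http://www.tei-c.org/ns/1.0")

# Apply xmlValue to every matching node, then trim surrounding whitespace
times_xml <- trimws(xpathSApply(doc_xml, '//t:div/t:head[@rend="time"]',
                                xmlValue, namespaces = ns_xml))
times_xml
## [1] "TIME_A" "TIME_B"
```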

Note

Lines <- '<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader> ... </teiHeader>
<text>
 <body>
  <head rend="Body A">DOCUMENT_TITLE</head>
    <div rend="entry">
    <head rend="time">TIME_INFORMATION</head>
    <p rend="Body A"> INFORMATION A</p>
    <p rend="content">
        <hi rend="italic"> CONTENT </hi>
            </p>
    </div>
'

Update

I have improved the solution.