I have an XML file like this:
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader> ... </teiHeader>
<text>
<body>
<head rend="Body A">DOCUMENT_TITLE</head>
<div rend="entry">
<head rend="time">TIME_1</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="content"> <hi rend="italic"> CONTENT1 </hi> </p>
</div>
<div rend="entry">
<head rend="time">TIME_2</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="Body A"> INFORMATION A</p>
</div>
<div rend="entry">
<head rend="time">TIME_3</head>
<p rend="Body A"> INFORMATION A</p>
<p rend="content">
<hi rend="italic"> CONTENT3 </hi>
</p>
</div>
<div rend="entry">
<p rend="Body A"> INFORMATION A</p>
<p rend="content">
<hi rend="italic"> CONTENT4 </hi>
</p>
</div>
</body>
</text>
</TEI>
... with many missing elements. I would like to obtain a data.frame with one row for each "div", like the following:
| div | time | content |
|-----|-------|----------|
| 1 | time1 | content1 |
| 2 | time2 | NA |
| 3 | time3 | content3 |
| 4 | NA | content4 |
with NA where the element is missing.
I tried an approach like this:
library(xml2)
library(magrittr)  # provides %>%
data_xml <- read_xml(xmlfile)
div <-xml_find_all(data_xml, xpath = ".//div")
df <- tibble::tibble(
date = div %>% xml_text(),
content = div %>% xml_find_first('./p[@rend="content"]/hi[@rend="italic"]') %>% xml_text()
)
but xml_find_all() returns an empty list.
Following some suggestions, I tried this approach, which does work:
library(XML)
doc <- htmlParse(xmlfile)
div <- getNodeSet(doc, '//div')
dates<- xpathSApply(doc,'//div/text()',xmlValue)
abstracts<-unlist(xpathSApply(doc,'//p[@rend="content"]//hi[@rend="italic"]',xmlValue))
I correctly obtained the strings I wanted, but I lost the correspondence between them: many divs have no content, or no head with time information, so div, dates, and abstracts have different lengths. Any suggestions? TIA
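For reference, here is a minimal sketch of the per-div extraction described above. It rests on two assumptions: that the empty result from xml_find_all() is caused by the TEI default namespace (xmlns="http://www.tei-c.org/ns/1.0"), and that stripping that namespace with xml2's xml_ns_strip() is acceptable for this data. The inline xml_string is a hypothetical, shortened stand-in for xmlfile. Because xml_find_first() returns one result per input node, with a missing node where nothing matches, the columns stay aligned and xml_text() yields NA for the gaps.

```r
library(xml2)

# Hypothetical, shortened stand-in for the real file (assumption: same structure)
xml_string <- '<TEI xmlns="http://www.tei-c.org/ns/1.0">
<text><body>
<div rend="entry"><head rend="time">TIME_1</head><p rend="content"><hi rend="italic"> CONTENT1 </hi></p></div>
<div rend="entry"><head rend="time">TIME_2</head><p rend="Body A"> INFORMATION A</p></div>
<div rend="entry"><p rend="content"><hi rend="italic"> CONTENT4 </hi></p></div>
</body></text>
</TEI>'

doc <- read_xml(xml_string)

# Assumption: the default namespace is what hides <div> from ".//div";
# xml_ns_strip() removes it in place so plain element names match.
xml_ns_strip(doc)

divs <- xml_find_all(doc, ".//div")

# xml_find_first() on a node set returns exactly one entry per div,
# so time and content cannot drift out of step with each other.
time_nodes    <- xml_find_first(divs, "./head[@rend='time']")
content_nodes <- xml_find_first(divs, "./p[@rend='content']/hi[@rend='italic']")

df <- data.frame(
  div     = seq_along(divs),
  time    = xml_text(time_nodes, trim = TRUE),     # NA where there is no head
  content = xml_text(content_nodes, trim = TRUE),  # NA where there is no content
  stringsAsFactors = FALSE
)
print(df)
```

If stripping the namespace is not desirable, the same queries can probably be written with the prefix that xml2 assigns to a default namespace, for example ".//d1:div" and "./d1:head[@rend='time']".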