How do I convert an XML file that looks like this:
<bible>
<b n="Psalm">
<c n="1">
<v n="1"> text text text text </v>
<v n="2"> text text text text </v>
<v n="3"> text text text text </v>
</c>
<c n="2">
<v n="1"> text text text text </v>
<v n="2"> text text text text </v>
<v n="3"> text text text text </v>
</c>
</b>
<b n="Revelation">
<c n="1">
<v n="1"> text text text text </v>
<v n="2"> text text text text </v>
<v n="3"> text text text text </v>
</c>
<c n="2">
<v n="1"> text text text text </v>
<v n="2"> text text text text </v>
<v n="3"> text text text text </v>
</c>
<c n="3">
<v n="1"> text text text text </v>
<v n="2"> text text text text </v>
<v n="3"> text text text text </v>
</c>
</b>
</bible>
Into a dataframe/tibble format that looks like this:
# A tibble: 15 x 4
book chapter verse text
<chr> <dbl> <int> <chr>
1 Psalm 1 1 text text text text
2 Psalm 1 2 text text text text
3 Psalm 1 3 text text text text
4 Psalm 2 1 text text text text
5 Psalm 2 2 text text text text
6 Psalm 2 3 text text text text
7 Revelation 1 1 text text text text
8 Revelation 1 2 text text text text
9 Revelation 1 3 text text text text
10 Revelation 2 1 text text text text
11 Revelation 2 2 text text text text
12 Revelation 2 3 text text text text
13 Revelation 3 1 text text text text
14 Revelation 3 2 text text text text
15 Revelation 3 3 text text text text
I've tried using xmlToDataFrame(nodes = getNodeSet(doc, "/bible")) from the XML package, but I just get one observation with multiple columns. When I tried changing the node level passed to getNodeSet, I got a "duplicate subscripts for columns" error. Thanks.
Consider XSLT, the special-purpose language designed to transform XML files and a sibling to XPath. Specifically, you need to flatten all data down to a single level, such as the verse, by migrating ancestor nodes or attributes (here the book and chapter names) down to become siblings of the verse text. Those values then repeat on every verse, which is what a data frame needs.
Once transformed, you can use the convenience method XML::xmlToDataFrame, which is suitable for flatter XML. R can run XSLT 1.0 with the xslt package (an extension to xml2).
XSLT (save as .xsl, a special .xml file)
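The stylesheet itself is not reproduced in this copy of the answer. What follows is a minimal sketch of a stylesheet that performs the flattening described above, not necessarily the original one. It assumes the input structure from the question (bible/b/c/v, each with an n attribute); the output element names book, chapter, verse and text are chosen here to match the desired column names.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml"/>
  <xsl:strip-space elements="*"/>

  <!-- Rebuild the root so that it holds one flat <v> element per verse. -->
  <xsl:template match="/bible">
    <xsl:copy>
      <xsl:apply-templates select="b/c/v"/>
    </xsl:copy>
  </xsl:template>

  <!-- Pull the book and chapter names down from the ancestor attributes. -->
  <xsl:template match="v">
    <xsl:copy>
      <book><xsl:value-of select="ancestor::b/@n"/></book>
      <chapter><xsl:value-of select="ancestor::c/@n"/></chapter>
      <verse><xsl:value-of select="@n"/></verse>
      <text><xsl:value-of select="normalize-space(.)"/></text>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Each verse in the result carries four child elements, so every verse becomes one row with four columns. normalize-space() trims the padding around the verse text seen in the sample input.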
R (no loops or mapping needed)
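The R code is likewise missing from this copy of the answer. A minimal sketch of the step, assuming the xml2, xslt and XML packages are installed and that a flattening stylesheet producing book, chapter, verse and text elements per verse has been saved to disk. The helper name flatten_bible and the file names in the usage comment are hypothetical.

```r
library(xml2)   # read_xml()
library(xslt)   # xml_xslt()

# Hypothetical helper: both paths are supplied by the caller.
flatten_bible <- function(xml_path, xsl_path) {
  doc   <- read_xml(xml_path)
  style <- read_xml(xsl_path)

  # Apply the XSLT 1.0 stylesheet; the result is an xml2 document.
  flat <- xml_xslt(doc, style)

  # XML::xmlToDataFrame works on XML-package documents, so re-parse the
  # serialized result. Each child of the root becomes one row.
  flat_doc <- XML::xmlParse(as.character(flat), asText = TRUE)
  df <- XML::xmlToDataFrame(flat_doc, stringsAsFactors = FALSE)

  # All columns arrive as character; convert the numeric ones.
  df$chapter <- as.integer(df$chapter)
  df$verse   <- as.integer(df$verse)
  df
}

# Usage (hypothetical file names):
# bible_df <- flatten_bible("input.xml", "flatten.xsl")
# tibble::as_tibble(bible_df)   # optional, if a tibble is wanted
```

The re-parse through XML::xmlParse is there only because xml_xslt returns an xml2 object while xmlToDataFrame belongs to the XML package. Staying entirely in xml2 and building the columns with xml_find_all and xml_text would be an alternative.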