Parsing XML keeps giving nodeset of 0

This is my first ever stack question, so if I did something wrong please tell me.

I am trying to parse data with the xml2 package and possibly the pandas package. Below is an anonymized snapshot of the data.

<?xml version="1.0" encoding="utf-8"?>
<a xmlns:xsd="http://www.y.org/y1/y2" xmlns:xsi="http://www.y.org/y1/y3" xmlns="http://x.nl/">
  <b1>1</b1>
  <b2>2019-07-01T10:01:35.312+02:00</b2>
  <b3>xxx</b3>
  <b4>xxx</b4>
  <b5>
    <c>
      <d1>
      </d1>
      <d2>xxxx</d2>
      <d3>
        <e1>
        </e1>
        <e2>
          <ID>1</ID>
          <f2>XXXXXXXXXXX</f2>
          <event>
            <eventType>start</eventType>
            <eventValue>true</eventValue>
            <timestamp>2019-10-07T13:45:00.00+02.00</timestamp>
          </event>
          <event>
            <eventType>next</eventType>
            <eventValue>itm1</eventValue>
            <timestamp>2019-10-07T13:46:00.00+02.00</timestamp>
          </event>
          <event>
            <eventType>next</eventType>
            <eventValue>itm2</eventValue>
            <timestamp>2019-10-07T13:47:00.00+02.00</timestamp>
          </event>
          <event>
            <eventType>next</eventType>
            <eventValue>itm3</eventValue>
            <timestamp>2019-10-07T13:48:00.00+02.00</timestamp>
          </event>
          

I want to create something like the table below.

+-----------+------------+------------------------------+
| EventType | EventValue |          timestamp           |
+-----------+------------+------------------------------+
| start     | true       | 2019-10-07T13:45:00.00+02.00 |
| next      | itm1       | 2019-10-07T13:46:00.00+02.00 |
| next      | itm2       | 2019-10-07T13:47:00.00+02.00 |
| next      | itm3       | 2019-10-07T13:48:00.00+02.00 |
+-----------+------------+------------------------------+

I tried the xml_find_all function to find all events, but I always get {xml_nodeset (0)}.

x <- xml_find_all(data, "//event", xml_ns(data))

Could someone point me in the right direction, and possibly give me a hint on how to create a data frame like the one above as well? That would be amazing.

Best answer

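The query returns nothing because the root element declares a default namespace (xmlns="http://x.nl/"), which every descendant element inherits. In XPath 1.0 an unprefixed name such as event only matches elements that are in no namespace, so "//event" finds nothing. xml_ns() lists the namespaces in the file together with the prefixes xml2 assigns to them: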

> xml_ns(data)
d1  <-> http://x.nl/
xsd <-> http://www.y.org/y1/y2
xsi <-> http://www.y.org/y1/y3
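
To see the effect in isolation, here is a minimal sketch on a cut-down document. The inline XML is a hypothetical stand-in for the real file that keeps only the default namespace and two events, and doc is an assumed variable name.

```r
library(xml2)

# Hypothetical cut-down document with the same default namespace as the question's file
doc <- read_xml('<a xmlns="http://x.nl/">
  <event><eventType>start</eventType></event>
  <event><eventType>next</eventType></event>
</a>')

# An unprefixed name in XPath 1.0 means "no namespace", so nothing matches
n_unprefixed <- length(xml_find_all(doc, "//event"))    # 0

# With the d1 prefix that xml2 assigns to the default namespace, both events match
n_prefixed <- length(xml_find_all(doc, "//d1:event"))   # 2
```

The same empty result appears in the question's call, because passing xml_ns(data) only supplies the prefix table; the XPath itself still has to use the prefix.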

There are two ways to read nodes from it. The easy way is to strip the default namespace from the document (note that xml_ns_strip() modifies data in place):

library(xml2)
library(magrittr)  # provides the %>% pipe

# data is the document previously read with read_xml()
xml_ns_strip(data)
events <- xml_find_all(data, "//event")
df_event <- 
  data.frame(
    EventType = events %>% xml_find_first("./eventType") %>% xml_text(),
    EventValue = events %>% xml_find_first("./eventValue") %>% xml_text(),
    timestamp = events %>% xml_find_first("./timestamp") %>% xml_text()
  )

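Or you can keep the namespaces and add the prefix to each element name in your XPath. Run this on a freshly read document, because once the default namespace has been stripped the d1 prefix no longer exists:

events <- xml_find_all(data, "//d1:event")  # d1 is the prefix xml2 assigns to the default namespace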
df_event <- 
  data.frame(
    EventType = events %>% xml_find_first("./d1:eventType") %>% xml_text(),
    EventValue = events %>% xml_find_first("./d1:eventValue") %>% xml_text(),
    timestamp = events %>% xml_find_first("./d1:timestamp") %>% xml_text()
  )

Output:

> df_event
  EventType EventValue                    timestamp
1     start       true 2019-10-07T13:45:00.00+02.00
2      next       itm1 2019-10-07T13:46:00.00+02.00
3      next       itm2 2019-10-07T13:47:00.00+02.00
4      next       itm3 2019-10-07T13:48:00.00+02.00
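
If the auto-generated d1 prefix reads poorly, one possible variation is to rename it with xml_ns_rename() and build the columns in a loop instead of repeating the lookup three times. This is only a sketch under stated assumptions: the inline XML is a hypothetical stand-in for the real file, and the prefix x and the names doc, ns, fields and cols are arbitrary choices that do not come from the answer above. The columns here take the element names (eventType, eventValue) rather than the capitalised names used earlier.

```r
library(xml2)

# Hypothetical stand-in document with the same default namespace
doc <- read_xml('<a xmlns="http://x.nl/">
  <event>
    <eventType>start</eventType>
    <eventValue>true</eventValue>
    <timestamp>2019-10-07T13:45:00.00+02.00</timestamp>
  </event>
  <event>
    <eventType>next</eventType>
    <eventValue>itm1</eventValue>
    <timestamp>2019-10-07T13:46:00.00+02.00</timestamp>
  </event>
</a>')

# Rename the default-namespace prefix from d1 to x (arbitrary choice)
ns <- xml_ns_rename(xml_ns(doc), d1 = "x")

events <- xml_find_all(doc, "//x:event", ns)

# One column per child element; a missing child becomes NA in that row
fields <- c("eventType", "eventValue", "timestamp")
cols <- lapply(fields, function(f) {
  xml_text(xml_find_first(events, paste0("./x:", f), ns))
})
df_event <- as.data.frame(setNames(cols, fields), stringsAsFactors = FALSE)
```

Because the custom prefix table is not stored in the document, ns has to be passed to every xml_find_all() and xml_find_first() call that uses the x prefix.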