Cannot Read XML file from https:// site

Running R 3.2.0, R Studio 0.99.441, Windows 7 32-bit, XML package 3.98-1.2

I am trying to read an XML file from the site below using the XML package and xmlTreeParse, but I keep getting an error.

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml

> fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
> doc <- xmlTreeParse(fileURL, useInternal = TRUE)
Error: XML content does not seem to be XML: 'https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml' 

I also tried download.file() with xmlTreeParse:

download.file(fileURL, destfile = "data.xml")
doc <- xmlTreeParse("data.xml", useInternalNodes = TRUE)

When I do this there is no immediate error, but the variable 'doc' has no structure and I'm not sure how to read it from this point.

There are 2 best solutions below

Remove the s from https:

library(XML)
fileURL <- "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
# sub() makes the same change in code for a URL that still starts with https
fileURL <- sub('https', 'http', fileURL)
doc <- htmlParse(fileURL)
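
If the URL is rewritten in code, anchoring the pattern keeps sub() from touching an "https" that might appear later in the string. A minimal sketch in base R (the variable name httpURL is only illustrative, not something from the question):

```r
fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"

# "^https://" matches only the scheme at the very start of the string
httpURL <- sub("^https://", "http://", fileURL)
httpURL
```

Whether the server still answers over plain http is a separate matter; this only shows the string rewrite.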

This worked for me:

library(XML)
fileURL <- "https://www.w3schools.com/xml/simple.xml"
download.file(fileURL, destfile = "data.xml", method = "curl")
doc <- xmlTreeParse("data.xml", useInternalNodes = TRUE)
rootNode <- xmlRoot(doc)
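
Once rootNode exists, the document can be inspected and values pulled out with XPath. A rough sketch, using a small inline XML string so it runs without a download; the element names (menu, food, name, price) are made up for illustration and will differ in the real file:

```r
library(XML)

# Inline stand-in for the downloaded file; element names are hypothetical
xml_text <- '<menu>
  <food><name>Waffles</name><price>5.95</price></food>
  <food><name>Toast</name><price>4.50</price></food>
</menu>'

# asText = TRUE tells the parser the first argument is XML content, not a path
doc <- xmlTreeParse(xml_text, asText = TRUE, useInternalNodes = TRUE)
rootNode <- xmlRoot(doc)

xmlName(rootNode)   # name of the top-level element
names(rootNode)     # names of its child elements

# Extract the text of every <name> and <price> node anywhere in the tree
food_names <- xpathSApply(rootNode, "//name", xmlValue)
prices <- as.numeric(xpathSApply(rootNode, "//price", xmlValue))
```

For the file in the question, the same xpathSApply call should work with whatever element names names(rootNode) and its children report.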