I need to extract data from a large XML file in R. The file size is 60 MB. I use the following R code to download the data from the Internet:
library(XML)
library(httr)
url = "http://hydro1.sci.gsfc.nasa.gov/daac-bin/his/1.0/NLDAS_NOAH_002.cgi"
SOAPAction = "http://www.cuahsi.org/his/1.0/ws/GetSites"
envelope = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<soap:Envelope xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\" xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">\n<soap:Body>\n<GetSites xmlns=\"http://www.cuahsi.org/his/1.0/ws/\">\n<site></site><authToken></authToken>\n</GetSites>\n</soap:Body>\n</soap:Envelope>"
response = POST(url, body = envelope,
                add_headers("Content-Type" = "text/xml", "SOAPAction" = SOAPAction))
status.code = http_status(response)$category
Once I have received the response from the server, I use the following code to parse the data into a data.frame:
# Parse the XML into a tree
WaterML = content(response, as="text")
SOAPdoc = xmlRoot(xmlTreeParse(WaterML, getDTD=FALSE, useInternalNodes = TRUE))
doc = SOAPdoc[[1]][[1]][[1]]
# Allocate a new empty data frame with the same number of rows as the number of sites
N = xmlSize(doc) - 1
df = data.frame(SiteName=rep("",N),
                SiteID=rep(NA, N),
                SiteCode=rep("",N),
                Latitude=rep(NA,N),
                Longitude=rep(NA,N),
                stringsAsFactors=FALSE)
# Populate the data frame with the values
# This loop is VERY SLOW it takes around 10 MINUTES!
start.time = Sys.time()
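for(i in 1:N){
  siteInfo = doc[[i+1]][[1]]
  siteList = xmlToList(siteInfo)
  siteName = siteList$siteName
  sCode = siteList$siteCode
  siteCode = sCode$text
  siteID = ifelse(is.null(sCode$.attrs["siteID"]), siteCode, sCode$.attrs["siteID"])
  latitude = as.numeric(siteList$geoLocation$geogLocation$latitude)
  longitude = as.numeric(siteList$geoLocation$geogLocation$longitude)
  # Store the extracted values in row i of the data frame
  df$SiteName[i] = siteName
  df$SiteID[i] = siteID
  df$SiteCode[i] = siteCode
  df$Latitude[i] = latitude
  df$Longitude[i] = longitude
}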
end.time = Sys.time()
time.taken = end.time - start.time
time.taken
The for loop that I use to parse the XML into a data.frame is very slow. It takes around 10 minutes to complete. Is there any way to make the loop faster?
I was able to get better performance by using XPath expressions to extract the information you want. Each of the calls to xpathSApply takes ~20 seconds on my laptop, so all the commands complete in less than 2 minutes.
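A minimal sketch of the XPath approach, run on a small inline sample rather than the real server response. The sample document, the sr prefix, and the WaterML 1.0 namespace URI are assumptions made for illustration, so check them against the XML you actually receive. The idea is to make one vectorised xpathSApply call per column instead of converting each site to a list inside a loop.

```r
library(XML)

# Tiny stand-in for the server response (assumed structure and namespace).
sample_xml <- '<sitesResponse xmlns="http://www.cuahsi.org/waterML/1.0/">
  <site><siteInfo>
    <siteName>Site A</siteName>
    <siteCode siteID="1">A001</siteCode>
    <geoLocation><geogLocation>
      <latitude>25.125</latitude><longitude>-124.875</longitude>
    </geogLocation></geoLocation>
  </siteInfo></site>
  <site><siteInfo>
    <siteName>Site B</siteName>
    <siteCode>B002</siteCode>
    <geoLocation><geogLocation>
      <latitude>25.25</latitude><longitude>-124.75</longitude>
    </geogLocation></geoLocation>
  </siteInfo></site>
</sitesResponse>'

doc <- xmlParse(sample_xml, asText = TRUE)

# The elements live in a default namespace, so XPath needs an explicit prefix.
ns <- c(sr = "http://www.cuahsi.org/waterML/1.0/")

# One pass over the document per column.
siteName  <- xpathSApply(doc, "//sr:siteName", xmlValue, namespaces = ns)
siteCode  <- xpathSApply(doc, "//sr:siteCode", xmlValue, namespaces = ns)
siteID    <- xpathSApply(doc, "//sr:siteCode", xmlGetAttr, "siteID",
                         default = NA_character_, namespaces = ns)
latitude  <- xpathSApply(doc, "//sr:latitude", xmlValue, namespaces = ns)
longitude <- xpathSApply(doc, "//sr:longitude", xmlValue, namespaces = ns)

df <- data.frame(SiteName  = siteName,
                 # Fall back to the site code when the siteID attribute is absent
                 SiteID    = ifelse(is.na(siteID), siteCode, siteID),
                 SiteCode  = siteCode,
                 Latitude  = as.numeric(latitude),
                 Longitude = as.numeric(longitude),
                 stringsAsFactors = FALSE)
print(df)
```

This builds the columns independently, so it relies on every site having each element exactly once. If some sites can lack a name or a coordinate, the vectors will differ in length and data.frame will stop with an error, which at least makes the mismatch visible.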