R - Reading multiple xml files with xml2


Using xml2, I have written code that transforms one XML file into the data frame I need. I now need to repeat this for the other 1218 XML files in my folder, but I am struggling to work out where to start. I know I need to list the files:

files <- list.files(pattern = "\\.xml$")

I assume I then need a loop or sapply(), but I'm not sure how to write it. Any advice would be much appreciated.

Code so far:

library(xml2)
library(dplyr)

xmlimport <- read_xml("16770601.xml")
class(xmlimport)
trialaccounts <- xmlimport %>% xml_find_all('//div1[@type="trialAccount"]')
defendants <- NULL
# seq_along() skips the loop when nothing matches; 1:length(x) would run over c(1, 0)
for (i in seq_along(trialaccounts)) {
  trialid <- trialaccounts[[i]] %>% xml_attr("id")
  year <- trialaccounts[[i]] %>% xml_find_first('.//interp[@type="year"]') %>% xml_attr("value")
  genderdefendants <- trialaccounts[[i]] %>%
    xml_find_all('.//persName[@type="defendantName"]/interp[@type="gender"]') %>%
    xml_attr("value")
  descrip <- trialaccounts[[i]] %>%
    xml_find_all('.//persName[@type="defendantName"]') %>%
    xml_text(trim = TRUE)
  verdict <- trialaccounts[[i]] %>%
    xml_find_all('.//interp[@type="verdictCategory"]') %>% xml_attr("value")
  context <- xml_text(trialaccounts[[i]])
  for (j in seq_along(genderdefendants)) {
    defendants <- defendants %>%
      bind_rows(tibble(defendantid = i, trial_id = trialid, year_tried = year, description = descrip, verdict_result = verdict, info = context, gender = genderdefendants[j]))
  }
}

Best answer

I would recommend writing a function that parses one XML file and then using the purrr package to map it over your file list:

library(xml2)
library(dplyr)
library(purrr)

# Parse one XML file (x is its path) and return the defendants as a tibble
my_xml_reading_function <- function(x) {
  xmlimport <- read_xml(x)
  trialaccounts <- xmlimport %>% xml_find_all('//div1[@type="trialAccount"]')
  defendants <- NULL
  # seq_along() skips the loop for files without trial accounts; 1:length(x) would run over c(1, 0)
  for (i in seq_along(trialaccounts)) {
    trialid <- trialaccounts[[i]] %>% xml_attr("id")
    year <- trialaccounts[[i]] %>% xml_find_first('.//interp[@type="year"]') %>% xml_attr("value")
    genderdefendants <- trialaccounts[[i]] %>%
      xml_find_all('.//persName[@type="defendantName"]/interp[@type="gender"]') %>%
      xml_attr("value")
    descrip <- trialaccounts[[i]] %>%
      xml_find_all('.//persName[@type="defendantName"]') %>%
      xml_text(trim = TRUE)
    verdict <- trialaccounts[[i]] %>%
      xml_find_all('.//interp[@type="verdictCategory"]') %>% xml_attr("value")
    context <- xml_text(trialaccounts[[i]])
    for (j in seq_along(genderdefendants)) {
      defendants <- defendants %>%
        bind_rows(tibble(defendantid = i, trial_id = trialid, year_tried = year, description = descrip, verdict_result = verdict, info = context, gender = genderdefendants[j]))
    }
  }
  # NULL when the file has no matching trial accounts
  return(defendants)
}

result <- map(files, ~my_xml_reading_function(.x))

This gives you a list with one element per file (1218 here). You can access the first result with result[[1]]. If you would rather combine all results into one table, use:

result <- map_dfr(files, ~my_xml_reading_function(.x))
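
With this many files it can be useful to know which file each row came from, and to keep going if one file fails to parse. The following is only a sketch under assumptions: the sample files, the xml_demo folder, and the simplified read_trial_ids() function are hypothetical stand-ins for your data and for my_xml_reading_function(). It relies on purrr's possibly() and on the .id argument of map_dfr().

```r
library(xml2)
library(dplyr)
library(purrr)

# Hypothetical sample data: two well-formed files and one deliberately broken file
xml_dir <- file.path(tempdir(), "xml_demo")
dir.create(xml_dir, showWarnings = FALSE)
writeLines('<root><div1 type="trialAccount" id="t1"/></root>',
           file.path(xml_dir, "a.xml"))
writeLines('<root><div1 type="trialAccount" id="t2"/><div1 type="trialAccount" id="t3"/></root>',
           file.path(xml_dir, "b.xml"))
writeLines('<root><div1', file.path(xml_dir, "broken.xml"))

# Simplified stand-in for my_xml_reading_function(): one row per trial account
read_trial_ids <- function(path) {
  doc <- read_xml(path)
  trials <- xml_find_all(doc, '//div1[@type="trialAccount"]')
  tibble(trial_id = xml_attr(trials, "id"))
}

# full.names = TRUE returns paths that work from any working directory
files <- list.files(xml_dir, pattern = "\\.xml$", full.names = TRUE)

# possibly() returns NULL instead of stopping when a file cannot be parsed
safe_read <- possibly(read_trial_ids, otherwise = NULL)

# Naming the inputs lets .id record the source file of every row;
# NULL results from failed files are dropped when the rows are bound
result <- files %>%
  set_names(basename(files)) %>%
  map_dfr(safe_read, .id = "source_file")
```

Files that fail are silently skipped here, so comparing unique(result$source_file) against basename(files) afterwards shows which ones need a closer look.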