I am trying to extract data from https://prtr.defra.gov.uk/full-dataset and the file is called uk_prtr_dataset_2021.xml
library(xml2)
library(dplyr)
xml_file <- read_xml('uk_prtr_dataset_2021.xml')
I want to extract following nodes
node_vec <- c("ParentCompanyName", "FacilityName", "LongitudeMeasure", "LatitudeMeasure", "MainEconomicActivityName")
for(node in seq_along(node_vec)){
node_ref <- node_vec[node]
var_ref <- paste0(".//rsm:",node_ref)
temp <- xml_file %>%
xml_find_all(var_ref) %>%
as_list() %>%
simplify() %>%
enframe() %>%
unnest(value) %>%
unnest(value)
print(paste0(node_ref,": ",nrow(temp)))
}
[1] "ParentCompanyName: 6305"
[1] "FacilityName: 6306"
[1] "LongitudeMeasure: 6305"
[1] "LatitudeMeasure: 6305"
[1] "MainEconomicActivityName: 6305"
However, I see that FacilityName
has one extra entry due to which I cannot bind the 5 columns together. Is there anyway I can figure out which is the extra entry so that I can remove it.
EDIT: A small part of XML
<rsm:PollutantReleaseAndTransferReport xmlns:rsm="urn:eu:com:env:prtr:data:standard:2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:eu:com:env:prtr:data:standard:2 http://www.eionet.europa.eu/schemas/eprtr/PollutantReleaseAndTransferReport_2p0.xsd">rsm:ReportingYear2021</rsm:ReportingYear>rsm:CountryIDUK</rsm:CountryID>rsm:CoordinateSystemIDEPSG:4326</rsm:CoordinateSystemID>rsm:RemarkTextNone</rsm:RemarkText>rsm:CompetentAuthorityPartyrsm:NameBEIS</rsm:Name>rsm:Addressrsm:StreetName1 Victoria Street</rsm:StreetName>rsm:CityNameLondon</rsm:CityName>rsm:PostcodeCodeSW1H 0ET</rsm:PostcodeCode></rsm:Address>rsm:TelephoneCommunicationrsm:CompleteNumberText00 44 774 169 9372</rsm:CompleteNumberText></rsm:TelephoneCommunication>rsm:FaxCommunicationrsm:CompleteNumberText--</rsm:CompleteNumberText></rsm:FaxCommunication>rsm:EmailCommunicationrsm:[email protected]</rsm:EmailURIID></rsm:EmailCommunication>rsm:ContactPersonNameShamim Choudhury</rsm:ContactPersonName></rsm:CompetentAuthorityParty>rsm:CompetentAuthorityPartyrsm:NameE1</rsm:Name>rsm:Addressrsm:StreetNameAll correspondence to: Defra, Industrial Pollution, 5F Ergon House, Horseferry Rd. London.</rsm:StreetName>rsm:CityNameLondon</rsm:CityName>rsm:PostcodeCodeSW1P 2AL</rsm:PostcodeCode></rsm:Address>rsm:TelephoneCommunicationrsm:CompleteNumberTextChange Me</rsm:CompleteNumberText></rsm:TelephoneCommunication>rsm:FaxCommunicationrsm:CompleteNumberTextChange Me</rsm:CompleteNumberText></rsm:FaxCommunication>rsm:EmailCommunicationrsm:[email protected]</rsm:EmailURIID></rsm:EmailCommunication>rsm:ContactPersonNameNational Data Manager</rsm:ContactPersonName></rsm:CompetentAuthorityParty>rsm:CompetentAuthorityPartyrsm:NameE10</rsm:Name>rsm:Addressrsm:StreetNameAll correspondence to: Defra, Industrial Pollution, 5F Ergon House, Horseferry Rd. London.</rsm:StreetName>rsm:CityNameLondon</rsm:CityName>rsm:PostcodeCodeSW1P 2AL</rsm:PostcodeCode></rsm:Address>rsm:TelephoneCommunicationrsm:CompleteNumberTextChange Me</rsm:CompleteNumberText></rsm:TelephoneCommunication>rsm:FaxCommunicationrsm:CompleteNumberTextChange Me</rsm:CompleteNumberText></rsm:FaxCommunication>rsm:EmailCommunicationrsm:[email protected]</rsm:EmailURIID></rsm:EmailCommunication>rsm:ContactPersonNameNational Data Manager</rsm:ContactPersonName>
I'd approach this bit differently; this issue resolves on its own when we move from building column vectors to trimmed-down FacilityReport records (i.e. rows for future data.frame / tibble), if something is missing, we'll just have
NA
values in those specific rows instead of varying length column vectors.We can start by finding all
rsm:FacilityReport
nodes; to speed things up a bit, we'll remove all subnodes that are not needed. From there, we can turnreports
node-set to a list and create a nested tibble, hoist nested values.Resulting frame along with the record that caused issues:
Created on 2023-11-22 with reprex v2.0.2
A
rsm:FacilityReport
node for reference: