unequal vector in xml from different nodes

54 Views Asked by At

I am trying to extract data from https://prtr.defra.gov.uk/full-dataset and the file is called uk_prtr_dataset_2021.xml

library(xml2)
library(dplyr) 

xml_file <- read_xml('uk_prtr_dataset_2021.xml') 

I want to extract following nodes

node_vec <- c("ParentCompanyName", "FacilityName", "LongitudeMeasure", "LatitudeMeasure", "MainEconomicActivityName")

for(node in seq_along(node_vec)){

  node_ref <- node_vec[node]
  var_ref <- paste0(".//rsm:",node_ref)

  temp <- xml_file %>%   
           xml_find_all(var_ref) %>%
           as_list() %>%
           simplify() %>%
           enframe() %>%
           unnest(value) %>%
           unnest(value)

print(paste0(node_ref,": ",nrow(temp)))
}

[1] "ParentCompanyName: 6305"
[1] "FacilityName: 6306"
[1] "LongitudeMeasure: 6305"
[1] "LatitudeMeasure: 6305"
[1] "MainEconomicActivityName: 6305"

However, I see that FacilityName has one extra entry due to which I cannot bind the 5 columns together. Is there anyway I can figure out which is the extra entry so that I can remove it.

EDIT: A small part of XML

<rsm:PollutantReleaseAndTransferReport xmlns:rsm="urn:eu:com:env:prtr:data:standard:2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:eu:com:env:prtr:data:standard:2 http://www.eionet.europa.eu/schemas/eprtr/PollutantReleaseAndTransferReport_2p0.xsd">rsm:ReportingYear2021</rsm:ReportingYear>rsm:CountryIDUK</rsm:CountryID>rsm:CoordinateSystemIDEPSG:4326</rsm:CoordinateSystemID>rsm:RemarkTextNone</rsm:RemarkText>rsm:CompetentAuthorityPartyrsm:NameBEIS</rsm:Name>rsm:Addressrsm:StreetName1 Victoria Street</rsm:StreetName>rsm:CityNameLondon</rsm:CityName>rsm:PostcodeCodeSW1H 0ET</rsm:PostcodeCode></rsm:Address>rsm:TelephoneCommunicationrsm:CompleteNumberText00 44 774 169 9372</rsm:CompleteNumberText></rsm:TelephoneCommunication>rsm:FaxCommunicationrsm:CompleteNumberText--</rsm:CompleteNumberText></rsm:FaxCommunication>rsm:EmailCommunicationrsm:[email protected]</rsm:EmailURIID></rsm:EmailCommunication>rsm:ContactPersonNameShamim Choudhury</rsm:ContactPersonName></rsm:CompetentAuthorityParty>rsm:CompetentAuthorityPartyrsm:NameE1</rsm:Name>rsm:Addressrsm:StreetNameAll correspondence to: Defra, Industrial Pollution, 5F Ergon House, Horseferry Rd. London.</rsm:StreetName>rsm:CityNameLondon</rsm:CityName>rsm:PostcodeCodeSW1P 2AL</rsm:PostcodeCode></rsm:Address>rsm:TelephoneCommunicationrsm:CompleteNumberTextChange Me</rsm:CompleteNumberText></rsm:TelephoneCommunication>rsm:FaxCommunicationrsm:CompleteNumberTextChange Me</rsm:CompleteNumberText></rsm:FaxCommunication>rsm:EmailCommunicationrsm:[email protected]</rsm:EmailURIID></rsm:EmailCommunication>rsm:ContactPersonNameNational Data Manager</rsm:ContactPersonName></rsm:CompetentAuthorityParty>rsm:CompetentAuthorityPartyrsm:NameE10</rsm:Name>rsm:Addressrsm:StreetNameAll correspondence to: Defra, Industrial Pollution, 5F Ergon House, Horseferry Rd. London.</rsm:StreetName>rsm:CityNameLondon</rsm:CityName>rsm:PostcodeCodeSW1P 2AL</rsm:PostcodeCode></rsm:Address>rsm:TelephoneCommunicationrsm:CompleteNumberTextChange Me</rsm:CompleteNumberText></rsm:TelephoneCommunication>rsm:FaxCommunicationrsm:CompleteNumberTextChange Me</rsm:CompleteNumberText></rsm:FaxCommunication>rsm:EmailCommunicationrsm:[email protected]</rsm:EmailURIID></rsm:EmailCommunication>rsm:ContactPersonNameNational Data Manager</rsm:ContactPersonName>

1

There are 1 best solutions below

0
On BEST ANSWER

I'd approach this bit differently; this issue resolves on its own when we move from building column vectors to trimmed-down FacilityReport records (i.e. rows for future data.frame / tibble), if something is missing, we'll just have NA values in those specific rows instead of varying length column vectors.

We can start by finding all rsm:FacilityReport nodes; to speed things up a bit, we'll remove all subnodes that are not needed. From there, we can turn reports node-set to a list and create a nested tibble, hoist nested values.

library(xml2)
library(dplyr) 
library(tidyr)

tictoc::tic("load and process")
xml_file <- read_xml('uk_prtr_dataset_2021.xml') 

# collect all reports
reports <- xml_find_all(xml_file, "//rsm:FacilityReport")

# collect and remove all other nodes so we could get a clean result from as_list(reports)
to_remove <- xml_find_all(reports, "./*[not(self::rsm:ParentCompanyName or self::rsm:FacilityName or self::rsm:GeographicalCoordinate or self::rsm:MainEconomicActivityName)]")
xml_remove(to_remove)

reports_tbl <- 
  reports |> 
  as_list() |> 
  tibble(report = _) |>
  # extract nested values
  hoist(report, 
        parent = list("ParentCompanyName", 1),
        facility = list("FacilityName", 1),
        main_act = list("MainEconomicActivityName", 1),
        lon = list("GeographicalCoordinate","LongitudeMeasure", 1),
        lat = list("GeographicalCoordinate","LatitudeMeasure", 1),
        ) |>
  select(-report)
tictoc::toc()
#> load and process: 7.28 sec elapsed

Resulting frame along with the record that caused issues:

reports_tbl
#> # A tibble: 6,306 × 5
#>    parent                 facility                          main_act lon   lat  
#>    <chr>                  <chr>                             <chr>    <chr> <chr>
#>  1 IQE (Europe) Ltd       IQE (Europe) Limited, St Mellons… Manufac… -3.0… 51.5…
#>  2 G B Davies & Co        G B Davies & Co, Plas Bach Farm   Raising… -3.1… 52.7…
#>  3 Hook 2 Sisters Limited Hook 2 Sisters Limited, Merllyn … Raising… -3.2… 53.2…
#>  4 Hook 2 Sisters Limited Hook 2 Sisters Limited, Morlais   Raising… -4.3… 53.1…
#>  5 Mills Poultry Ltd.     Mills Poultry Ltd., Pentreucha F… Raising… -3.4… 52.6…
#>  6 Mills Poultry Ltd.     Mills Poultry Ltd., Minffordd Fa… Raising… -3.4… 52.6…
#>  7 Mr Philip Weale        P J Weale, Highfield Poultry Farm Raising… -4.0… 51.5…
#>  8 Uniper UK Limited      Connah'S Quay Power Station       Product… -3.0… 53.2…
#>  9 Glas Cymru Cyfyngedig  Nash Stw - Final Effluent         Sewerage -2.9… 51.5…
#> 10 Glas Cymru Cyfyngedig  Five Fords Stw                    Sewerage -2.9… 53.0…
#> # ℹ 6,296 more rows

# offending FacilityReport:
reports_tbl[!complete.cases(reports_tbl),]
#> # A tibble: 1 × 5
#>   parent facility                       main_act lon   lat  
#>   <chr>  <chr>                          <chr>    <chr> <chr>
#> 1 <NA>   Newly created but not approved <NA>     <NA>  <NA>

Created on 2023-11-22 with reprex v2.0.2

A rsm:FacilityReport node for reference:

    <rsm:FacilityReport>
        <rsm:NationalID>Wales_KP3235SS</rsm:NationalID>
        <rsm:PreviousNationalID>
            <rsm:NationalID>Wales_KP3235SS</rsm:NationalID>
            <rsm:ReportingYear>2020</rsm:ReportingYear>
        </rsm:PreviousNationalID>
        <rsm:ParentCompanyName>IQE (Europe) Ltd</rsm:ParentCompanyName>
        <rsm:FacilityName>IQE (Europe) Limited, St Mellons Semiconductor Plant</rsm:FacilityName>
        <rsm:Address>
            <rsm:StreetName>Cypress Drive, Cardiff</rsm:StreetName>
            <rsm:BuildingNumber>--</rsm:BuildingNumber>
            <rsm:CityName>Cardiff</rsm:CityName>
            <rsm:PostcodeCode>CF3 0LW</rsm:PostcodeCode>
        </rsm:Address>
        <rsm:GeographicalCoordinate>
            <rsm:LongitudeMeasure>-3.085770</rsm:LongitudeMeasure>
            <rsm:LatitudeMeasure>51.530800</rsm:LatitudeMeasure>
        </rsm:GeographicalCoordinate>
        <rsm:RiverBasinDistrictID>UK09</rsm:RiverBasinDistrictID>
        <rsm:NACEMainEconomicActivityCode>27.90</rsm:NACEMainEconomicActivityCode>
        <rsm:MainEconomicActivityName>Manufacture of other electrical equipment</rsm:MainEconomicActivityName>
        <rsm:CompetentAuthorityPartyName>NRW02</rsm:CompetentAuthorityPartyName>
        <rsm:NutsRegionID>UKL2</rsm:NutsRegionID>
        <rsm:ConfidentialIndicator>0</rsm:ConfidentialIndicator>
        <rsm:ProtectVoluntaryData>0</rsm:ProtectVoluntaryData>
        <rsm:Activity>
            <rsm:RankingNumeric>1</rsm:RankingNumeric>
            <rsm:AnnexIActivityCode>4.(b).(v)</rsm:AnnexIActivityCode>
        </rsm:Activity>
        <rsm:PollutantRelease>
            <rsm:MediumCode>AIR</rsm:MediumCode>
            <rsm:PollutantCode>NH3</rsm:PollutantCode>
            <rsm:MethodBasisCode>C</rsm:MethodBasisCode>
            <rsm:MethodUsed>
                <rsm:MethodTypeCode>OTH</rsm:MethodTypeCode>
                <rsm:Designation>none specified</rsm:Designation>
            </rsm:MethodUsed>
            <rsm:TotalQuantity unitCode="KGM">1.98</rsm:TotalQuantity>
            <rsm:AccidentalQuantity unitCode="KGM">0.00</rsm:AccidentalQuantity>
            <rsm:ConfidentialIndicator>0</rsm:ConfidentialIndicator>
            <rsm:RemarkText>significant drop 2020 v 2021 is due to vastly decreased ammonia emissions to air.</rsm:RemarkText>
        </rsm:PollutantRelease>
        <rsm:PollutantRelease>
            <rsm:MediumCode>AIR</rsm:MediumCode>
            <rsm:PollutantCode>AS AND COMPOUNDS</rsm:PollutantCode>
            <rsm:MethodBasisCode>C</rsm:MethodBasisCode>
            <rsm:MethodUsed>
                <rsm:MethodTypeCode>OTH</rsm:MethodTypeCode>
                <rsm:Designation>none specified</rsm:Designation>
            </rsm:MethodUsed>
            <rsm:TotalQuantity unitCode="KGM">6.56</rsm:TotalQuantity>
            <rsm:AccidentalQuantity unitCode="KGM">0.00</rsm:AccidentalQuantity>
            <rsm:ConfidentialIndicator>0</rsm:ConfidentialIndicator>
        </rsm:PollutantRelease>
        <rsm:WasteTransfer>
            <rsm:WasteTypeCode>NON-HW</rsm:WasteTypeCode>
            <rsm:WasteTreatmentCode>D</rsm:WasteTreatmentCode>
            <rsm:Quantity unitCode="TNE">2.60</rsm:Quantity>
            <rsm:MethodBasisCode>C</rsm:MethodBasisCode>
            <rsm:MethodUsed>
                <rsm:MethodTypeCode>OTH</rsm:MethodTypeCode>
                <rsm:Designation>none specified</rsm:Designation>
            </rsm:MethodUsed>
            <rsm:ConfidentialIndicator>0</rsm:ConfidentialIndicator>
            <rsm:RemarkText>2021 non-hazardous waste to landfill.</rsm:RemarkText>
        </rsm:WasteTransfer>
        <rsm:WasteTransfer>
            <rsm:WasteTypeCode>NON-HW</rsm:WasteTypeCode>
            <rsm:WasteTreatmentCode>R</rsm:WasteTreatmentCode>
            <rsm:Quantity unitCode="TNE">7.56</rsm:Quantity>
            <rsm:MethodBasisCode>C</rsm:MethodBasisCode>
            <rsm:MethodUsed>
                <rsm:MethodTypeCode>OTH</rsm:MethodTypeCode>
                <rsm:Designation>none specified</rsm:Designation>
            </rsm:MethodUsed>
            <rsm:ConfidentialIndicator>0</rsm:ConfidentialIndicator>
            <rsm:RemarkText>waste to energy - 5107kg&#13;
recycling - 2576kg</rsm:RemarkText>
        </rsm:WasteTransfer>
        <rsm:WasteTransfer>
            <rsm:WasteTypeCode>HWIC</rsm:WasteTypeCode>
            <rsm:WasteTreatmentCode>D</rsm:WasteTreatmentCode>
            <rsm:Quantity unitCode="TNE">236</rsm:Quantity>
            <rsm:MethodBasisCode>C</rsm:MethodBasisCode>
            <rsm:MethodUsed>
                <rsm:MethodTypeCode>OTH</rsm:MethodTypeCode>
                <rsm:Designation>none specified</rsm:Designation>
            </rsm:MethodUsed>
            <rsm:ConfidentialIndicator>0</rsm:ConfidentialIndicator>
        </rsm:WasteTransfer>
    </rsm:FacilityReport>