R - Why can't I match country codes to custom dictionary?


I'm working with GTAP data and would like to combine it with other datasets. I am trying to find a way to use country IDs/codes, and it seems one option is the R package countrycode. However, GTAP is not included in the package's built-in code list. I tried to create a custom dictionary, but without success.

Example gtap data:

gtap <- structure(list(COMM = c("coa", "coa", "coa", "coa", "coa", "coa"
), Source = c("afg", "afg", "afg", "afg", "afg", "afg"), Destination = c("afg", 
"alb", "are", "arg", "arm", "aus"), TotValue = c(9.99999997475243e-07, 
7.83022114774212e-05, 0.00216353917494416, 0.000611430441495031, 
2.76709855029367e-08, 2.72226079687243e-05)), row.names = c(NA, 
6L), class = "data.frame")

This is what I've tried:

library(countrycode)
library(tidyverse)

get_dictionary()

cd <- get_dictionary("gtap10")

gtap_iso3c <- gtap %>% 
  mutate(countrycode(Source, "gtap.cha", "iso3c"))
Error in `mutate()`:
ℹ In argument: `countrycode(Source, "gtap.cha", "iso3c")`.
Caused by error in `countrycode()`:
! The `origin` argument must be a string of length 1 equal to one of these values: cctld, country.name, country.name.de, country.name.fr, country.name.it, cowc, cown, dhs, ecb, eurostat, fao, fips, gaul, genc2c, genc3c, genc3n, gwc, gwn, imf, ioc, iso2c, iso3c, iso3n, p5c, p5n, p4c, p4n, un, un_m49, unicode.symbol, unhcr, unpd, vdem, wb, wb_api2c, wb_api3c, wvs, country.name.en.regex, country.name.de.regex, country.name.fr.regex, country.name.it.regex.
Run `rlang::last_trace()` to see where the error occurred.

Answer by CJ Yetman:

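First of all, to use a custom dictionary with countrycode() you must pass it explicitly via the custom_dict argument, e.g. custom_dict = cd, where cd is a data frame containing the matching codes/names. Calling get_dictionary() only returns that data frame; it does not add "gtap.cha" to the built-in code list, which is why the origin argument is rejected in your call. With custom_dict set, origin and destination must both be column names of cd.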
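As a minimal sketch of the custom_dict mechanism, the example below uses a small hand-made dictionary rather than "gtap10". The data frame toy_dict and its column name my.code are hypothetical, chosen only for illustration; the three lowercase codes are taken from the question's sample data.

```r
library(countrycode)

# Hypothetical toy dictionary (not part of countrycode):
# one row per country, so every value in the origin column is unique.
toy_dict <- data.frame(
  my.code = c("afg", "alb", "are"),
  iso3c   = c("AFG", "ALB", "ARE"),
  stringsAsFactors = FALSE
)

# origin and destination refer to column names of the custom dictionary
iso3c_codes <- countrycode(
  c("afg", "alb", "are"),
  origin      = "my.code",
  destination = "iso3c",
  custom_dict = toy_dict
)

iso3c_codes
```

The same pattern applies to any data frame you build yourself, as long as the origin column has no duplicates and its values match the case of the codes you pass in.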

However, the "gtap10" custom dictionary you are using is not suitable for matching "gtap.cha" to "iso3c", for two reasons. First, it does not contain iso3c codes. Second, the "gtap.cha" column contains numerous duplicate values, so it cannot be used as an "origin". For example, if you were going from gtap.cha -> country.name, "AUS" would result in multiple matches: Australia, Christmas Island, Cocos (Keeling) Islands, etc. Note also that the dictionary stores these codes in upper case, whereas the codes in your sample data are lower case.

dplyr::tibble(countrycode::get_dictionary("gtap10"))
#> # A tibble: 244 × 5
#>    country.name             country.name.en.regex    gtap.name gtap.num gtap.cha
#>    <chr>                    <chr>                    <chr>        <int> <chr>   
#>  1 Australia                "australia"              Australia        1 AUS     
#>  2 Christmas Island         "christmas"              Australia        1 AUS     
#>  3 Cocos (Keeling) Islands  "\\bcocos|keeling"       Australia        1 AUS     
#>  4 Heard & McDonald Islands "heard.*mcdonald"        Australia        1 AUS     
#>  5 Norfolk Island           "norfolk"                Australia        1 AUS     
#>  6 New Zealand              "new.?zealand"           New Zeal…        2 NZL     
#>  7 American Samoa           "^(?=.*americ).*samoa"   Rest of …        3 XOC     
#>  8 Cook Islands             "\\bcook"                Rest of …        3 XOC     
#>  9 Fiji                     "fiji"                   Rest of …        3 XOC     
#> 10 French Polynesia         "french.?polynesia|tahi… Rest of …        3 XOC     
#> # ℹ 234 more rows
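The duplicate problem visible in this output can be checked directly, and the mapping still works in the opposite direction (many countries to one GTAP region). The sketch below is an illustration under assumptions, not the package's recommended workflow: gtap10_head is a hand-typed copy of the first six rows shown above, and the iso3c column is added by hand for the example, since the "gtap10" dictionary itself has no such column.

```r
library(countrycode)

# Hand-typed copy of the first six rows printed above.
# The iso3c column is an assumption added for this example only.
gtap10_head <- data.frame(
  country.name = c("Australia", "Christmas Island", "Cocos (Keeling) Islands",
                   "Heard & McDonald Islands", "Norfolk Island", "New Zealand"),
  iso3c        = c("AUS", "CXR", "CCK", "HMD", "NFK", "NZL"),
  gtap.cha     = c("AUS", "AUS", "AUS", "AUS", "AUS", "NZL"),
  stringsAsFactors = FALSE
)

# Codes that occur more than once cannot identify a single country
dup_codes <- unique(gtap10_head$gtap.cha[duplicated(gtap10_head$gtap.cha)])
dup_codes

# Going from a unique code (iso3c) to the GTAP region is unambiguous
regions <- countrycode(
  c("CXR", "NZL"),
  origin      = "iso3c",
  destination = "gtap.cha",
  custom_dict = gtap10_head
)
regions
```

In practice this suggests converting the codes in the other datasets to GTAP regions, rather than converting GTAP regions to individual countries.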