Calculate geographic distances with only Countries given

82 Views Asked by At

I have a dataframe Team5 in R with a column country_code that contains abbreviations of countries. I am using

Team5$Country <- countrycode(Team5$country_code, "iso2c", "country.name")

from the countrycode package to translate the abbreviations into countries. Now, I want to calculate the geographic distance between the countries (possibly via their centroids) but I don't have any data on longitude oder latitude. How can I calculate a rough distance between the countries and place it into a new column Distance?

This is a sample of my dput() :

structure(list(ALL_ID = c(1240, 3640, 3087, 3877, 4317, 2671, 
1398, 9433, 18089, 200, 3137, 7398, 21148, 22187, 167, 5814, 
292, 1908, 1451, 22795), Country = c("Ireland", "Australia", 
"Switzerland", "United Kingdom", "Angola", "Netherlands", "Spain", 
"Spain", "Spain", "Austria", "Indonesia", "France", "Canada", 
NA, "Germany", "Turkey", "South Africa", "Canada", "Australia", 
"Russia")), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L), groups = structure(list(ALL_ID = c(167, 
200, 292, 1240, 1398, 1451, 1908, 2671, 3087, 3137, 3640, 3877, 
4317, 5814, 7398, 9433, 18089, 21148, 22187, 22795), .rows = structure(list(
    15L, 10L, 17L, 1L, 7L, 19L, 18L, 6L, 3L, 11L, 2L, 4L, 5L, 
    16L, 12L, 8L, 9L, 13L, 14L, 20L), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L), .drop = TRUE))
1

There are 1 best solutions below

0
margusl On

You would need some external spatial data source that you could map / join to your existing dataset. There are quite a few of those providing country polygons along with country names/codes, usually distributed as a Shapefile, Geopackage or GeoJSON. Many such sources are also accessible directly through R packages, giscoR being one of those options that allow easy access to Eurstat GISCO data (list of countries is not limited to EU), and it is also delivered with some lightweight offline datasets.

From polygons you can get centroids and build a distance matrix of centroids, this is where sf package comes into play:

library(dplyr)
library(giscoR)
library(sf)
library(ggplot2)

# get rid of groups, NA values and duplicates
countries_clean <- countries %>% 
  ungroup() %>% 
  na.omit() %>% 
  distinct(Country)

# get country polygons from giscoR only for countries in countries_clean$Country;
# returns sf object (dataframe with geometry column(s)),
# store centroids in 2nd geometry column named "centroids"
countries_sf <- giscoR::gisco_get_countries(country = countries_clean$Country) %>% 
  mutate(centroids = st_centroid(geometry))

# resulting sf object: 
countries_sf %>% as_tibble() %>% st_as_sf()
#> Simple feature collection with 15 features and 5 fields
#> Active geometry column: geometry
#> Geometry type: GEOMETRY
#> Dimension:     XY
#> Bounding box:  xmin: -180 ymin: -46.98422 xmax: 180 ymax: 83.11556
#> Geodetic CRS:  WGS 84
#> # A tibble: 15 × 7
#>    CNTR_ID NAME_ENGL         ISO3_CODE CNTR_NAME FID                    geometry
#>    <chr>   <chr>             <chr>     <chr>     <chr>            <GEOMETRY [°]>
#>  1 AT      Austria           AUT       Österrei… AT    POLYGON ((16.94028 48.61…
#>  2 AO      Angola            AGO       Angola    AO    MULTIPOLYGON (((23.99953…
#>  3 AU      Australia         AUS       Australia AU    MULTIPOLYGON (((143.7744…
#>  4 CA      Canada            CAN       Canada-C… CA    MULTIPOLYGON (((-67.2701…
#>  5 CH      Switzerland       CHE       Schweiz-… CH    POLYGON ((9.55872 47.541…
#>  6 ES      Spain             ESP       España    ES    MULTIPOLYGON (((-1.78598…
#>  7 NL      Netherlands       NLD       Nederland NL    MULTIPOLYGON (((7.20894 …
#>  8 FR      France            FRA       France    FR    MULTIPOLYGON (((2.54601 …
#>  9 DE      Germany           DEU       Deutschl… DE    MULTIPOLYGON (((9.42015 …
#> 10 ID      Indonesia         IDN       Indonesia ID    MULTIPOLYGON (((117.586 …
#> 11 IE      Ireland           IRL       Ireland-… IE    MULTIPOLYGON (((-7.25607…
#> 12 RU      Russian Federati… RUS       Россия    RU    MULTIPOLYGON (((130.6749…
#> 13 TR      Turkey            TUR       Türkiye   TR    MULTIPOLYGON (((41.54713…
#> 14 UK      United Kingdom    GBR       United K… UK    MULTIPOLYGON (((0.38347 …
#> 15 ZA      South Africa      ZAF       Suid-Afr… ZA    MULTIPOLYGON (((31.31127…
#> # ℹ 1 more variable: centroids <POINT [°]>

# create a distance matrix of centroids, convert units from m to km and 
# set row & column names to match countries
dist_matrix <- st_distance(countries_sf$centroids) %>% 
  units::set_units(km) %>% 
  `dimnames<-`(list(countries_sf$NAME_ENGL, countries_sf$NAME_ENGL))

Resulting distance matrix:

# upper left corner of resulting dist.matrix:
dist_matrix[1:5,1:5]
#> Units: [km]
#>                Austria    Angola Australia    Canada Switzerland
#> Austria         0.0000  6666.493  14318.65  6481.604    454.6628
#> Angola       6666.4927     0.000  11974.33 12496.396   6634.6329
#> Australia   14318.6547 11974.329      0.00 14592.994  14770.6465
#> Canada       6481.6038 12496.396  14592.99     0.000   6306.7003
#> Switzerland   454.6628  6634.633  14770.65  6306.700      0.0000

# distance between centroids of Austria and Switzerland:
dist_matrix["Austria","Switzerland"]
#> 454.6628 [km]

Adding "a" distance column is bit tricky, if you think it through you'll notice that you have to widen your dataset (add columns), lengthen it (add rows), or use some other means to pack distances for all country pairs (e.g. store lists of distances in a single column).

The most straightforward method is probably just going with a wide format and joining the distance matrix to your original dataset:

# before joining we have to convert out matrix to data.frame first,
# move row labels to column that will be used for joining and
# it would not hurt to rename columns to provide some context
dist_matrix %>% 
  as.data.frame.matrix() %>% 
  tibble::rownames_to_column("Country") %>% 
  rename_with(~ paste("dist_to ", .x), -Country) %>% 
  left_join(countries, .) 
#> Joining with `by = join_by(Country)`
#> # A tibble: 20 × 17
#> # Groups:   ALL_ID [20]
#>    ALL_ID Country      `dist_to  Austria` `dist_to  Angola` `dist_to  Australia`
#>     <dbl> <chr>                     <dbl>             <dbl>                <dbl>
#>  1   1240 Ireland                   1688.             7678.               15666.
#>  2   3640 Australia                14319.            11974.                   0 
#>  3   3087 Switzerland                455.             6635.               14771.
#>  4   3877 United King…              1384.             7623.               15307.
#>  5   4317 Angola                    6666.                0                11974.
#>  6   2671 Netherlands                801.             7268.               14846.
#>  7   1398 Spain                     1643.             6233.               15829.
#>  8   9433 Spain                     1643.             6233.               15829.
#>  9  18089 Spain                     1643.             6233.               15829.
#> 10    200 Austria                      0              6666.               14319.
#> 11   3137 Indonesia                11176.            11007.                3185.
#> 12   7398 France                    1707.             6619.               16009.
#> 13  21148 Canada                    6482.            12496.               14593.
#> 14  22187 <NA>                        NA                NA                   NA 
#> 15    167 Germany                    471.             7077.               14550.
#> 16   5814 Turkey                    1936.             6003.               12526.
#> 17    292 South Africa              8587.             2019.               10309.
#> 18   1908 Canada                    6482.            12496.               14593.
#> 19   1451 Australia                14319.            11974.                   0 
#> 20  22795 Russia                      NA                NA                   NA 
#> # ℹ 12 more variables: `dist_to  Canada` <dbl>, `dist_to  Switzerland` <dbl>,
#> #   `dist_to  Spain` <dbl>, `dist_to  Netherlands` <dbl>,
#> #   `dist_to  France` <dbl>, `dist_to  Germany` <dbl>,
#> #   `dist_to  Indonesia` <dbl>, `dist_to  Ireland` <dbl>,
#> #   `dist_to  Russian Federation` <dbl>, `dist_to  Turkey` <dbl>,
#> #   `dist_to  United Kingdom` <dbl>, `dist_to  South Africa` <dbl>

For reference, polygons and centroids of AT & CH:

countries_sf %>% 
  filter(FID %in% c("AT", "CH")) %>% 
  ggplot(aes(fill = FID)) +
  geom_sf() +
  geom_sf(aes(geometry = centroids)) +
  theme_light()

Input data:

countries <- structure(list(ALL_ID = c(1240, 3640, 3087, 3877, 4317, 2671, 
1398, 9433, 18089, 200, 3137, 7398, 21148, 22187, 167, 5814, 
292, 1908, 1451, 22795), Country = c("Ireland", "Australia", 
"Switzerland", "United Kingdom", "Angola", "Netherlands", "Spain", 
"Spain", "Spain", "Austria", "Indonesia", "France", "Canada", 
NA, "Germany", "Turkey", "South Africa", "Canada", "Australia", 
"Russia")), class = c("grouped_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L), groups = structure(list(ALL_ID = c(167, 
200, 292, 1240, 1398, 1451, 1908, 2671, 3087, 3137, 3640, 3877, 
4317, 5814, 7398, 9433, 18089, 21148, 22187, 22795), .rows = structure(list(
    15L, 10L, 17L, 1L, 7L, 19L, 18L, 6L, 3L, 11L, 2L, 4L, 5L, 
    16L, 12L, 8L, 9L, 13L, 14L, 20L), ptype = integer(0), class = c("vctrs_list_of", 
"vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -20L), .drop = TRUE))

Created on 2023-06-26 with reprex v2.0.2