How can I parse out city and country information from a messy and non-uniform partial address column in an R data frame?


I have a very large data set that contains a messy, non-uniform address field. I only care about extracting a country name from it. Most of the records contain a country and a city, and some contain other information such as street address, state or province. Because there is so much overlap (streets can be named after countries, cities or states), I was thinking of searching for the country and city using the R maps library in conjunction with fuzzyjoin. Here is an abbreviated example of what I tried:

library(dplyr)
library(stringr)
library(maps)
library(fuzzyjoin)

partial_address <-
  c("London 121280 Ontario Canada",
    "Milano, Italy",
    "123 First St Columbus OH USA",
    "Cali-Valle del Cauca Colombia",
    "98765 France Paris",
    "Zurich Zurich Switzerland",
    "Mexico City Mexico",
    "Nagoya 123456 Japan",
    "BEIJING BEIJING CHINA",
    "Thailand 12345 Bangkok",
    "Albania La Guahira Colombia",
    "Conakry, Guinea",
    "Guinea-Bissau, Bissau",
    "Democratic Republic of the Congo Kinshasa")

df <-
  data.frame(partial_address) %>%
  mutate(partial_address = toupper(partial_address))

data(world.cities)

world_cities <-
  distinct(world.cities,
           name,
           country.etc) %>%
  mutate(name = toupper(name),
         country.etc = toupper(country.etc))

df_new <-
  fuzzy_left_join(df,
                  world_cities,
                  by = c("partial_address" = "name",
                         "partial_address" = "country.etc"),
                  match_fun = str_detect)

My output looks like this:

> df_new
                             partial_address        name   country.etc
1               LONDON 121280 ONTARIO CANADA      LONDON        CANADA
2                              MILANO, ITALY       MILAN         ITALY
3               123 FIRST ST COLUMBUS OH USA    COLUMBUS           USA
4              CALI-VALLE DEL CAUCA COLOMBIA        CALI      COLOMBIA
5              CALI-VALLE DEL CAUCA COLOMBIA    COLOMBIA      COLOMBIA
6                         98765 FRANCE PARIS       PARIS        FRANCE
7                  ZURICH ZURICH SWITZERLAND      ZURICH   SWITZERLAND
8                         MEXICO CITY MEXICO MEXICO CITY        MEXICO
9                         MEXICO CITY MEXICO        XICO        MEXICO
10                       NAGOYA 123456 JAPAN         AGO         JAPAN
11                       NAGOYA 123456 JAPAN        NAGO         JAPAN
12                       NAGOYA 123456 JAPAN      NAGOYA         JAPAN
13                     BEIJING BEIJING CHINA     BEIJING         CHINA
14                    THAILAND 12345 BANGKOK     BANGKOK      THAILAND
15               ALBANIA LA GUAHIRA COLOMBIA       ALBAN      COLOMBIA
16               ALBANIA LA GUAHIRA COLOMBIA     ALBANIA      COLOMBIA
17               ALBANIA LA GUAHIRA COLOMBIA    COLOMBIA      COLOMBIA
18                           CONAKRY, GUINEA     CONAKRY        GUINEA
19                     GUINEA-BISSAU, BISSAU      BISSAU GUINEA-BISSAU
20 DEMOCRATIC REPUBLIC OF THE CONGO KINSHASA        <NA>          <NA>

As you can see, it is not exactly what I want: because str_detect matches anywhere in the string, Mexico's city "Xico" matches inside "MEXICO", Japan's cities "Ago" and "Nago" match inside "NAGOYA", and apparently Colombia even has a city named "Colombia". The expected output should look something like this:

> df_new
                             partial_address        name   country.etc
1               LONDON 121280 ONTARIO CANADA      LONDON        CANADA
2                              MILANO, ITALY       MILAN         ITALY
3               123 FIRST ST COLUMBUS OH USA    COLUMBUS           USA
4              CALI-VALLE DEL CAUCA COLOMBIA        CALI      COLOMBIA
5                         98765 FRANCE PARIS       PARIS        FRANCE
6                  ZURICH ZURICH SWITZERLAND      ZURICH   SWITZERLAND
7                         MEXICO CITY MEXICO MEXICO CITY        MEXICO
8                        NAGOYA 123456 JAPAN      NAGOYA         JAPAN
9                      BEIJING BEIJING CHINA     BEIJING         CHINA
10                    THAILAND 12345 BANGKOK     BANGKOK      THAILAND
11               ALBANIA LA GUAHIRA COLOMBIA     ALBANIA      COLOMBIA
12                           CONAKRY, GUINEA     CONAKRY        GUINEA
13                     GUINEA-BISSAU, BISSAU      BISSAU GUINEA-BISSAU
14 DEMOCRATIC REPUBLIC OF THE CONGO KINSHASA    KINSHASA CONGO DEMOCRATIC REPUBLIC

In addition, fuzzy_join tends to be very slow and to use a lot of resources. Any suggestions on how to reduce false-positive matches and improve performance are appreciated. Thank you.

There are 2 answers below.

Answer by Jay Bee (accepted)

One approach could be to take the list of country names from maps, check each one against the strings in your data, and extract the country where there is a match.

# Load required packages.

library(tidyverse)
library(maps)

# Load and reformat data.

partial_address <-
  c("London 121280 Ontario Canada",
    "Milano, Italy",
    "123 First St Columbus OH USA",
    "Cali-Valle del Cauca Colombia",
    "98765 France Paris",
    "Zurich Zurich Switzerland",
    "Mexico City Mexico",
    "Nagoya 123456 Japan",
    "BEIJING BEIJING CHINA",
    "Thailand 12345 Bangkok")

df <-
  data.frame(partial_address) %>%
  mutate(partial_address = toupper(partial_address))

# Get country data.

world_map_data <- map_data("world")

country_names <- unique(world_map_data$region) %>%
  toupper()

# Function to match country data with partial address.

find_country <- function(partial_address, countries) {
  for (country in countries) {
    if (str_detect(partial_address, fixed(country))) {
      return(country)
    }
  }
  return(NA_character_)
}

df$country <- map_chr(df$partial_address, ~find_country(., country_names))

The output of df then looks like this:

                 partial_address     country
1   LONDON 121280 ONTARIO CANADA      CANADA
2                  MILANO, ITALY       ITALY
3   123 FIRST ST COLUMBUS OH USA         USA
4  CALI-VALLE DEL CAUCA COLOMBIA    COLOMBIA
5             98765 FRANCE PARIS      FRANCE
6      ZURICH ZURICH SWITZERLAND SWITZERLAND
7             MEXICO CITY MEXICO      MEXICO
8            NAGOYA 123456 JAPAN       JAPAN
9          BEIJING BEIJING CHINA       CHINA
10        THAILAND 12345 BANGKOK    THAILAND
Answer by Elin

I think you have to know that when parsing data like this, the only realistic approach is to proceed step by step, checking carefully at each point. For example, you mention that Albania is both a country and a city in Colombia, and in fact there are two separate cities in Colombia with that name. There are also four cities called Aksu in Kazakhstan.

In world.cities there are, in fact, 29 strings that are names of both cities and countries.

intersect(world.cities$country.etc, world.cities$name)
 [1] "Jordan"       "Greenland"    "Mexico"       "Colombia"    
 [5] "India"        "Benin"        "Armenia"      "El Salvador" 
 [9] "Venezuela"    "Guatemala"    "China"        "Panama"      
[13] "San Marino"   "Puerto Rico"  "Palau"        "Costa Rica"  
[17] "Togo"         "Andorra"      "Mali"         "Albania"     
[21] "Belize"       "Liberia"      "Gibraltar"    "Tonga"       
[25] "Saint-Martin" "Montenegro"   "Singapore"    "Vatican City"

There are also 11 countries that have only one city:

 [1] "British Virgin Islands" "Easter Island"         
 [3] "Gibraltar"              "Nauru"                 
 [5] "Norfolk Island"         "Pitcairn"              
 [7] "Saint-Barthelemy"       "Saint-Martin"          
 [9] "Seychelles"             "Singapore"             
[11] "Vatican City" 

There are 42,679 non-duplicated name/country combinations and 966 that are duplicated:

nrow(world.cities[!duplicated(world.cities[, c("name", "country.etc")]), ])
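The duplicated count can be checked directly as well (again just a sketch):

# Rows whose name/country combination already appeared earlier in world.cities.
sum(duplicated(world.cities[, c("name", "country.etc")]))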

I'd add a column to your data that indicates, for each record, whether you have completed the match for that record.

So now you have to come up with an attack plan. I'd start by deciding what to do about the non-unique city-country combinations. Maybe decide, for now, to just assign them to the first one mentioned, or to the largest, or something like that.
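For instance, one way to keep only the largest city for each duplicated name/country pair, using the pop column of world.cities, could be (just a sketch):

library(dplyr)
library(maps)

data(world.cities)

# For duplicated name/country combinations, keep the most populous entry.
world_cities_dedup <- world.cities %>%
  arrange(desc(pop)) %>%
  distinct(name, country.etc, .keep_all = TRUE)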

Then go for the easy cases, like the countries that have just one city.

Every time you finish a step, mark the records that have been assigned and remove them from further matching.

Maybe first look for the names with spaces; then, when those are done, look for the ones without spaces.

And so on.
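To make that concrete, here is a minimal sketch of what the loop could look like. It assumes the df and world_cities objects from the question (both already uppercased); the matched flag and the longest-name-first ordering in step 2 are my own additions for illustration:

library(dplyr)
library(stringr)

df$country <- NA_character_
df$matched <- FALSE

# Step 1: the easy cases. For countries with a single city, a hit on either
# the city name or the country name settles the country.
single_city <- world_cities %>%
  group_by(country.etc) %>%
  filter(n() == 1) %>%
  ungroup()

for (i in seq_len(nrow(single_city))) {
  hit <- !df$matched &
    (str_detect(df$partial_address, fixed(single_city$name[i])) |
       str_detect(df$partial_address, fixed(single_city$country.etc[i])))
  df$country[hit] <- single_city$country.etc[i]
  df$matched[hit] <- TRUE
}

# Step 2: the remaining country names, longest first so that, for example,
# "GUINEA-BISSAU" is tried before "GUINEA".
remaining <- unique(world_cities$country.etc)
remaining <- remaining[order(nchar(remaining), decreasing = TRUE)]

for (ctry in remaining) {
  hit <- !df$matched & str_detect(df$partial_address, fixed(ctry))
  df$country[hit] <- ctry
  df$matched[hit] <- TRUE
}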

Without having your data it's hard to say more. I can think of various things I would try to make the process quicker, but they all depend on what the records actually look like.

Update

If you don't need the cities, it is even easier.
For example, you could apply grep() or grepl() for each country name. If a record has two hits, use the one that is not also a city name.
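A rough sketch of that idea, assuming the uppercased df and world_cities from the question; the ambiguous vector is just the intersect() shown earlier:

# Country names, and the subset that are also city names.
countries <- unique(world_cities$country.etc)
ambiguous <- intersect(world_cities$country.etc, world_cities$name)

df$country <- vapply(df$partial_address, function(addr) {
  # All country names found anywhere in this address string.
  hits <- countries[vapply(countries,
                           function(ctry) grepl(ctry, addr, fixed = TRUE),
                           logical(1))]
  # With more than one hit, prefer a name that is not also a city name.
  if (length(hits) > 1) {
    non_city <- setdiff(hits, ambiguous)
    if (length(non_city) > 0) hits <- non_city
  }
  if (length(hits) == 0) NA_character_ else hits[1]
}, character(1), USE.NAMES = FALSE)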

There are a lot of other questions and answers about how to grep multiple strings at once, for example "R: grep multiple strings at once".