I need to join two datasets and the only identifier in both are the company names. For example:
db1 <- tibble(
Company = c('Bombardier Inc.','Honeywell Development Corp','The Pepsi Bottling Group (Canada), Ulc (“Pbgc”)','PepsiCo Canada ULC'),
var1 = 1:4
)
db2 <- tibble(
Name = c('Bombardier Inc.','Honeywell Dev Corp','The Pepsi Bottling Group (Canada), ULC','PepsiCo Canada ULC (“Pcu”)'),
var2 = 6:9
)
Obviously a straightforward dplyr::left_join()
is not going to work. I tried the following which didn't work:
fuzzyjoin::regex_left_join(db1,db2,by=c('Company'='Name'))
# A tibble: 4 x 4
Company var1 Name var2
<chr> <int> <chr> <int>
1 Bombardier Inc. 1 Bombardier Inc. 6
2 Honeywell Development Corp 2 NA NA
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”) 3 NA NA
4 PepsiCo Canada ULC 4 NA NA
I made some progress by removing "nonessential" characters from the names:
db1 <- db1 %>% mutate(Company.alt = str_remove_all(Company,regex(
'The|Canada|Inc|Ltd|Company|\\bCo\\b|Corporation|Corp|Group|ULC|[:punct:]',
ignore_case = T
)) %>% str_squish())
db2 <- db2 %>% mutate(Name.alt = str_remove_all(Name,regex(
'The|Canada|Inc|Ltd|Company|\\bCo\\b|Corporation|Corp|Group|ULC|[:punct:]',
ignore_case = T
)) %>% str_squish())
fuzzyjoin::regex_left_join(db1,db2,by=c('Company.alt'='Name.alt'))
# A tibble: 4 x 6
Company var1 Company.alt Name var2 Name.alt
<chr> <int> <chr> <chr> <int> <chr>
1 Bombardier Inc. 1 Bombardier Bombardier In~ 6 Bombardi~
2 Honeywell Development Corp 2 Honeywell Development Honeywell Dev~ 7 Honeywel~
3 The Pepsi Bottling Group (Canada), Ulc (“Pbgc”) 3 Pepsi Bottling Pbgc The Pepsi Bot~ 8 Pepsi Bo~
4 PepsiCo Canada ULC 4 PepsiCo NA NA NA
But this still left the last row unmatched. For greater clarity, the last row of Company.alt
is PepsiCo
which is not considered a fuzzy match with Name.alt
's last row of PepsiCo Pcu
.
Is there way to successfully left-join the two datasets?
1) phonics The phonics package has numerous methods for approximate matching such as soundex. See the package documentation for other methods.
giving:
2) SQLite SQLite has a built-in soundex function.
giving: