How can I find and replace values between two dataframes in R

197 Views Asked by At

I have a dataframe from tidytext that contains the individual words from some survey free-response comments. It has just shy of 500,000 rows. Being free-response data, it is riddled with typos. Using textclean::replace_misspellings took care of almost 13,000 misspelled words, but there were still ~700 unique misspellings that I manually identified.

I now have a second table with two columns, the first is the misspelling and the second is the correction.

For instance

allComments <- data.frame("Number" = 1:5, "Word" = c("organization","orginization", "oragnization", "help", "hlp"))
misspellings <- data.frame("Wrong" = c("orginization", "oragnization", "hlp"), "Right" = c("organization", "organization", "help"))

How can I replace all the values of allComments$word that match misspellings$wrong with misspellings$right?

I feel like this is probably pretty basic and my R ignorance is showing....

4

There are 4 best solutions below

0
On BEST ANSWER

You can use match to find the index for words from allComments$Word in misspellings$Wrong and then use this index to subset them.

tt <- match(allComments$Word, misspellings$Wrong)
allComments$Word[!is.na(tt)]  <- misspellings$Right[tt[!is.na(tt)]]
allComments
#  Number         Word
#1      1 organization
#2      2 organization
#3      3 organization
#4      4         help
#5      5         help

In case the right word is not already in allComments$Word cast it to a character:

allComments$Word <- as.character(allComments$Word)
0
On

Here is another base R solution using replace()

allComments <- within(allComments, 
                      Word <- replace(Word,
                                      which(!is.na(match(Word,misspellings$Wrong))),
                                      na.omit(misspellings$Right[match(Word,misspellings$Wrong)])))

such that

> allComments
  Number         Word
1      1 organization
2      2 organization
3      3 organization
4      4         help
5      5         help
0
On

Here's a data.table solution:

library(data.table)
setDT(allComments)
setDT(misspellings)
df <- merge.data.table(allComments, misspellings, all.x = T, by.x = "Word", by.y = "Wrong")
df <- df[!(is.na(Right)), Word := Right]
df <- df[, c("Number", "Word")]
df <- df[order(Number)]
df

#    Number         Word
#1:      1  organization
#2:      2  organization
#3:      3  organization
#4:      4          help
#5:      5          help
0
On
allComments %>%
  left_join(misspellings, by = c("Word" = "Wrong")) %>%
  mutate(Word = coalesce(as.character(Right), Word))
#   Number         Word        Right
# 1      1 organization         <NA>
# 2      2 organization organization
# 3      3 organization organization
# 4      4         help         <NA>
# 5      5         help         help

You can, of course, drop the Right column when you're done with it.