R: character matching with dictionary word position


I have two dataframes,

word_table <-
    word_9   word_1   word_3   ...word_random
    word_2   na       na       ...word_random
    word_5   word_3   na       ...word_random

dictionary_words <- word_2 word_3 word_4 word_6 word_7 word_8 word_9 ... word_n

What I am looking for is to match word_table against dictionary_words and replace each word with its position in the dictionary, like this:

result <-
    7    na   2    ...
    1    na   na   ...
    na   2    na   ...

I have tried the pmatch, charmatch, and match functions. They return the correct result when dictionary_words is short, but when it is relatively long (more than 20,000 words), results come back only for the first column, and the remaining columns all become na, like this:

result <-
    7    na   na   ...
    1    na   na   ...
    na   na   na   ...

Is there any other way to do this character matching, for example with one of the apply functions?

sample

## Use `=` (not `<-`) inside data.frame() so the columns are named directly
word_table <- data.frame(word_1 = c("conflict", "", "resolved", "", "", ""),
                         word_2 = c("", "one", "tricky", "one", "", "one"),
                         word_3 = c("thanks", "", "", "comments", "par", ""),
                         word_4 = c("thanks", "", "", "comments", "par", ""),
                         word_5 = c("", "one", "tricky", "one", "", "one"),
                         stringsAsFactors = FALSE)
## Targeted Words
dictionary_words <- data.frame(cbind(c("abovementioned","abundant","conflict", "thanks", "tricky", "one", "two", "three","four", "resolved")))

## convert into matrix (if needed)
word_table <- as.matrix(word_table)
dictionary_words <- as.matrix(dictionary_words)

## pmatch for each element of word_table
# matched_table <- pmatch(word_table, dictionary_words)
# dim(matched_table) <- dim(word_table)
# print(matched_table)

result <- `dim<-`(pmatch(word_table, dictionary_words, duplicates.ok=TRUE), dim(word_table))
print(result) # works fine, but when dictionary_words is large it returns results only for the first column of word_table

There is 1 best solution below

Here is a reproducible example:

word_table <- structure(list(
  V1 = structure(c(3L, 1L, 2L), .Label = c("word_2", "word_5", "word_9"), class = "factor"),
  V2 = structure(c(1L, NA, 2L), .Label = c("word_1", "word_3"), class = "factor"),
  V3 = structure(c(1L, NA, NA), .Label = "word_3", class = "factor"),
  V4 = structure(c(1L, 1L, 1L), .Label = "...word_random", class = "factor")),
  .Names = c("V1", "V2", "V3", "V4"), class = "data.frame", row.names = c(NA, -3L))

dictionary_words <- structure(list(
  V1 = structure(1:7, .Label = c("word_2", "word_3", "word_4", "word_6",
                                 "word_7", "word_8", "word_9"), class = "factor")),
  .Names = "V1", class = "data.frame", row.names = c(NA, -7L))

You can use sapply:

> sapply(word_table, function(x) match(x, dictionary_words[, 1]))
     V1 V2 V3 V4
[1,]  7 NA  2 NA
[2,]  1 NA NA NA
[3,] NA  2 NA NA

or apply if you prefer:

> apply(word_table, 2, function(x) match(x, dictionary_words[, 1]))
     V1 V2 V3 V4
[1,]  7 NA  2 NA
[2,]  1 NA NA NA
[3,] NA  2 NA NA
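
The same idea can be tried on the sample data from the question. This is only a minimal sketch, assuming exact matching is what is wanted and that empty strings should come back as NA. match() compares whole strings, whereas pmatch() also accepts partial matches and returns NA when a partial match is ambiguous. The names by_column, by_matrix and word_matrix are illustrative, and only three of the five sample columns are used to keep it short:

```r
# Sample data from the question (three of the five columns, for brevity).
word_table <- data.frame(
  word_1 = c("conflict", "", "resolved", "", "", ""),
  word_2 = c("", "one", "tricky", "one", "", "one"),
  word_3 = c("thanks", "", "", "comments", "par", ""),
  stringsAsFactors = FALSE
)
dictionary_words <- c("abovementioned", "abundant", "conflict", "thanks",
                      "tricky", "one", "two", "three", "four", "resolved")

# Column-wise exact lookup: position in the dictionary, NA if absent.
by_column <- sapply(word_table, function(x) match(x, dictionary_words))

# Vectorised variant on a character matrix: match() returns a plain vector,
# so the dimensions and column names are put back afterwards.
word_matrix <- as.matrix(word_table)
by_matrix <- match(word_matrix, dictionary_words)
dim(by_matrix) <- dim(word_matrix)
dimnames(by_matrix) <- dimnames(word_matrix)

print(by_column)
```

Both variants should give the same positions. The matrix variant avoids the per-column function call, which may matter when word_table is wide.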