r: pmatch isn't working for big dataframe


I have two data frames: the first (dt) contains only character columns, and the second (TargetWord) is a dictionary that also contains character values. I use pmatch to find which words in dt appear in TargetWord, returning each word's position in TargetWord. This works fine when the data frames are small. With large data frames, however, it returns word positions only for the first column; all other columns come back as NA.

## Data Table
word_1 <- c("conflict","", "resolved", "", "", "")
word_2 <- c("", "one", "tricky", "one", "", "one")
word_3 <- c("thanks","", "", "comments", "par","")
word_4 <- c("thanks","", "", "comments", "par","")
word_5 <- c("", "one", "tricky", "one", "", "one")
dt <- data.frame(word_1, word_2, word_3,word_4, word_5, stringsAsFactors = FALSE)

## Targeted Words
TargetWord <- data.frame(cbind(c("conflict", "thanks", "tricky", "one", "two", "three")))

## convert into matrix (needed)
dt <- as.matrix(dt)
TargetWord <- as.matrix(TargetWord)

result <- `dim<-`(pmatch(dt, TargetWord, duplicates.ok=TRUE), dim(dt))
print(result)

This returns:

     [,1] [,2] [,3] [,4] [,5]
[1,]    1   NA    2    2   NA
[2,]   NA    4   NA   NA    4
[3,]   NA    3   NA   NA    3
[4,]   NA    4   NA   NA    4
[5,]   NA   NA   NA   NA   NA
[6,]   NA    4   NA   NA    4

Now, after reading two .csv files as shown below, I get results only for the first column, whereas I want them for all columns as in the result above. Here dt1 is a 79 x 50 data frame and word_dict is a 13901 x 1 data frame.

#################### on big data #####################################
dt1 <- read.csv("C:/Users/Wonderland/Downloads/string_feature.csv", stringsAsFactors = FALSE)
word_dict <- read.csv("C:/Users/Wonderland/Downloads/word_dict.csv", stringsAsFactors = FALSE)

dt1 <- as.matrix(dt1)
word_dict <- as.matrix(word_dict)

result <- `dim<-`(pmatch(dt1, word_dict, duplicates.ok=TRUE), dim(dt1))
print(result)

There are 2 best solutions below


pmatch currently works only for vectors of up to 100 elements:

pmatch(rep("a", 100), rep("a", 100))
#  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
# [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
# [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
# [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
# [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
# [91]  91  92  93  94  95  96  97  98  99 100

pmatch(rep("a", 101), rep("a", 101))
#  [1]  1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
#[101] NA
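
To see whether a given R installation shows this behaviour, the comparison above can be wrapped in a small helper. This is only a sketch: `count_matched` is a hypothetical name, not part of base R, and the printed counts may differ between R versions, so no expected output is shown.

```r
# Hypothetical helper: how many of n identical inputs does pmatch() resolve?
count_matched <- function(n) {
  x <- rep("a", n)
  sum(!is.na(pmatch(x, x)))
}

# A tiny case for reference, then sizes around the reported threshold
small <- count_matched(5)
sizes <- c(100, 101, 1000)
matched <- vapply(sizes, count_matched, integer(1))
print(data.frame(size = sizes, matched = matched))
```

If `matched` is smaller than `size` for the larger inputs, the installation is affected.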

Try with apply:

apply(dt, 2, function(x) pmatch(x, TargetWord, duplicates.ok = TRUE))

As you can see, the result is the same as before, and this approach will probably also work with a huge data frame:

     word_1 word_2 word_3 word_4 word_5
[1,]      1     NA      2      2     NA
[2,]     NA      4     NA     NA      4
[3,]     NA      3     NA     NA      3
[4,]     NA      4     NA     NA      4
[5,]     NA     NA     NA     NA     NA
[6,]     NA      4     NA     NA      4

I tried with:

word_1 <- rep(c("conflict","", "resolved", "", "", ""),1000)
word_2 <- rep(c("", "one", "tricky", "one", "", "one"),1000)
word_3 <- rep(c("thanks","", "", "comments", "par",""),1000)
word_4 <- rep(c("thanks","", "", "comments", "par",""),1000)
word_5 <- rep(c("", "one", "tricky", "one", "", "one"),1000)

With the rest of the code unchanged, it worked.
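
If only whole-word matches are needed, base R's `match()` may be a simpler alternative. Note that this swaps in a different technique: `match()` does exact matching only, so it is not a drop-in replacement when partial matches matter. The sketch below reuses three columns of the small example from the question; `target` and `exact` are illustrative names.

```r
# Small example taken from the question
word_1 <- c("conflict", "", "resolved", "", "", "")
word_2 <- c("", "one", "tricky", "one", "", "one")
word_3 <- c("thanks", "", "", "comments", "par", "")
dt <- as.matrix(data.frame(word_1, word_2, word_3, stringsAsFactors = FALSE))
target <- c("conflict", "thanks", "tricky", "one", "two", "three")

# match() is vectorised over all cells but drops the matrix shape,
# so restore the dimensions and column names afterwards
exact <- match(dt, target)
dim(exact) <- dim(dt)
dimnames(exact) <- dimnames(dt)
print(exact)
```

Empty strings come back as NA here because "" does not occur in `target`.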