Count if a word occurs in each row of a 4 million observation data set


I am using R and writing a script that flags whether any of ~2000 words occurs in each row of a 4-million-observation data set. The data set (df) contains two columns, one with text (df$lead_paragraph) and one with a date (df$date).

Using the following, I can flag whether any of the words in a list (p) occurs in each row of the lead_paragraph column of df, and write the answer to a new column:

    df$pcount <- ((rowSums(sapply(p, grepl, df$lead_paragraph,
                                  ignore.case = TRUE) == TRUE, na.rm = T) > 0) * 1)

However, if I include too many words in the list p, running the code crashes R.

My alternate strategy is simply to break this into pieces, but I was wondering if there is a better, more elegant coding solution. My inclination is to use a for loop (a sketch of that loop follows the code below), but everything I am reading suggests loops are not preferred in R. I am pretty new to R and not a very good coder, so my apologies if this is not clear.

    df$pcount1 <- ((rowSums(sapply(p[1:100], grepl, df$lead_paragraph,
                                   ignore.case = TRUE) == TRUE, na.rm = T) > 0) * 1)
    df$pcount2 <- ((rowSums(sapply(p[101:200], grepl, df$lead_paragraph,
                                   ignore.case = TRUE) == TRUE, na.rm = T) > 0) * 1)
    ...
    df$pcount22 <- ((rowSums(sapply(p[2101:2200], grepl, df$lead_paragraph,
                                    ignore.case = TRUE) == TRUE, na.rm = T) > 0) * 1)
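
The loop sketched here is a minimal version of the same chunking; it assumes (as the pattern matching above implies) that the entries of p are plain words containing no regex metacharacters, so each chunk can be collapsed into a single alternation pattern:

    # split p into chunks of 100 and OR the per-chunk matches together,
    # so only one chunk's worth of match results is held in memory at a time
    chunks <- split(p, ceiling(seq_along(p) / 100))
    hit <- rep(FALSE, nrow(df))
    for (ch in chunks) {
        pattern <- paste(ch, collapse = "|")   # one regex per chunk
        hit <- hit | grepl(pattern, df$lead_paragraph, ignore.case = TRUE)
    }
    df$pcount <- as.integer(hit)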
2 Answers

Accepted Answer

I didn't complete this... but it should point you in the right direction. It would be faster with the data.table package, but hopefully this gives you an idea of the process.

I recreated your data set using random dates and strings extracted from http://www.norvig.com/big.txt, stored in a data.frame named nrv_df:

library(stringi)

> head(nrv_df)
                                                             lead_para       date
1     The Project Gutenberg EBook of The Adventures of Sherlock Holmes 2018-11-16
2                                            by Sir Arthur Conan Doyle 2019-06-05
3                           15 in our series by Sir Arthur Conan Doyle 2017-08-08
4  Copyright laws are changing all over the world Be sure to check the 2014-12-17
5 copyright laws for your country before downloading or redistributing 2016-09-13
6                            this or any other Project Gutenberg eBook 2015-06-15

> dim(nrv_df)
[1] 103598      2

I then randomly sampled words from the entire body to get 2000 unique words:
> length(p)
[1] 2000
> head(p)
[1] "The"        "Project"    "Gutenberg"  "EBook"      "of"         "Adventures"
> tail(p)
[1] "accomplice" "engaged"    "guessed"    "row"        "moist"      "red"   

Then, to leverage the stringi package and use a regex that matches only whole words, I wrapped each of the strings in vector p in word boundaries (\b) and collapsed them with |, so that we are looking for any of the words with a word boundary before and after:

> p_join2 <- stri_join(sprintf("\\b%s\\b", p), collapse = "|")
> p_join2

[1] "\\bThe\\b|\\bProject\\b|\\bGutenberg\\b|\\bEBook\\b|\\bof\\b|\\bAdventures\\b|\\bSherlock\\b|\\bHolmes\\b|\\bby\\b|\\bSir\\b|\\bArthur\\b|\\bConan\\b|\\bDoyle\\b|\\b15\\b|\\bin\\b|\\bour\\b|\\bseries\\b|\\bCopyright\\b|\\blaws\\b|\\bare\\b|\\bchanging\\b|\\ball\\b|\\bover\\b|\\bthe\\b|\\bworld\\b|\\bBe\\b|\\bsure\\b|\\bto\\b|\\bcheck\\b|\\bcopyright\\b|\\bfor\\b|\\byour\\b|\\bcountry\\b|..."

Then simply count the matches; you could assign the result with nrv_df$counts <- to add it as a column:

> stri_count_regex(nrv_df$lead_para[25000:26000], p_join2, stri_opts_regex(case_insensitive = TRUE))
[1] 12 11  8 13  7  7  6  7  6  8 12  1  6  7  8  3  5  3  5  5  5  4  7  5  5  5  5  5 10  2  8 13  5  8  9  7  6  5  7  5  9  8  7  5  7  8  5  6  0  8  6
[52]  3  4  0 10  7  9  8  4  6  8  8  7  6  6  6  0  3  5  4  7  6  5  7 10  8 10 10 11
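
Since the question actually asks for a 0/1 indicator rather than a count, here is a small follow-up sketch using stri_detect_regex with the same pattern and options:

# TRUE/FALSE per paragraph, converted to the 0/1 column requested in the question
nrv_df$pcount <- as.integer(
    stri_detect_regex(nrv_df$lead_para, p_join2,
                      opts_regex = stri_opts_regex(case_insensitive = TRUE))
)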

EDIT:

Since the number of matches is of no consequence, only whether any word matches: first, a function that works on a single paragraph and detects whether any of the strings in p occurs in the body of lead_para:

f <- function(i, j){
    # return 1 if any of the fixed strings in j occurs in paragraph i, else 0
    if(any(stri_detect_fixed(i, j))){
        1
    } else {
        0
    }
}

Now, using the parallel package on Linux, and testing only 1000 rows since this is an example:

library(parallel)
library(stringi)
> rst <- mcmapply(function(x) f(i = x, j = p),
                  nrv_df$lead_para[1:1000],
                  mc.cores = detectCores() - 2,
                  USE.NAMES = FALSE)
> rst
   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  [70] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [139] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
 [208] 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [277] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [346] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 0 1 1 1 1
 [415] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [484] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [553] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [622] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [691] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [760] 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [829] 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [898] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
 [967] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
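
To run this over the full data set and attach the result as the indicator column from the question, the same call can be assigned directly. A sketch, assuming your real df and word list p; runtime will scale with the number of rows and cores:

# apply f to every paragraph in parallel and store the 0/1 result
df$pcount <- mcmapply(f, df$lead_paragraph,
                      MoreArgs = list(j = p),
                      mc.cores = detectCores() - 2,
                      USE.NAMES = FALSE)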
Second Answer

This also works:

library(corpus)

# simulate the problem as in @carl-boneri's answer
lead_para <- readLines("http://www.norvig.com/big.txt")

# get a random sample of 2000 word types
types <- text_types(lead_para, collapse = TRUE)
p <- sample(types, 2000)

# find whether each entry has at least one of the terms in `p`
ix <- text_detect(lead_para, p)
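
text_detect returns a logical vector with one entry per text, so the 0/1 column the question asks for is one conversion away:

# convert the logical detections to the question's 0/1 indicator
pcount <- as.integer(ix)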

Even using only a single core, it's over 20 times faster than the previous solution:

system.time(ix <- text_detect(lead_para, p))
##  user  system elapsed 
## 0.231   0.008   0.240

system.time(rst <- mcmapply(function(x) f(i = x, j = p),
                            lead_para, mc.cores = detectCores() - 2,
                            USE.NAMES = FALSE))
##   user  system elapsed 
## 11.604   0.240   5.805