I am using R and writing a script that checks whether any of ~2000 words occurs in each row of a 4 million observation data file. The data set with observations (df) contains two columns, one with text (df$lead_paragraph), and one with a date (df$date).
Using the following, I can check whether any of the words in a list (p) occur in each row of the lead_paragraph column of df, and output the answer (1 or 0) as a new column.
df$pcount<-((rowSums(sapply(p, grepl, df$lead_paragraph,
ignore.case=TRUE) == TRUE, na.rm=T) > 0) * 1)
However, if I include too many words in the list p, running the code crashes R.
My alternate strategy is to simply break this into pieces, but I was wondering if there is a better, more elegant coding solution to use here. My inclination is to use a for loop, but everything I am reading suggests this is not preferred in R. I am pretty new to R and not a very good coder, so my apologies if this is not clear.
df$pcount1<-((rowSums(sapply(p[1:100], grepl, df$lead_paragraph,
ignore.case=TRUE) == TRUE, na.rm=T) > 0) * 1)
df$pcount2<-((rowSums(sapply(p[101:200], grepl, df$lead_paragraph,
ignore.case=TRUE) == TRUE, na.rm=T) > 0) * 1)
...
df$pcount22<-((rowSums(sapply(p[2101:2200], grepl, df$lead_paragraph,
ignore.case=TRUE) == TRUE, na.rm=T) > 0) * 1)
I didn't complete this, but it should point you in the right direction. It's faster using the data.table package, but hopefully this gives you an idea of the process.
I recreated your dataset using random dates and strings extracted from http://www.norvig.com/big.txt into a data.frame named nrv_df.
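A rough sketch of what such a stand-in dataset could look like. The vocabulary, the row count, and the date range here are made up for illustration (the original drew its strings from big.txt); only the column names come from the question.

```r
# Hypothetical stand-in for the real data: a handful of rows instead of 4 million.
set.seed(1)

# Assumed toy vocabulary; the original sampled text from big.txt instead.
vocab <- c("the", "market", "election", "storm", "river", "budget", "court", "game")
n <- 6

nrv_df <- data.frame(
  # Each "paragraph" is five randomly chosen words pasted together.
  lead_paragraph = vapply(
    seq_len(n),
    function(i) paste(sample(vocab, 5, replace = TRUE), collapse = " "),
    character(1)
  ),
  # Random dates within roughly ten years of an arbitrary start date.
  date = as.Date("2000-01-01") + sample(0:3650, n, replace = TRUE),
  stringsAsFactors = FALSE
)

str(nrv_df)
```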
Then, to leverage the stringi package and use a regex that matches only complete words, I joined the strings in the vector p and collapsed them with a |, so that we are looking for any of the words with a word boundary before or after.
And then simply count the words, and you could do nrv_df$counts <- to add this as a column.
EDIT:
Since it's of no consequence to find the number of matches... First, a function to do the work on each paragraph and detect whether any of the strings in p2 exist in the body of lead_paragraph.
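The idea of one combined pattern can be sketched as follows. This version uses base R's grepl and gregexpr rather than stringi, so it runs without extra packages; the word list and sample text are invented, and it assumes the words contain no regex metacharacters (they would need escaping otherwise).

```r
# Hypothetical word list and text; in the question these would be p and
# df$lead_paragraph.
p <- c("storm", "court", "cat")
lead <- c("Storm after storm hit the court.",
          "Scattered reports came in.",
          "The COURT ruled today.",
          NA)

# One alternation wrapped in word boundaries: \b(storm|court|cat)\b
pattern <- paste0("\\b(", paste(p, collapse = "|"), ")\\b")

# 1 if any word occurs in the row, 0 otherwise (grepl returns FALSE for NA).
found <- grepl(pattern, lead, ignore.case = TRUE, perl = TRUE)
pcount <- as.integer(found)

# Number of matches per row, if the count is wanted after all.
# gregexpr returns -1 for "no match", so only positive positions are counted.
matches <- gregexpr(pattern, lead, ignore.case = TRUE, perl = TRUE)
counts <- vapply(matches, function(m) sum(m > 0, na.rm = TRUE), integer(1))
```

Because the text is scanned once with a single pattern, this avoids building the rows-by-words logical matrix that sapply(p, grepl, ...) creates, which is a likely reason the original approach runs out of memory.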
Now, using the parallel library on Linux, and only testing 1000 rows since it's an example, gives us:
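A minimal sketch of how that parallel step might be set up, assuming a helper named detect_any (a hypothetical name) and invented sample text. It uses mclapply, which forks and so only runs in parallel on Unix-like systems; on Windows the sketch falls back to a single core.

```r
library(parallel)

# Hypothetical helper: 1 if any word in the pattern occurs in the text, else 0.
detect_any <- function(text, pattern) {
  as.integer(grepl(pattern, text, ignore.case = TRUE, perl = TRUE))
}

# Assumed word list and 1000 rows of made-up text.
p2 <- c("storm", "court")
pattern <- paste0("\\b(", paste(p2, collapse = "|"), ")\\b")
lead <- rep(c("The storm hit.", "Nothing here.", "Court is in session."),
            length.out = 1000)

# mclapply cannot use more than one core on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Split the row indices into contiguous chunks, so each worker handles a
# block of rows and the results can be concatenated back in order.
idx <- seq_along(lead)
chunks <- split(idx, cut(idx, n_cores * 4, labels = FALSE))

res <- mclapply(chunks,
                function(i) detect_any(lead[i], pattern),
                mc.cores = n_cores)
pcount <- unlist(res, use.names = FALSE)
```

Working on chunks of rows rather than chunks of words keeps every intermediate result the size of one chunk, so the same pattern could be applied to the full 4 million rows piece by piece.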