How can I extract 2-4 words on each side of a specific term in R?


How can I extract 2-4 words on each side of a specific term from a string/corpus in R?

Here is an example:

I would like to extract the 2 words on each side of 'converse'.

txt <- "Socially when people meet they should converse to present their
       views and listen to other people's opinions to enhance their perspective" 

Output should be like:

"they should converse to present"

There are 4 best solutions below

0
On BEST ANSWER

I guess this solves your problem:

/((?:\S+\s){2}converse(?:\s\S+){2})/

Demo: https://regex101.com/r/tS9kB0/1

If you need a different number of words on either side, change the {2} quantifiers before and after the term.
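The pattern above is written in regex101 notation. A minimal sketch of how it might be applied in base R follows, assuming a Perl-compatible regex via `perl = TRUE`. The helper name `words_around` and its parameters are hypothetical, not part of the original answer. Here `txt` is the question's sentence written on one line, and the term is inserted into the pattern as-is, so it is treated as a regular expression.

```r
# Hypothetical helper: return n_before words, the term, and n_after words.
# Note: `term` is pasted into the pattern unescaped (an assumption that it
# contains no regex metacharacters).
words_around <- function(text, term, n_before = 2, n_after = 2) {
  pattern <- sprintf("(?:\\S+\\s+){%d}%s(?:\\s+\\S+){%d}",
                     n_before, term, n_after)
  # regexpr() finds the first match; regmatches() extracts it.
  # If there is no match, the result is character(0).
  regmatches(text, regexpr(pattern, text, perl = TRUE))
}

txt <- "Socially when people meet they should converse to present their views and listen to other people's opinions to enhance their perspective"

words_around(txt, "converse")
# [1] "they should converse to present"

words_around(txt, "converse", 4, 4)
# [1] "people meet they should converse to present their views"
```

Because the quantifiers are exact, this sketch returns nothing when fewer than the requested number of words exist on either side of the term.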

0
On

The qdapRegex package (that I maintain) has a canned regular expression for grabbing words before/after a word and can be used via:

library(qdapRegex)

grab2 <- rm_(pattern=S("@around_", 2, "converse", 2), extract=TRUE)
grab2(txt)

## [[1]]
## [1] "they should converse to present"

To see the regular expression used:

S("@around_", 2, "converse", 2)
[1] "(?:[^[:punct:]|\\s]+\\s+){0,2}(converse)(?:\\s+[^[:punct:]|\\s]+){0,2}"
1
On
sub('.*?(\\w+ \\w+) (converse) (\\w+ \\w+).*', '\\1 \\2 \\3', txt)
[1] "they should converse to present"
0
On

This could be another way using strsplit

sapply(strsplit(txt, ' '), function(x) {
  i <- which(x %in% 'converse')  # position of the term among the words
  paste(x[(i - 2):(i + 2)], collapse = ' ')
})

#[1] "they should converse to present"
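The index arithmetic above fails when the term sits within two words of the start of the string, because the range then mixes negative and positive subscripts. A hedged variant that clamps the window to the bounds of the word vector might look like the sketch below. The function name `words_window` is hypothetical, and splitting on runs of whitespace is an assumption made so that line breaks inside the string do not produce empty words.

```r
# Hypothetical helper: up to n words on each side of every occurrence of term.
words_window <- function(text, term, n = 2) {
  # Split on any run of whitespace (spaces, tabs, newlines).
  words <- strsplit(trimws(text), "\\s+")[[1]]
  hits <- which(words == term)
  vapply(hits, function(i) {
    lo <- max(1, i - n)              # do not run past the first word
    hi <- min(length(words), i + n)  # do not run past the last word
    paste(words[lo:hi], collapse = " ")
  }, character(1))
}

txt <- "Socially when people meet they should converse to present their views and listen to other people's opinions to enhance their perspective"

words_window(txt, "converse")
# [1] "they should converse to present"

words_window("converse with me", "converse")
# [1] "converse with me"
```

Unlike the exact-count regex approaches, this version returns a shorter window near the edges of the text and one result per occurrence of the term.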