Extract a 100-Character Window around Keywords in Text Data with R (Quanteda or Tidytext Packages)


This is my first time asking a question on here, so I hope I don't miss any crucial parts. I want to perform sentiment analysis on windows of speeches around certain keywords. My dataset is a large csv file containing a number of speeches, but I'm only interested in the sentiment of the words immediately surrounding certain keywords.

I was told that the quanteda package in R would likely be my best bet for finding such a function, but I've been unsuccessful in locating it so far. If anyone knows how to do this, it would be greatly appreciated!

Reprex (I hope?) below:

speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stackoverflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stackoverflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")

data <- data.frame(id=1:3, 
                   speechContent = speech)

2 Answers

Best answer:

I'd suggest using tokens_select() with the window argument set to a range of tokens surrounding your target terms.

To take your example, if "stackoverflow" is the target term, and you want to measure sentiment in the +/- 10 tokens around that, then this would work:

library("quanteda")
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.

## [data-creation code from the question above]

corp <- corpus(data, text_field = "speechContent")

toks <- tokens(corp) %>%
  tokens_select("stackoverflow", window = 10)
toks
## Tokens consisting of 3 documents and 1 docvar.
## text1 :
##  [1] "One"           "relevant"      "word"          ","            
##  [5] "for"           "example"       ","             "is"           
##  [9] "the"           "word"          "stackoverflow" "."            
## [ ... and 9 more ]
## 
## text2 :
##  [1] "word"          "of"            "interest"      ","            
##  [5] "but"           "at"            "the"           "very"         
##  [9] "end"           "."             "stackoverflow" "."            
## 
## text3 :
## character(0)

There are many ways to compute sentiment from this point. An easy one is to apply a sentiment dictionary, e.g.

tokens_lookup(toks, data_dictionary_LSD2015) %>%
  dfm()
## Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 1 docvar.
##        features
## docs    negative positive neg_positive neg_negative
##   text1        0        1            0            0
##   text2        0        0            0            0
##   text3        0        0            0            0
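
If you want a single score per document from that dfm, one minimal sketch (my own addition, not part of the answer) is to convert the dictionary counts to a data frame and take positive minus negative:

## Sketch only: assumes `toks` from the code above and quanteda's
## built-in data_dictionary_LSD2015. Net score = positive - negative counts.
sent_df <- tokens_lookup(toks, data_dictionary_LSD2015) %>%
  dfm() %>%
  convert(to = "data.frame")
sent_df$net_sentiment <- sent_df$positive - sent_df$negative
sent_df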
Second answer:

Using quanteda:

library(quanteda)
corp <- corpus(data, docid_field = "id", text_field = "speechContent")

x <- kwic(tokens(corp, remove_punct = TRUE), 
          pattern = "stackoverflow",
          window = 3
          )

x
Keyword-in-context with 2 matches.                                                         
 [1, 29]  is the word | stackoverflow | However there are
 [2, 24] the very end | stackoverflow |  

as.data.frame(x)

  docname from to          pre       keyword              post       pattern
1       1   29 29  is the word stackoverflow However there are stackoverflow
2       2   24 24 the very end stackoverflow                   stackoverflow

Now read the help for kwic (run ?kwic in the console) to see what kinds of patterns you can use; a small illustration follows below. With tokens() you can specify which data-cleaning steps you want to apply before using kwic(); in my example I removed the punctuation.
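
As a quick illustration (my own, not from the original answer), kwic() also accepts multiple patterns, and glob-style wildcards work through the default valuetype = "glob":

## Illustrative only: two patterns, one of them a glob wildcard.
kwic(tokens(corp, remove_punct = TRUE),
     pattern = c("stackoverflow", "speech*"),
     window = 3)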

The end result is a data frame with the window before and after the keyword(s), in this example a window of length 3. After that you can run some form of sentiment analysis on the pre and post columns (or paste them together first).
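
As one possible sketch of that last step (my own addition, not from the original answer), you could paste pre and post together and score the windows with tidytext, assuming the "bing" lexicon that ships with that package:

library(dplyr)
library(tidytext)

## Sketch only: paste the pre and post windows, tokenize with tidytext,
## join the "bing" lexicon, and count sentiment words per document.
windows <- as.data.frame(x)
windows$window_text <- paste(windows$pre, windows$post)

windows %>%
  unnest_tokens(word, window_text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(docname, sentiment)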