I'm working with a text corpus in R using the quanteda package. Suppose this corpus contains some texts that are split into sentences. Using corpus_reshape() it is easy enough, in theory, to switch between sentences and actual documents as the unit of analysis. However, what if I want to reshape my corpus depending on the values of a specific variable in my docvars?
# Load quanteda
library(quanteda)
Package version: 3.3.1
Unicode version: 13.0
ICU version: 69.1
Parallel computing: 8 of 8 threads used.
# Create a simulated corpus
texts <- c(
"Document one text. It has several sentences. Here is another sentence.",
"Document two is slightly longer. It has more sentences. This is the third sentence. And here is the fourth."
)
# Create a dummy variable across sentences (one value per sentence, 7 in total)
docvars <- data.frame(dummy_var = c(1, 0, 1, 0, 1, 0, 1))
# Create the corpus
my_corpus <- corpus(texts) %>% corpus_reshape(to = "sentences")
# Define docvars
docvars(my_corpus) <- docvars
# Reshape to document parts based on dummy_var?
...
The desired output would be a new corpus where each document is split into two parts based on the dummy variable, resulting in a total of 4 documents in this case.
Could someone suggest an efficient way to do this in quanteda?
My idea is to split the documents into "halves", then tokenize and build a dfm to prepare them for scaling (e.g. Wordfish), to see whether the group the sentences fall into makes any difference. Specifically, my question is: Is there more left-right variance among the dummy_var == 0 halves than among the dummy_var == 1 halves?
Please let me know if there are any flaws in that approach.
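As a sketch of that downstream pipeline (tokens > dfm > Wordfish): textmodel_wordfish() lives in the separate quanteda.textmodels package, the four short texts below are invented stand-ins for the document "halves", and the tapply() line at the end is just one way to phrase the variance-by-group question.

```r
library(quanteda)
library(quanteda.textmodels)  # provides textmodel_wordfish()

# Toy corpus standing in for the four document "halves"
halves <- corpus(
  c(h1 = "tax cuts and deregulation now",
    h2 = "public spending on welfare and health",
    h3 = "lower taxes for business growth",
    h4 = "expand social welfare programmes"),
  docvars = data.frame(dummy_var = c(0, 1, 0, 1))
)

# Tokenize and build the document-feature matrix
dfmat <- halves %>%
  tokens(remove_punct = TRUE) %>%
  dfm()

# Fit the Wordfish scaling model
wf <- textmodel_wordfish(dfmat)

# Compare the spread of estimated positions (theta) by group
tapply(wf$theta, docvars(dfmat, "dummy_var"), var)
```

With only a handful of short documents the estimates are of course meaningless; the point is just the shape of the pipeline and where the group comparison would happen.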
I think I found a solution to my problem using a combination of tidyverse and quanteda. Here's what it looks like:
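One way such a dplyr + quanteda combination might look (a sketch; the variable names `sent_df` and `halves` are mine): convert the sentence-level corpus to a data frame, keep the original document id via docid(), then paste the sentences back together per (document, dummy_var) pair.

```r
library(quanteda)
library(dplyr)

# Rebuild the sentence-level corpus from the question
my_corpus <- corpus(c(
  "Document one text. It has several sentences. Here is another sentence.",
  "Document two is slightly longer. It has more sentences. This is the third sentence. And here is the fourth."
)) %>%
  corpus_reshape(to = "sentences")
docvars(my_corpus, "dummy_var") <- c(1, 0, 1, 0, 1, 0, 1)

# One row per sentence, keeping the original (pre-reshape) document id
sent_df <- data.frame(
  doc       = as.character(docid(my_corpus)),
  text      = as.character(my_corpus),
  dummy_var = docvars(my_corpus, "dummy_var")
)

# Collapse sentences into one text per (document, dummy) pair
halves <- sent_df %>%
  group_by(doc, dummy_var) %>%
  summarise(text = paste(text, collapse = " "), .groups = "drop")

new_corpus <- corpus(halves, text_field = "text")
ndoc(new_corpus)  # 4 documents, two "halves" per original document
```

A pure-quanteda alternative worth noting is corpus_group(), which concatenates texts by group in one call, e.g. corpus_group(my_corpus, groups = interaction(docid(my_corpus), dummy_var)).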
I'm still not entirely sure, however, whether my approach is correct overall.