How to reshape Quanteda corpus by values of a docvar?


I'm working with a text corpus in R using the quanteda package. Suppose this corpus contains some texts that are split into sentences. Using corpus_reshape(), it is easy enough, in theory, to switch between sentences and whole documents as the unit of analysis. However, what if I want to reshape my corpus depending on the values of a specific variable in my docvars?

# Load quanteda

library(quanteda)

Package version: 3.3.1
Unicode version: 13.0
ICU version: 69.1
Parallel computing: 8 of 8 threads used.

# Create a simulated corpus
texts <- c(
  "Document one text. It has several sentences. Here is another sentence.",
  "Document two is slightly longer. It has more sentences. This is the third sentence. And here is the fourth."
)

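# Create a dummy variable with one value per sentence (3 + 4 = 7 sentences)
dummy_df <- data.frame(dummy_var = c(1,0,1,0,1,0,1))

# Create the corpus and reshape it to sentence level
# (nested calls, so no pipe operator needs to be loaded)
my_corpus <- corpus_reshape(corpus(texts), to = "sentences")

# Assign the docvars
docvars(my_corpus) <- dummy_df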

# Reshape to document parts based on dummy_var?
...

The desired output would be a new corpus where each document is split into two parts based on the dummy variable, resulting in a total of 4 documents in this case.

Could someone suggest an efficient way to do this in quanteda?

My idea is to split the documents into "halves", tokenize them, and build a dfm to prepare them for scaling (e.g. Wordfish), to see whether the group a sentence falls into makes any difference. Specifically, my question is: is there more left-right variance among sentences with dummy_var == 0 than among those with dummy_var == 1?

Please let me know if there are any flaws in that approach.

Dr. Fabian Habersack

I think I found a solution to my problem using a combination of tidyverse and quanteda. Here's what it looks like:

library(tidyverse)

docvars(my_corpus) <- docvars(my_corpus) %>%
  mutate(index = paste0(docid(my_corpus), "_", dummy_var))

corpus_group(my_corpus, index)

I'm still not entirely sure, however, whether my overall approach is correct.
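
For anyone who wants to run the whole thing end to end, here is a minimal sketch that rebuilds the toy corpus, groups it by document and dummy value, and carries on to the tokenize > dfm step mentioned in the question. The object names sent_corpus, grouped and grouped_dfm are made up for illustration, and it uses base R instead of tidyverse for the index column. It assumes quanteda 3.x, where corpus_group() is available.

```r
library(quanteda)

texts <- c(
  "Document one text. It has several sentences. Here is another sentence.",
  "Document two is slightly longer. It has more sentences. This is the third sentence. And here is the fourth."
)

# Sentence-level corpus: 3 sentences from the first text, 4 from the second
sent_corpus <- corpus_reshape(corpus(texts), to = "sentences")

# One dummy value per sentence (this errors if the sentence count is not 7)
docvars(sent_corpus, "dummy_var") <- c(1, 0, 1, 0, 1, 0, 1)

# Grouping key: original document id plus dummy value, e.g. "text1_0"
docvars(sent_corpus, "index") <- paste0(
  docid(sent_corpus), "_", docvars(sent_corpus, "dummy_var")
)

# Concatenate the sentences that share a key into one document each
grouped <- corpus_group(sent_corpus, groups = index)
ndoc(grouped)      # 4: two parts for each of the two original texts
docnames(grouped)

# Continue with the tokenize > dfm step on the grouped documents
grouped_dfm <- dfm(tokens(grouped, remove_punct = TRUE))
```

The resulting dfm has one row per document-by-dummy combination, which is the shape a scaling model expects. Wordfish itself is not part of quanteda proper; textmodel_wordfish() lives in the quanteda.textmodels package, and with only four tiny documents this toy example is not something to scale meaningfully.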