I'm performing text analysis using the quanteda package in R.
I have a set of text documents that I have already tokenized. Each consists of a different number of tokens. I want to split each document's tokens into N equal chunks (e.g. 10 or 20 chunks containing an equal number of tokens for each text).
Assume my data is called text_docs and looks as follows:
Text | Tokens
Text1 | "this" "is" "an" "example" "this" "is" "an" "example"
Text2 | "this" "is" "an" "example"
Text3 | "this" "is" "an" "example" "this" "is" "an" "example" "this" "is" "an" "example"
The result I would like to get should look like this (with two chunks instead of twenty):
Text | Chunk1 | Chunk2
Text1 | "this" "is" "an" "example" | "this" "is" "an" "example"
Text2 | "this" "is" | "an" "example"
Text3 | "this" "is" "an" "example" "this" "is" | "an" "example" "this" "is" "an" "example"
I'm aware of the tokens_chunk function in quanteda. Yet, that function only lets me create chunks of a fixed size (e.g. every chunk consists of two tokens), which leaves me with a different number of chunks per text, as the sketch below illustrates. Furthermore, the size argument of tokens_chunk has to be a single integer, which is why I can't simply do chunks <- tokens_chunk(text_docs, size = ntoken(text_docs)/20).
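For illustration, here is a minimal sketch of that fixed-size behaviour (how text_docs is built from the example texts above is my own assumption):

    library(quanteda)

    # rebuild the example data as a tokens object (assumed construction)
    text_docs <- tokens(c(
      Text1 = "this is an example this is an example",
      Text2 = "this is an example",
      Text3 = "this is an example this is an example this is an example"
    ))

    # fixed-size chunking: every chunk has two tokens, but the documents
    # end up with 4, 2, and 6 chunks respectively
    tokens_chunk(text_docs, size = 2)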
Any idea?
Thank you in advance.
Here's one way to do what you want. We lapply over the docnames to slice out each document, and then split it using tokens_chunk() with a size equal to half of its length. Here, I also use ceiling() so that if a document has an odd number of tokens, its first split will contain one more token than its second. (Your examples are all even-tokened documents, but this handles the odd-tokened case too.) That results in a list of split tokens, and you can recombine them using c(), which concatenates tokens objects; apply it to the whole list with do.call().
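A sketch of that approach, assuming text_docs is the tokens object shown in the question:

    library(quanteda)

    # split each document into two chunks of (near-)equal length
    split_toks <- lapply(
      docnames(text_docs),
      function(d) tokens_chunk(text_docs[d], size = ceiling(ntoken(text_docs[d]) / 2))
    )

    # recombine the per-document results into a single tokens object
    result <- do.call(c, split_toks)
    result

Replacing the 2 with another divisor adapts the same pattern to other chunk counts, though because of the ceiling() rounding some document lengths can yield one chunk fewer than intended, so it is worth checking ndoc() on the result.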