Deleting documents from corpus using docvars in R (quanteda package)

I would like to know whether it is possible to delete documents from a corpus when their text is in fact empty. I am building a corpus of texts in order to subsequently run some textmodels using the quanteda package in R. The texts are in a column of a csv file and are imported as follows:

> mycorpus<-corpus(readtext("tablewithdocuments.csv",text_field="textcolumn"))
> mycorpus
Corpus consisting of 25 documents and 14 docvars.

I know how to erase empty texts from the dfm of the corpus, but I want to have a new corpus which is a subset of the original one excluding documents with a missing cell in the csv column "textcolumn".

In practice, starting from something like the following corpus:

library("quanteda")

text <- c(
  doc1 = "",
  doc2 = "pineapples and pizzas taste good",
  doc3 = "but please do not mix them together"
)
mycorpus <- corpus(text)

mycorpus
## Corpus consisting of 3 documents and 0 docvars.

summary(mycorpus)
## Corpus consisting of 3 documents:
##  Text Types Tokens Sentences
##  doc1     0      0         0
##  doc2     5      5         1
##  doc3     7      7         1

I would like to obtain a new corpus with only doc2 and doc3 in it.
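
For reference, here is a minimal sketch of one way this might be done. It assumes quanteda version 3 or later, where as.character() returns the texts of a corpus and corpus_subset() accepts a logical vector with one value per document. The names keep and nonempty are only illustrative, and treating whitespace-only texts as empty is an assumption of the sketch, not a requirement.

```r
library("quanteda")

# Rebuild the three-document example corpus (doc1 is empty).
text <- c(
  doc1 = "",
  doc2 = "pineapples and pizzas taste good",
  doc3 = "but please do not mix them together"
)
mycorpus <- corpus(text)

# Flag documents whose text is non-empty after trimming whitespace.
# (Assumption: whitespace-only texts should count as empty too.)
keep <- nchar(trimws(as.character(mycorpus))) > 0

# Keep only the flagged documents; any docvars stay attached to them.
nonempty <- corpus_subset(mycorpus, keep)

ndoc(nonempty)
docnames(nonempty)
```

With the example above, nonempty should contain only doc2 and doc3. The same two lines could presumably be applied to a corpus built from the csv file, since the check looks only at the texts and not at the source of the data.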

Thank you in advance for your help.

Best wishes,

Michele