I am trying to replicate this paper
In the tokens.R script, the corpus is cleaned up with the following command:
texts(corp) <- stri_replace_all_regex(texts(corp), "^[\\p{Lu}\\p{Z}]+(.{0,30}?)(\\(.{0,50}?\\))?(--)", "")
Which yields the following error message:
Error in attributes(.Data) <- c(attributes(.Data), attrib) :
'names' attribute [387896] must be the same length as the vector [4]
In addition: Warning message:
'texts.corpus' is deprecated.
Use 'as.character' instead.
See help("Deprecated")
So I naively apply the 'as.character' function like this:
as.character(corp) <- stri_replace_all_regex(as.character(corp), "^[\\p{Lu}\\p{Z}]+(.{0,30}?)(\\(.{0,50}?\\))?(--)", "")
Which yields the same error:
Error in attributes(.Data) <- c(attributes(.Data), attrib) :
'names' attribute [387896] must be the same length as the vector [4]
I tried some other things, like only addressing $documents within the corpus or turning the corpus into a vector, but none of that really worked.
How can I get around this?
Thank you in advance.
The "corpus" being loaded in the linked .R file tokens.R is a very old-format corpus object (from data/corpus_nytimes_summary.RDS). You can convert this into a new-format corpus using:
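A minimal sketch of that conversion, assuming quanteda v3 or later and that the old object is read in the same way as tokens.R does:

```r
library("quanteda")

# Read the old-format corpus object saved with the paper's data, then
# rebuild it: corpus() applied to an existing corpus object upgrades
# a pre-v2 corpus to the current format
corp <- readRDS("data/corpus_nytimes_summary.RDS")
corp <- corpus(corp)
```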
Then replace the texts using this approach:
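A sketch of that replacement, using the same stringi pattern as the question (the toy one-document corpus here is an assumption standing in for the converted corp):

```r
library("quanteda")
library("stringi")

# Toy corpus standing in for the converted corp from the paper's data
corp <- corpus(c(doc1 = "WASHINGTON (AP) -- Story text here."))

# Assign into corp[] so that only the underlying character vector is
# replaced; the attributes (docvars, metadata) stay attached
corp[] <- stri_replace_all_regex(
  as.character(corp),
  "^[\\p{Lu}\\p{Z}]+(.{0,30}?)(\\(.{0,50}?\\))?(--)",
  ""
)
```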
The use of corp[] replaces the character part of corp without stripping the additional attributes (metadata and docvars) that make the character object corp a quanteda corpus.