I have a list of texts read into R using the readtext library.
library(readtext)
files <- readtext(paste0(wd, "/r/*.pdf"), ignore_missing_files = FALSE, text_field = "texts")
The 100 PDF files differ in size, ranging from 6,000 to 40,000 words. I need to split each one into cumulative chunks that grow by 1,000 words: the first 1,000 words, then the first 2,000, and so on.
files.ch <- as.character(files$text)
library(stringr)  # str_split() and boundary() are stringr functions
chunking <- function(c){
  words.split <- str_split(c, pattern = boundary(type = "word"))
  chunked <- sapply(seq(1000, length(words.split[[1]]), 1000), lapply, function(x) paste(words.split[[1]][1:x], collapse = " "))
  return(chunked)
}
chunked <- lapply(files.ch, chunking)
Running this gives the following error:
Error in seq.default(1000, length(words.split[[1]]), 1000) :
  wrong sign in 'by' argument
Here the length(words.split[[1]]) argument passed to seq() seems to be the problem. Because the texts are of unequal length, the length of one text won't work for a longer one. So I need to debug this so that the function runs.
I cannot use a fixed value for this argument across all the elements in my list. I need the function to change it according to the index of the element that goes into the function: length(words.split[[1]]) for the first text in the list, length(words.split[[2]]) for the second, and so on. Thanks in advance for your time and help.
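One possible reading of the error, offered as a sketch rather than a confirmed diagnosis: since lapply() passes one text at a time, str_split() returns a list of length one, so words.split[[1]] is already that text's own word vector. seq(1000, n, 1000) stops with "wrong sign in 'by' argument" whenever n is below 1000, which would happen if any PDF yielded fewer than 1,000 tokens (for example, a file whose text failed to extract). The version below guards against that case. It uses base R only; the name chunk_cumulative, the whitespace tokenization, and the choice to append the trailing remainder as a final chunk are illustrative assumptions, not part of the original code.

```r
# Hypothetical helper: cumulative chunks of one text, growing by `size` words.
chunk_cumulative <- function(text, size = 1000) {
  # Split on runs of whitespace (a simple stand-in for a word-boundary tokenizer).
  words <- strsplit(trimws(text), "\\s+")[[1]]
  n <- length(words)
  # Texts no longer than one chunk are returned whole instead of being passed
  # to seq(), which would stop with "wrong sign in 'by' argument".
  if (n <= size) return(paste(words, collapse = " "))
  ends <- seq(size, n, by = size)
  # Assumption: keep the trailing remainder by adding the full length as a last end point.
  if (tail(ends, 1) < n) ends <- c(ends, n)
  vapply(ends, function(e) paste(words[seq_len(e)], collapse = " "), character(1))
}

# Toy input with a chunk size of 3 instead of 1000.
texts <- c("a b c d e f g", "x y")
chunked <- lapply(texts, chunk_cumulative, size = 3)
```

With the real data the call would be lapply(files.ch, chunk_cumulative). Checking which(lengths(strsplit(files.ch, "\\s+")) < 1000) beforehand would show whether any text is in fact shorter than one chunk.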