How to split texts into chunks of increasing size?


I have a set of texts that I read into R using the readtext package.

files <- readtext(paste0(wd, "/r/*.pdf"), ignore_missing_files = FALSE, text_field = "texts")

The 100 PDF files vary in length from 6,000 to 40,000 words. I need to split each one into cumulative chunks of increasing size: the first 1,000 words, then the first 2,000 words, and so on.

files.ch <- as.character(files$text)
library(stringr)  # str_split() and boundary() come from stringr, not stringi

chunking <- function(c){
  words.split <- str_split(c, pattern = boundary(type = "word"))
  chunked <- sapply(seq(1000, length(words.split[[1]]), 1000), lapply, function(x) paste(words.split[[1]][1:x], collapse = " "))
  return(chunked)
}
chunked <- lapply(files.ch, chunking)

Error in seq.default(1000, length(words.split[[1]]), 1000) : 
  wrong sign in 'by' argument 

The end value passed to seq(), length(words.split[[1]]), seems to be the problem. Because the texts differ in length, the length of one text will not work for a longer one, so I need to debug the function so that it runs. I cannot use a single fixed end value for every element of my list; the function should take it from the element being processed, i.e. length(words.split[[1]]) for the first text, length(words.split[[2]]) for the second, and so on. Thanks in advance for your time and help.
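For reference, a minimal sketch of one way the function could be made robust. Two observations, offered as a likely reading rather than a confirmed diagnosis: inside chunking(), words.split[[1]] already refers to the single text passed in by lapply(), so the index does not need to change per element; and seq(1000, n, 1000) raises "wrong sign in 'by' argument" whenever n is below 1000, which suggests that at least one element has fewer than 1,000 words (for example, a PDF from which little or no text was extracted). The step parameter and the toy texts vector below are hypothetical stand-ins for illustration, and keeping the leftover words in a final chunk is an assumption, since the original code would drop them.

```r
library(stringr)

# Cumulative chunks: words 1..step, 1..2*step, ..., then the full text.
# `step` is a hypothetical parameter; the question uses 1000.
chunking <- function(txt, step = 1000) {
  words <- str_split(txt, boundary("word"))[[1]]
  n <- length(words)
  if (n == 0) return(character(0))
  # seq_len() never counts backwards, so texts shorter than `step`
  # yield zero full-step end points instead of an error.
  ends <- seq_len(n %/% step) * step
  # Assumption: keep the leftover words by ending a last chunk at n.
  if (n %% step != 0) ends <- c(ends, n)
  vapply(ends,
         function(e) paste(words[seq_len(e)], collapse = " "),
         character(1))
}

# Toy stand-in for files.ch, with a small step so the result is easy to inspect.
texts <- c("one two three four five six seven", "one two")
chunked <- lapply(texts, chunking, step = 3)
```

With the real data the call would be lapply(files.ch, chunking). Running which(lengths(str_split(files.ch, boundary("word"))) < 1000) beforehand may help identify the documents that trigger the error.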
