I would like to assign values to a vector within a foreach loop and then add that vector as a row to a data frame. It looks like I will need a .combine function, but I'm not sure which one (my best guess is sketched after the desired output below).
My specific example involves counting the lines in each of several files, then moving any file with fewer lines than a given cutoff into a separate directory.
Take 5 files in the raw directory: foo1, foo2, foo3, foo4, foo5. Let's pretend that foo1 has 100 lines and the others have 1e6 lines each. With 5 cores, ideally each file would be counted by one worker.
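For a reproducible setup, the test files can be created like this (the line counts match the example above):

# Create the example input: raw/foo1 with 100 lines, raw/foo2-foo5 with 1e6 lines each
dir.create("raw", showWarnings = FALSE)
writeLines(rep("x", 100), "raw/foo1")
for (f in c("foo2", "foo3", "foo4", "foo5")) {
  writeLines(rep("x", 1e6), file.path("raw", f))
}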
library(parallel)
library(foreach)
# Create cluster
mycluster <- makePSOCKcluster(names = 5, outfile = "")
doParallel::registerDoParallel(cl = mycluster)
# Set cut off for number of lines
line.cutoff <- 1e5
# list the files to count
files.to.count <- list.files("raw")
# Make directory to move files into if they are < line.cutoff
system(paste("mkdir files-with-less-than", line.cutoff,"-lines", sep = ""))
# Set up the final df with the same column names the loop adds, so rbind works
lines.and.names <- data.frame(name = character(0), lines.in.file = numeric(0))
# Loop over each file, but in parallel
foreach(i = 1:length(files.to.count)) %dopar% {
  file <- files.to.count[i]
  # Build the full path, since list.files() returns names without the raw/ prefix
  path <- file.path("raw", file)
  # Count the lines ("wc -l < path" prints just the number, not the filename)
  lines <- as.numeric(system(paste("wc -l < ", path, sep = ""), intern = TRUE))
  # Move the file out of raw/ if it has fewer lines than line.cutoff
  if(lines < line.cutoff){
    system(paste("mv ", path, " files-with-less-than-", line.cutoff, "-lines", sep = ""))
  }
  # Add the name and line count to the data frame - I think this is where
  # I need a .combine function instead of rbind-ing inside the loop
  lines.and.names <- rbind(lines.and.names, data.frame(name = file, lines.in.file = lines))
}
# Write the final tsv of the line numbers and filenames
readr::write_tsv(lines.and.names, "lines_and_names.txt")
stopCluster(mycluster)
The desired result would be foo1 moved to the directory files-with-less-than-1e+05-lines, the rest of the files left in raw, and a file called lines_and_names.txt with this text:
name lines.in.file
foo1 100
foo2 1000000
foo3 1000000
foo4 1000000
foo5 1000000
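My best guess is that I need something like .combine = rbind and that each iteration should return a one-row data frame instead of rbind-ing into lines.and.names inside the loop, but I'm not sure whether that is the right combine function or the right pattern:

# Guess: let foreach collect the per-file rows via .combine = rbind
lines.and.names <- foreach(i = 1:length(files.to.count), .combine = rbind) %dopar% {
  file <- files.to.count[i]
  path <- file.path("raw", file)
  lines <- as.numeric(system(paste("wc -l < ", path, sep = ""), intern = TRUE))
  if(lines < line.cutoff){
    system(paste("mv ", path, " files-with-less-than-", line.cutoff, "-lines", sep = ""))
  }
  # The last expression is the value handed to .combine
  data.frame(name = file, lines.in.file = lines)
}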
If there are easier ways to parallelize this command, I'm open to suggestions there too.
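For example, I wondered whether skipping foreach entirely and using parallel::mclapply (which, as far as I know, only forks on Unix-like systems) would be simpler, along these lines:

# Possible alternative: mclapply over the files, then bind the per-file rows
results <- parallel::mclapply(files.to.count, function(file) {
  path <- file.path("raw", file)
  lines <- as.numeric(system(paste("wc -l < ", path, sep = ""), intern = TRUE))
  if(lines < line.cutoff){
    system(paste("mv ", path, " files-with-less-than-", line.cutoff, "-lines", sep = ""))
  }
  data.frame(name = file, lines.in.file = lines)
}, mc.cores = 5)
lines.and.names <- do.call(rbind, results)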