Assigning a Vector and Outputting Dataframe from foreach loop


I would like to assign a value to a vector, then add that vector to a data frame within a foreach loop. It looks like I will need a combine function, but I'm not sure which one.
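From what I can tell, the usual pattern is to have each iteration return a one-row data frame and let foreach stack the results through its `.combine` argument, rather than growing a data frame from inside the loop. A minimal sketch of that idea, assuming toy values in place of real line counts (the names and numbers are placeholders, and `%do%` is used so it runs without a cluster):

```r
library(foreach)

# Each iteration returns a one-row data frame; .combine = rbind stacks them
# into a single data frame that is the value of the whole foreach() call.
# %do% runs sequentially; with a registered backend, %dopar% would run the
# same body in parallel.
result <- foreach(i = 1:3, .combine = rbind) %do% {
  data.frame(name = paste0("foo", i),
             lines.in.file = i * 100,
             stringsAsFactors = FALSE)
}

print(result)
```

The key point is that the result has to be assigned from the `foreach()` call itself, since with `%dopar%` each worker only modifies its own copy of any variable it touches.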

My specific example involves counting the lines of each file, then moving any file whose line count is at or below a certain cutoff.

Take 5 files in the raw directory: foo1, foo2, foo3, foo4, and foo5. Let's pretend that foo1 has 100 lines and the others have 1e6 lines each. With 5 cores, ideally each file would be counted by one worker.

library(parallel)
library(foreach)

# Create cluster
mycluster <- makePSOCKcluster(nnodes = 5, names = 5, outfile="")
doParallel::registerDoParallel(cl = mycluster)

# Set cut off for number of lines
line.cutoff <- 1e5

# list the files to count
files.to.count <- list.files("raw")

# Make directory to move files into if they are < line.cutoff
system(paste("mkdir files-with-less-than", line.cutoff,"-lines", sep = ""))


# Set up the final df
lines.and.names <- data.frame(matrix(ncol=2, nrow = 0))


# Loop over each file, but in parallel
foreach(i = 1:length(files.to.count)) %dopar%
  { 
    file <- files.to.count[i]
    
    # Count the lines in the file
    lines <- system(paste("wc -l ", file, sep = ""), intern = TRUE)
    
    # Move to the new directory if the count is at or below line.cutoff
    if(as.numeric(lines) <= line.cutoff){
      system(paste("mv ", file, " files-with-less-than", line.cutoff,"-lines", sep = ""))
    } else {
      NULL
    }
    
    # Add the name and line count to a dataframe
    lines.and.names <- rbind(lines.and.names, data.frame(name = file, lines.in.file = lines))
    
  }

# Write the final tsv of the line numbers and filenames
readr::write_tsv(lines.and.names, "lines_and_names.txt")


stopCluster(mycluster)

The desired result would be foo1 moved to the new directory (which the paste() call above names files-with-less-than1e+05-lines), the rest of the files left in raw, and a file called lines_and_names.txt with this text:

name    lines.in.file
foo1    100
foo2    1000000
foo3    1000000
foo4    1000000
foo5    1000000

If there are easier ways to parallelize this command, I'm open to suggestions there too.
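For comparison, here is a rough sketch of one alternative that uses only the base `parallel` package. It is an assumption-laden toy, not a tested solution: `count_lines` is a hypothetical helper, the files are small stand-ins written to a temporary directory, the cutoff is 10 lines, and lines are counted with `readLines()` (which reads the whole file into memory, so `wc -l` may still be preferable for very large files).

```r
library(parallel)

# Hypothetical helper: count the lines of one file, return a one-row data frame.
count_lines <- function(path) {
  data.frame(name = basename(path),
             lines.in.file = length(readLines(path)),
             stringsAsFactors = FALSE)
}

# Demo data in a temporary directory (stand-ins for raw/foo1 ... raw/foo3).
raw_dir <- file.path(tempdir(), "raw")
dir.create(raw_dir, showWarnings = FALSE)
writeLines(rep("x", 3),  file.path(raw_dir, "foo1"))
writeLines(rep("x", 20), file.path(raw_dir, "foo2"))
writeLines(rep("x", 20), file.path(raw_dir, "foo3"))

line_cutoff <- 10
files <- list.files(raw_dir, full.names = TRUE)

# One task per file; parLapply returns results in the same order as `files`.
cl <- makeCluster(2)
counts <- do.call(rbind, parLapply(cl, files, count_lines))
stopCluster(cl)

# Move the short files from the main process, after all counting is done.
small_dir <- file.path(tempdir(), "files-with-less-than-cutoff")
dir.create(small_dir, showWarnings = FALSE)
small <- files[counts$lines.in.file <= line_cutoff]
moved <- file.rename(small, file.path(small_dir, basename(small)))

# Write the tab-separated summary.
write.table(counts, file.path(tempdir(), "lines_and_names.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)
```

Keeping the move step in the main process means the workers only read files, which sidesteps any question of workers touching the directory at the same time.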
