Why is publishDir not copying large directories?

I am using publishDir to copy an output directory (with subdirectories) from my Cellranger process to a specified directory. This works for smaller output directories, but not for a larger one (14 GB total). I am using the Slurm executor on a shared file system.

What I've tried so far:

  • I have cleared the cache
  • Changed my OpenJDK to Corretto (it was previously JetBrains)
  • Tried another file system
  • Tried other runs of the same size from the same directory
  • Tried smaller runs (successful)

Here is my Nextflow script:

params.run_dir = ''
params.sample_id = ''
params.csv = ''
params.workflow = ''
params.gex_outdir = '/work/srb108/GEX'


process makefastq_gex {
    cache 'lenient'
    beforeScript 'module load Cell-Ranger/7.2.0'
    publishDir "${params.gex_outdir}", mode: 'copy'

    input:
    path run_dir
    val sample_id
    path csv
    
    output:
    path "${sample_id}_mkfastq_outs"

    script:
    """
    cellranger mkfastq --run=${run_dir} --csv=${csv} --output-dir=${sample_id}_mkfastq_outs --delete-undetermined
    """

} 


workflow {
    run_dir = Channel.fromPath(params.run_dir)
    sample_id = params.sample_id
    csv = Channel.fromPath(params.csv)
    workflow = params.workflow

    if(params.workflow == "gex") {
        makefastq_gex(run_dir, sample_id, csv)
    }  
    
}

Here is my config file:

workDir = '/work/srb108'

process {
    executor = 'slurm'

    withName: makefastq_gex {
        cpus = 10
        memory = '250 GB'
        clusterOptions = '--partition=dhvi --job-name=test_mkfastq_job --mail-type=END [email protected]'
    }

}

This is the ending portion of a log file for a run in which publishDir was SUCCESSFUL:

Mar-18 14:52:11.105 [Task monitor] DEBUG nextflow.util.ThreadPoolBuilder - Creating thread pool 'PublishDir' minSize=10; maxSize=10; workQueue=LinkedBlockingQueue[10000]; allowCoreThreadTimeout=false
Mar-18 14:52:11.254 [main] DEBUG nextflow.Session - Session await > all processes finished
Mar-18 14:52:11.280 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: slurm) - terminating tasks monitor poll loop
Mar-18 14:52:11.285 [main] DEBUG nextflow.Session - Session await > all barriers passed
Mar-18 14:52:16.328 [main] INFO  nextflow.util.ThreadPoolHelper - Waiting for file transfers to complete (1 files)
(the line above repeats a few times)
Mar-18 14:58:42.520 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'PublishDir' shutdown completed (hard=false)
Mar-18 14:58:43.105 [main] DEBUG n.trace.WorkflowStatsObserver - Workflow completed > WorkflowStats[succeededCount=1; failedCount=0; ignoredCount=0; cachedCount=0; pendingCount=0; submittedCount=0; runningCount=0; retriesCount=0; abortedCount=0; succeedDuration=7h 41m 42s; failedDuration=0ms; cachedDuration=0ms;loadCpus=0; loadMemory=0; peakRunning=1; peakCpus=10; peakMemory=100 GB; ]
Mar-18 14:58:43.731 [main] DEBUG nextflow.cache.CacheDB - Closing CacheDB done
Mar-18 14:58:44.215 [main] DEBUG nextflow.util.ThreadPoolManager - Thread pool 'FileTransfer' shutdown completed (hard=false)
Mar-18 14:58:44.216 [main] DEBUG nextflow.script.ScriptRunner - > Execution complete -- Goodbye

However, the log for runs whose results are not published never reaches this ending portion. It stops at the following lines:

Mar-25 16:02:02.696 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor slurm > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[jobId: 5508033; id: 1; name: makefastq_gex (1); status: RUNNING; exit: -; error: -; workDir: /work/srb108/3e/ec9f059261a38bd6ff58c6ced0911b started: 1711387919417; exited: -; ]
Mar-25 16:07:02.702 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor slurm > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[jobId: 5508033; id: 1; name: makefastq_gex (1); status: RUNNING; exit: -; error: -; workDir: /work/srb108/3e/ec9f059261a38bd6ff58c6ced0911b started: 1711387919417; exited: -; ]
Mar-25 16:12:02.709 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor slurm > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[jobId: 5508033; id: 1; name: makefastq_gex (1); status: RUNNING; exit: -; error: -; workDir: /work/srb108/3e/ec9f059261a38bd6ff58c6ced0911b started: 1711387919417; exited: -; ]

There is nothing in .command.err.

It looks like the publishDir thread pool is not being created for these larger runs. I'm at a loss for where to go from here.
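
Since the last log lines still list the task as RUNNING, one check worth making is whether the task itself has finished from Slurm's point of view. The following is a minimal sketch, not a fix. It assumes Nextflow's usual behaviour of writing a `.exitcode` file into the task work directory when the task wrapper ends, and that `sacct` is available on the cluster. The function name `task_state` is made up for illustration; the path and job ID are copied from the log above.

```shell
# Hypothetical helper: report whether a Nextflow task directory
# contains the .exitcode file written when the task wrapper ends.
task_state() {
    if [ -f "$1/.exitcode" ]; then
        echo "finished with exit code $(cat "$1/.exitcode")"
    else
        echo "no .exitcode yet"
    fi
}

# Work directory copied from the TaskHandler line in the log.
task_state /work/srb108/3e/ec9f059261a38bd6ff58c6ced0911b

# Ask Slurm for its view of the job (skipped if sacct is not installed).
if command -v sacct >/dev/null 2>&1; then
    sacct -j 5508033 --format=JobID,State,ExitCode,Elapsed || echo "sacct query failed"
fi
```

If Slurm reports the job as completed while `.exitcode` is missing, or while Nextflow keeps listing the task as RUNNING, the problem would likely sit before the publishing step rather than in the copy itself.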
