I am creating a Nextflow pipeline for analysing genomics data. The pipeline can carry out all steps up to creating BAM files and marking duplicates. However, I am unable to pass the BAM files created by my MARK_DUPLICATES step into the GATK CollectMultipleMetrics tool. Nextflow reports that the MULTIMETRICS process declares 3 input channels but 5 were specified:
Process `MULTIMETRICS` declares 3 input channels but 5 were specified
The MARK_DUPLICATES process:
process MARK_DUPLICATES {
    cpus 10
    publishDir params.outdir, mode:'move'
    container 'broadinstitute/gatk:latest'

    input:
    tuple val(sample_id), path(reads)

    output:
    path "${sample_id}_MarkedDup.bam"
    path "${sample_id}_MarkedDuplicates.txt"
    path "${sample_id}_MarkedDup.bai"

    script:
    """
    gatk MarkDuplicates I=${reads[0]} O=${sample_id}_MarkedDup.bam M=${sample_id}_MarkedDuplicates.txt CREATE_INDEX=true
    """
}
The MULTIMETRICS process:
process MULTIMETRICS {
    container 'broadinstitute/gatk:latest'

    input:
    path "${sample_id}_MarkedDup.bam"
    path(genome)
    val genomeid

    output:
    tuple val(sample_id), path("${sample_id}_multimetrics")

    script:
    """
    gatk CollectMultipleMetrics I=${reads} O=${sample_id}_multimetrics R=${genome}/$genomeid
    """
}
The workflow:
picard_ch=MARK_DUPLICATES(addreadgroups_ch)
// picard_ch.view()
multimetrics_ch=MULTIMETRICS(picard_ch, params.genomefile, params.genomeid)
I see what's happening. MARK_DUPLICATES declares 3 separate files in its output declaration. Since you've not specified which one you want to send to MULTIMETRICS, picard_ch carries all 3 outputs. Together with params.genomefile and params.genomeid, that makes 5 inputs against the 3 that MULTIMETRICS declares. Though I do find it a little strange that it's not trying to process each file individually.

This can be solved with more explicit output declarations. One solution is to channel all MARK_DUPLICATES outputs into a single channel, for example by declaring them as one tuple. Alternatively, you could emit just the BAM from MARK_DUPLICATES as a named output. Either way, the MULTIMETRICS input declaration and the workflow call would need to change to match.
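The code blocks that originally accompanied this answer are not shown here, so what follows is only a minimal sketch of how the fix might look, not the answerer's exact code. It combines both ideas: the BAM and its index are grouped with sample_id into one tuple output, and that tuple gets a name via emit. The output names bam and metrics, the switch from 'move' to 'copy' in publishDir, and the glob on the MULTIMETRICS output are my own assumptions. addreadgroups_ch is taken to come from the upstream steps of the question's pipeline.

```nextflow
process MARK_DUPLICATES {
    cpus 10
    // Assumption: 'copy' instead of the question's 'move'. Moved files leave
    // the work directory, so a downstream process could no longer find them.
    publishDir params.outdir, mode: 'copy'
    container 'broadinstitute/gatk:latest'

    input:
    tuple val(sample_id), path(reads)

    output:
    // One named channel carrying the sample ID, the BAM and its index together
    tuple val(sample_id), path("${sample_id}_MarkedDup.bam"), path("${sample_id}_MarkedDup.bai"), emit: bam
    // The duplication metrics get their own named channel
    path "${sample_id}_MarkedDuplicates.txt", emit: metrics

    script:
    // Command unchanged from the question
    """
    gatk MarkDuplicates I=${reads[0]} O=${sample_id}_MarkedDup.bam M=${sample_id}_MarkedDuplicates.txt CREATE_INDEX=true
    """
}

process MULTIMETRICS {
    container 'broadinstitute/gatk:latest'

    input:
    // Matches the shape of MARK_DUPLICATES.out.bam, so sample_id and bam are defined here
    tuple val(sample_id), path(bam), path(bai)
    path genome
    val genomeid

    output:
    // Assumption: O= is treated as a base name and several metrics files are
    // written with that prefix, hence the glob
    tuple val(sample_id), path("${sample_id}_multimetrics*")

    script:
    """
    gatk CollectMultipleMetrics I=${bam} O=${sample_id}_multimetrics R=${genome}/${genomeid}
    """
}

workflow {
    // ... upstream steps producing addreadgroups_ch, as in the question ...
    MARK_DUPLICATES(addreadgroups_ch)

    // Pass only the named BAM output, so MULTIMETRICS receives exactly 3 inputs
    MULTIMETRICS(MARK_DUPLICATES.out.bam, params.genomefile, params.genomeid)
}
```

Selecting MARK_DUPLICATES.out.bam explicitly is what brings the input count back down to 3. The metrics file stays available as MARK_DUPLICATES.out.metrics if a later step needs it.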