Snakemake integrate the multiple command lines in a rule

564 Views Asked by At

The output of my first command line "bcftools query -l {input.invcf} | head -n 1" prints the name of the first individual of vcf file (i.e. IND1). I want to use that output in selectvariants GATK in -sn IND1 option. How is it possible to integrate the 1st comamnd line in snakemake in order to use it's output in the next one?

rule selectvar:
    input:
        invcf="{family}_my.vcf"
    params:
        ind= ???
        ref="ref.fasta"
    output:
        out="{family}.dn.vcf"
    shell:
        """
        bcftools query -l {input.invcf} | head -n 1 > {params.ind}
        gatk --java-options "-Xms2G -Xmx2g -XX:ParallelGCThreads=2" SelectVariants -R {params.ref} -V {input.invcf} -sn {params.ind} -O {output.out}
        """
3

There are 3 best solutions below

0
On BEST ANSWER

There are several options, but the easiest one is to store the results into a temporary bash variable:

rule selectvar:
   ...
   shell:
        """
        myparam=$(bcftools query -l {input.invcf} | head -n 1)
        gatk -sn "$myparam" ...
        """

As noted by @dariober, if one modifies pipefail behaviour, there could be unexpected results, see the example in this answer.

2
On

I think I found a solution:

rule selectvar:
    input:
        invcf="{family}_my.vcf"
    params:
        ref="ref.fasta"
    output:
        out="{family}.dn.vcf"
    shell:
        """
        gatk --java-options "-Xms2G -Xmx2g -XX:ParallelGCThreads=2" SelectVariants -R {params.ref} -V {input.invcf} -sn `bcftools query -l {input.invcf} | head -n 1` -O {output.out}
        """
0
On

When I have to do these things I prefer to use run instead of shell, and then shell out only at the end. The reason for this is because it makes it possible for snakemake to lint the run statement, and to exit early if something goes wrong instead of following through with a broken shell command.

rule selectvar:
    input:
        invcf="{family}_my.vcf"
    params:
        ref="ref.fasta"
        gatk_opts='--java-options "-Xms2G -Xmx2g -XX:ParallelGCThreads=2" SelectVariants'
    output:
        out="{family}.dn.vcf"
    run:
        opts = "{params.gatk_opts} -R {params.ref} -V {input.invcf} -O {output.out}"
        sn_parameter = shell("bcftools query -l {input.invcf} | head -n 1")
        # we could add a sanity check here if necessary, before shelling out
        shell(f"gatk {options} {sn_parameter}")
        """