I want to connect the realigner process with the indel reallignement. This is the rules:
rule gatk_IndelRealigner:
input:
tumor="mapped_reads/merged_samples/{tumor}.sorted.dup.reca.bam",
normal="mapped_reads/merged_samples/{normal}.sorted.dup.reca.bam",
id="mapped_reads/merged_samples/operation/{tumor}_{normal}.realign.intervals"
output:
"mapped_reads/merged_sample/CoClean/{tumor}.sorted.dup.reca.cleaned.bam",
"mapped_reads/merged_sample/CoClean/{normal}.sorted.dup.reca.cleaned.bam",
params:
genome=config['reference']['genome_fasta'],
mills= config['mills'],
ph1_indels= config['know_phy'],
log:
"mapped_reads/merged_samples/logs/{tumor}.indel_realign_2.log"
threads: 8
shell:
"gatk -T IndelRealigner -R {params.genome} "
"-nt {threads} "
"-I {input.tumor} -I {input.normal} -known {params.ph1_indels} -known {params.mills} -nWayOut .cleaned.bam --maxReadsInMemory 500000 --noOriginalAligmentTags --targetIntervals {input.id} >& {log} "
This is the error:
Not all output files of rule gatk_IndelRealigner contain the same wildcards.
I suppose I need to use also the {tumor}_{normal} but I can't use. Snakemake:
rule all:
input:expand("mapped_reads/merged_samples/CoClean/{sample}.sorted.dup.reca.cleaned.bam",sample=config['samples']),
expand("mapped_reads/merged_samples/operation/{sample[1][tumor]}_{sample[1][normal]}.realign.intervals", sample=read_table(config["conditions"], ",").iterrows())
config.yml
conditions: "conditions.csv"
conditions.csv
tumor,normal
411,412
Here you can see an example of the code (for testing purpose) gave the same error:
directory
$ tree prova/
prova/
├── condition.csv
├── config.yaml
├── output
│ ├── ABC.bam
│ ├── pippa.bam
│ ├── Pippo.bam
│ ├── TimBorn.bam
│ ├── TimNorm.bam
│ ├── TimTum.bam
│ └── XYZ.bam
└── Snakefile
this is snakemake
$ cat prova/Snakefile
from pandas import read_table
configfile: "config.yaml"
rule all:
input:
expand("{pathDIR}/{sample[1][tumor]}_{sample[1][normal]}.bam", pathDIR=config["pathDIR"], sample=read_table(config["sampleFILE"], " ").iterrows()),
expand("CoClean/{sample[1][tumor]}.bam", sample=read_table(config["sampleFILE"], " ").iterrows()),
expand("CoClean/{sample[1][normal]}.bam", sample=read_table(config["sampleFILE"], " ").iterrows())
rule gatk_RealignerTargetCreator:
input:
"{pathGRTC}/{normal}.bam",
"{pathGRTC}/{tumor}.bam",
output:
"{pathGRTC}/{tumor}_{normal}.bam"
# wildcard_constraints:
# tumor = '[^_|-|\/][0-9a-zA-Z]*',
# normal = '[^_|-|\/][0-9a-zA-Z]*'
run:
call('touch ' + str(wildcard.tumor) + '_' + str(wildcard.normal) + '.bam', shell=True)
rule gatk_IndelRealigner:
input:
t1="output/{tumor}.bam",
n1="output/{normal}.bam",
output:
"CoClean/{tumor}.sorted.dup.reca.cleaned.bam",
"CoClean/{normal}.sorted.dup.reca.cleaned.bam",
log:
"mapped_reads/merged_samples/logs/{tumor}.indel_realign_2.log"
threads: 8
shell:
"gatk -T IndelRealigner -R {params.genome} "
"-nt {threads} -I {input.t1} -I {input.n1} & {log} "
conditions.csv
$ more condition.csv
tumor normal
TimTum TimBorn
XYZ ABC
Pippo pippa
Thanks for any suggestion
I'm not convinced you have to include two input files to the GATK IndelRealigner. Building from that assumption, you can alter the rule to become indifferent to the "type (tumor vs normal)" of file it is process. I read the specs here. Please, if I am wrong, stop reading and correct me.
By changing the rule to be bam-type agnostic (made up word) you gain two advantages, and there is one main disadvantage.
Advantages:
Now we only have a single wild-card
We can run the alignment of each .bam file independently, which with a devoted CPU should hopefully make things faster.
Disadvantage:
The reason I think that the GATK documentation has it setup to accept multiple 'bam' files is because if you are just using it as a 1-off call you want to list all the files at the same time. We are not needing that since we are automating the call process. We're indifferent to 1 call or 100 calls.