I have an hg19-aligned BAM from which I want to generate a VCF with DeepVariant. I used samtools to extract the BAM header and confirmed that the hg19 reference FASTA index lists the same contigs and lengths. My original goal was to run only an exome model on this WGS BAM, using the following model and regions:
MODEL=gs://deepvariant/models/DeepVariant/0.7.2/DeepVariant-inception_v3-0.7.2+data-wes_standard
--regions gs://deepvariant/exome-case-study-testdata/refseq.coding_exons.b37.extended50.bed
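For reference, this is roughly how I compared the contigs between the BAM header and the FASTA index (the file names stand in for my actual paths):
# Contig names and lengths from the BAM header (@SQ lines are tab-delimited: @SQ  SN:<name>  LN:<length> ...).
samtools view -H wgs_data.bam | grep '^@SQ' | cut -f2,3 | sed 's/SN://;s/LN://' | sort > bam_contigs.txt
# Contig names and lengths from the reference FASTA index (.fai).
cut -f1,2 human_g1k_v37.fa.fai | sort > ref_contigs.txt
# Any output from diff means the BAM and the reference disagree.
diff bam_contigs.txt ref_contigs.txt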
Unfortunately, the script failed, reporting 0 matches between the BED and the BAM / FASTA reference. So I decided to run the same exome model, but without specifying any regions. Here is my script:
#!/bin/bash
set -euo pipefail
# Set common settings.
PROJECT_ID=<MY PROJECT>
OUTPUT_BUCKET=gs://<MY BUCKET>
STAGING_FOLDER_NAME=staging
OUTPUT_FILE_NAME=output.vcf
# Exome (WES) model, intentionally run on WGS data without --regions.
MODEL=gs://deepvariant/models/DeepVariant/0.7.2/DeepVariant-inception_v3-0.7.2+data-wes_standard
IMAGE_VERSION=0.7.2
DOCKER_IMAGE=gcr.io/deepvariant-docker/deepvariant:"${IMAGE_VERSION}"
COMMAND="/opt/deepvariant_runner/bin/gcp_deepvariant_runner \
--project ${PROJECT_ID} \
--zones us-west1-* \
--docker_image ${DOCKER_IMAGE} \
--outfile ${OUTPUT_BUCKET}/${OUTPUT_FILE_NAME} \
--staging ${OUTPUT_BUCKET}/${STAGING_FOLDER_NAME} \
--model ${MODEL} \
--bam gs://my-bucket/wgs_data.bam \
--ref gs://my-bucket/human_g1k_v37.fa \
--shards 512 \
--make_examples_workers 32 \
--make_examples_cores_per_worker 16 \
--make_examples_ram_per_worker_gb 60 \
--make_examples_disk_per_worker_gb 200 \
--call_variants_workers 32 \
--call_variants_cores_per_worker 32 \
--call_variants_ram_per_worker_gb 60 \
--call_variants_disk_per_worker_gb 50 \
--gcsfuse"
# Run the pipeline.
gcloud alpha genomics pipelines run \
  --project "${PROJECT_ID}" \
  --service-account-scopes="https://www.googleapis.com/auth/cloud-platform" \
  --logging "${OUTPUT_BUCKET}/${STAGING_FOLDER_NAME}/runner_logs_$(date +%Y%m%d_%H%M%S).log" \
  --regions us-west1 \
  --docker-image gcr.io/cloud-genomics-pipelines/gcp-deepvariant-runner \
  --command-line "${COMMAND}"
The BAM has a corresponding BAI and the FASTA has an FAI. The DeepVariant quick start indicates these settings will produce a VCF in 1-2 hours, but my pipeline has now been running for over 7 hours. The staging folder contains a call_variants directory with what appear to be 31 of the 32 expected GZ files, and the Genomics pipelines view shows 11 pipelines still running call_variants, so I suspect it is working on that last file before combining them all into a single VCF.
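For context, this is how I am counting the shard outputs under staging as they land (reusing the variables from the script above; the subfolder name is simply what I see in my bucket):
# Count the call_variants shard files written so far.
gsutil ls "${OUTPUT_BUCKET}/${STAGING_FOLDER_NAME}/call_variants/*.gz" | wc -l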
I just do not understand why this is taking so long. I have purposely excluded preemptible instances, and the documentation says an exome pipeline should take only 20 minutes (with WGS at 1-2 hours). Why might this be so slow?
The run time you are seeing is certainly slower than expected for DeepVariant.
One observation to start: the contig naming and coordinates in the exome capture BED (refseq.coding_exons.b37.extended50.bed) and in the reference (human_g1k_v37.fa) must match your BAM. Do you know which reference genome your BAM was actually mapped to? hg19 uses "chr1"-style contig names while b37 uses plain "1", and that mismatch alone could explain the 0 matches you saw. Just to confirm: the first header line of your FASTA should read ">1", with no "chr".
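If it helps, a quick way to compare the naming conventions side by side (your actual paths will differ):
# First FASTA header line: ">1 ..." for b37, ">chr1 ..." for hg19/UCSC naming.
head -n 1 human_g1k_v37.fa
# Contig names the BAM was actually aligned to, from its @SQ header lines.
samtools view -H wgs_data.bam | grep '^@SQ' | cut -f2 | head
# Contig names used in the BED file.
cut -f1 refseq.coding_exons.b37.extended50.bed | sort -u | head
All three should agree before you rerun with --regions.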
The expected time should be <1 hour when using a regions file.
Second, can I ask you to try running the exome case study on a single machine, following the instructions from this page:
https://github.com/google/deepvariant/blob/r0.8/docs/deepvariant-exome-case-study.md
Running this will help determine whether the issue you are seeing has something to do with DeepVariant itself, or whether it relates to the GCP cloud runner, which is separate from the program.
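A single-machine run with the one-step runner shipped in the 0.8 Docker image looks roughly like the sketch below; the version, image path, file names, and flags here are from memory of that release, so defer to the case study page for the exact commands:
BIN_VERSION=0.8.0
sudo docker run \
  -v "${PWD}/input":"/input" \
  -v "${PWD}/output":"/output" \
  gcr.io/deepvariant-docker/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WES \
  --ref=/input/reference.fa \
  --reads=/input/exome_sample.bam \
  --output_vcf=/output/output.vcf.gz \
  --num_shards=$(nproc)
Running it on a single VM (or locally) takes the cloud orchestration out of the picture entirely, which is the point of the comparison.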
Thank you, Andrew