Failed to use GPU to run AlphaFold on GCP with dsub

What I'm trying to do is run a large number of AlphaFold jobs. I followed some articles and was able to finish my first trial run successfully, but I realized the Docker container could not see the GPU attached to the host. The commands I used are as follows:

# dsub command
dsub --provider google-cls-v2 \
  --project ${PROJECT_ID} \
  --logging gs://$BUCKET/logs \
  --image=$IMAGE \
  --script=alphafold.sh \
  --mount DB="${IMAGE_URL} 3000" \
  --machine-type n1-standard-8 \
  --boot-disk-size 100 \
  --subnetwork ${SUBNET_NAME} \
  --accelerator-type nvidia-tesla-k80 \
  --accelerator-count 1 \
  --preemptible \
  --zones ${ZONE_NAMES} \
  --tasks batch_tasks.tsv 1-2
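
To confirm the accelerator actually gets attached to the worker VM, one check I can run is to inspect the job with dstat, which ships alongside dsub (a rough sketch; JOB_ID is whatever dsub printed when the job was submitted):

# dstat command
dstat --provider google-cls-v2 \
  --project ${PROJECT_ID} \
  --jobs ${JOB_ID} \
  --full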

# alphafold.sh
#!/bin/bash

cd /app/alphafold

# Local output directory inside the container; archived to ${OUT_PATH} at the end.
mkdir -p output

python run_alphafold.py \
  --fasta_paths=${FASTA} \
  --data_dir=${DB} \
  --output_dir=output \
  --use_gpu_relax=True \
  --uniref90_database_path=${DB}/uniref90/uniref90.fasta \
  --mgnify_database_path=${DB}/mgnify/mgy_clusters_2018_12.fa \
  --template_mmcif_dir=${DB}/pdb_mmcif/mmcif_files \
  --max_template_date=2020-05-14 \
  --obsolete_pdbs_path=${DB}/pdb_mmcif/obsolete.dat \
  --pdb70_database_path=${DB}/pdb70/pdb70 \
  --uniclust30_database_path=${DB}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
  --bfd_database_path=${DB}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
  --db_preset=full_dbs \
  --benchmark=False

tar zcvf ${OUT_PATH} output
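
The ${FASTA} and ${OUT_PATH} variables come from the tasks file passed via --tasks. Roughly, batch_tasks.tsv looks like this (a sketch with placeholder paths, tab-separated; the headers follow dsub's --input/--output tasks-file convention):

# batch_tasks.tsv
--input FASTA	--output OUT_PATH
gs://my-bucket/fasta/target1.fasta	gs://my-bucket/results/target1.tar.gz
gs://my-bucket/fasta/target2.fasta	gs://my-bucket/results/target2.tar.gz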

In the stdout log, I saw these lines:

I0416 02:16:30.765444 140264211408704 tpu_client.py:54] Starting the local TPU driver.
I0416 02:16:30.766041 140264211408704 xla_bridge.py:212] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
I0416 02:16:31.607698 140264211408704 xla_bridge.py:212] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
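
These messages come from JAX's backend initialization (xla_bridge), and nothing in the log reports a GPU backend being picked up. A quick check I can run from inside the user-command container to see which devices JAX actually finds (assuming python3 in the image is the same environment AlphaFold runs with):

python3 -c "import jax; print(jax.default_backend()); print(jax.local_devices())"
# expect something like 'cpu' and [CpuDevice(id=0)] if JAX has fallen back to CPU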

In the provisioning log I could see that, before starting the user-command container, it tries to install the GPU driver and load it into the kernel using an image called cos-gpu-installer from a Google registry. But when I ran the following on the host while the user-command container was still running,

docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

I got the error: could not select device driver "" with capabilities: [[gpu]].
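
Since the worker VM runs Container-Optimized OS and the driver comes from cos-gpu-installer, my understanding is that the driver is installed under /var/lib/nvidia on the host rather than registered with Docker's nvidia runtime, so the --gpus all path has no driver to hand out. A rough sketch of what I can check on the host itself (over SSH, while the job is running) under that assumption:

ls /dev/nvidia*                  # device nodes should exist if the kernel module loaded
/var/lib/nvidia/bin/nvidia-smi   # driver utilities placed there by cos-gpu-installer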

The Dockerfile I used to build the image is the one in the AlphaFold repository's docker folder:

git clone https://github.com/deepmind/alphafold.git
cd alphafold
docker build -f docker/Dockerfile -t alphafold .
docker tag alphafold us.gcr.io/${PROJECT_ID}/alphafold
docker push us.gcr.io/${PROJECT_ID}/alphafold
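
As a sanity check on the image itself, I can run a quick GPU smoke test before pushing, on any machine with a GPU and the NVIDIA Container Toolkit installed (a sketch; it overrides the entrypoint so it works regardless of what the Dockerfile sets):

docker run --rm --gpus all --entrypoint python3 alphafold \
  -c "import jax; print(jax.local_devices())"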

Any suggestions on how to troubleshoot the issue would be appreciated.
