I am trying to run a large number of AlphaFold jobs. I followed some articles and successfully finished my first trial run, but I realized that the Docker container could not recognize the GPU I attached to the host. The commands I used are as follows:
# dsub command
dsub --provider google-cls-v2 \
  --project ${PROJECT_ID} \
  --logging gs://$BUCKET/logs \
  --image=$IMAGE \
  --script=alphafold.sh \
  --mount DB="${IMAGE_URL} 3000" \
  --machine-type n1-standard-8 \
  --boot-disk-size 100 \
  --subnetwork ${SUBNET_NAME} \
  --accelerator-type nvidia-tesla-k80 \
  --accelerator-count 1 \
  --preemptible \
  --zones ${ZONE_NAMES} \
  --tasks batch_tasks.tsv 1-2
# alphafold.sh
cd /app/alphafold
mkdir -p output
python run_alphafold.py \
  --fasta_paths=${FASTA} \
  --data_dir=${DB} \
  --output_dir=output \
  --use_gpu_relax=True \
  --uniref90_database_path=${DB}/uniref90/uniref90.fasta \
  --mgnify_database_path=${DB}/mgnify/mgy_clusters_2018_12.fa \
  --template_mmcif_dir=${DB}/pdb_mmcif/mmcif_files \
  --max_template_date=2020-05-14 \
  --obsolete_pdbs_path=${DB}/pdb_mmcif/obsolete.dat \
  --pdb70_database_path=${DB}/pdb70/pdb70 \
  --uniclust30_database_path=${DB}/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
  --bfd_database_path=${DB}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
  --db_preset=full_dbs \
  --benchmark=False
tar zcvf ${OUT_PATH} output
In the stdout, I saw these lines:
I0416 02:16:30.765444 140264211408704 tpu_client.py:54] Starting the local TPU driver.
I0416 02:16:30.766041 140264211408704 xla_bridge.py:212] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
I0416 02:16:31.607698 140264211408704 xla_bridge.py:212] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
In the logs, I saw that it tries to install the GPU driver into the kernel using an image called cos-gpu-installer from a Google registry before starting the user-command container. But when I tried the nvidia-smi command while the user-command container was still running,
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
I got: could not select device driver "" with capabilities: [[gpu]].
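A further check I could run from inside the user-command container (for example, just before run_alphafold.py in alphafold.sh) is a minimal sketch like the one below. It only records whether NVIDIA device nodes and nvidia-smi are visible to the process. The function name gpu_visibility_report and the set of environment variables it prints are my own illustrative choices, not part of dsub or AlphaFold, and an empty result would only suggest (not prove) that the driver is not exposed to the container.

```python
import glob
import os
import shutil
import subprocess


def gpu_visibility_report():
    """Collect basic evidence of whether this process can see an NVIDIA GPU.

    Hypothetical helper for troubleshooting; not part of dsub or AlphaFold.
    """
    report = {
        # Device nodes such as /dev/nvidia0 are present only if the driver
        # has been exposed to this container.
        "device_nodes": sorted(glob.glob("/dev/nvidia*")),
        # nvidia-smi is on PATH only if the driver utilities are available here.
        "nvidia_smi_path": shutil.which("nvidia-smi"),
        "nvidia_smi_output": None,
        # Environment variables that commonly influence GPU discovery
        # (assumed to be relevant; unset values are reported as None).
        "env": {
            name: os.environ.get(name)
            for name in (
                "CUDA_VISIBLE_DEVICES",
                "NVIDIA_VISIBLE_DEVICES",
                "LD_LIBRARY_PATH",
            )
        },
    }
    if report["nvidia_smi_path"]:
        try:
            # "nvidia-smi -L" lists the GPUs the driver can see.
            result = subprocess.run(
                [report["nvidia_smi_path"], "-L"],
                capture_output=True,
                text=True,
                timeout=30,
            )
            report["nvidia_smi_output"] = (result.stdout or result.stderr).strip()
        except (OSError, subprocess.SubprocessError) as exc:
            report["nvidia_smi_output"] = f"failed to run nvidia-smi: {exc}"
    return report


if __name__ == "__main__":
    for key, value in gpu_visibility_report().items():
        print(f"{key}: {value}")
```

Saved under a hypothetical name such as check_gpu.py and called with python check_gpu.py at the top of alphafold.sh, the output would end up in the same stdout log as the lines above, which makes it easier to compare what the container sees with what the host reports.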
The Dockerfile I used to build the Docker image is from the docker folder of the AlphaFold repository:
git clone https://github.com/deepmind/alphafold.git
cd alphafold
docker build -f docker/Dockerfile -t alphafold .
docker tag alphafold us.gcr.io/${PROJECT_ID}/alphafold
docker push us.gcr.io/${PROJECT_ID}/alphafold
Any suggestions on how to troubleshoot the issue would be appreciated.