I am trying to configure a Ray cluster with the cluster.yaml configuration below. For GCP, I am required to specify a boot image. With 50 GB images, the head and worker nodes start without issue, but with 10 GB images either the SSH connection never establishes or I get the following error:
Cluster: minimal
2023-11-26 15:04:15,589 INFO util.py:375 -- setting max workers for head node type to 0
Checking GCP environment settings
2023-11-26 15:04:17,762 INFO config.py:556 -- _configure_key_pair: Private key not specified in config, using /Users/psr/.ssh/ray-autoscaler_gcp_us-west1_[project-id]_ubuntu_0.pem
No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Acquiring an up-to-date head node
2023-11-26 15:04:20,482 INFO node.py:321 -- wait_for_compute_zone_operation: Waiting for operation operation-1701032659383-60b148769f53e-930c3d35-1a8da44c to finish...
2023-11-26 15:04:30,905 INFO node.py:340 -- wait_for_compute_zone_operation: Operation operation-1701032659383-60b148769f53e-930c3d35-1a8da44c finished.
Launched a new head node
Fetching the new head node
<1/1> Setting up head node
Prepared bootstrap config
2023-11-26 15:04:31,984 INFO node.py:321 -- wait_for_compute_zone_operation: Waiting for operation operation-1701032671517-60b1488231ad3-52fc0469-89cba5d4 to finish...
2023-11-26 15:04:37,446 INFO node.py:340 -- wait_for_compute_zone_operation: Operation operation-1701032671517-60b1488231ad3-52fc0469-89cba5d4 finished.
New status: waiting-for-ssh
[1/7] Waiting for SSH to become available
Running `uptime` as a test.
Fetched IP: 34.82.93.47
ssh: connect to host 34.82.93.47 port 22: Connection refused
SSH still not available (SSH command failed.), retrying in 5 seconds.
Warning: Permanently added '34.82.93.47' (ED25519) to the list of known hosts.
"System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8)."
Connection closed by 34.82.93.47 port 22
SSH still not available (SSH command failed.), retrying in 5 seconds.
Warning: Permanently added '34.82.93.47' (ED25519) to the list of known hosts.
"System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8)."
Connection closed by 34.82.93.47 port 22
SSH still not available (SSH command failed.), retrying in 5 seconds.
Warning: Permanently added '34.82.93.47' (ED25519) to the list of known hosts.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
21:05:03 up 0 min, 1 user, load average: 0.47, 0.12, 0.04
Shared connection to 34.82.93.47 closed.
Success.
Updating cluster configuration. [hash=c5d2c90d9c700fb411193fd7adeaba4698f29fd6]
2023-11-26 15:05:04,415 INFO node.py:321 -- wait_for_compute_zone_operation: Waiting for operation operation-1701032704136-60b148a14d67e-76f6c46f-d515a79b to finish...
2023-11-26 15:05:09,741 INFO node.py:340 -- wait_for_compute_zone_operation: Operation operation-1701032704136-60b148a14d67e-76f6c46f-d515a79b finished.
New status: syncing-files
[2/7] Processing file mounts
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
Shared connection to 34.82.93.47 closed.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
Shared connection to 34.82.93.47 closed.
[3/7] No worker file mounts to sync
2023-11-26 15:05:11,818 INFO node.py:321 -- wait_for_compute_zone_operation: Waiting for operation operation-1701032711498-60b148a852c8e-496dc400-5656f8c3 to finish...
2023-11-26 15:05:17,257 INFO node.py:340 -- wait_for_compute_zone_operation: Operation operation-1701032711498-60b148a852c8e-496dc400-5656f8c3 finished.
New status: setting-up
[4/7] No initialization commands to run.
[5/7] Initializing command runner
[6/7] Running setup commands
(0/3) (stat /opt/conda/bin/ &> /dev/...
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
Shared connection to 34.82.93.47 closed.
(1/3) which ray || pip install -U "r...
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
Command 'pip' not found, but can be installed with:
sudo apt install python3-pip
Shared connection to 34.82.93.47 closed.
2023-11-26 15:05:18,829 INFO node.py:321 -- wait_for_compute_zone_operation: Waiting for operation operation-1701032718577-60b148af12dfd-0e1364e0-202ae734 to finish...
2023-11-26 15:05:24,431 INFO node.py:340 -- wait_for_compute_zone_operation: Operation operation-1701032718577-60b148af12dfd-0e1364e0-202ae734 finished.
New status: update-failed
!!!
SSH command failed.
!!!
Failed to setup head node.
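In case it is relevant, the minimum boot-disk size that the source image requires can be inspected with the gcloud CLI (assuming it is installed and authenticated):

```shell
# Report the minimum boot-disk size (in GB) the image was built for.
gcloud compute images describe ubuntu-2304-lunar-amd64-v20231030 \
    --project ubuntu-os-cloud \
    --format="value(diskSizeGb)"
```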
The following is my cluster configuration:
# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal

max_workers: 1

# Cloud-provider specific configuration.
provider:
  type: gcp
  region: us-west1
  project_id: [project_id]
  availability_zone: us-west1-a

available_node_types:
  ray_head_default:
    node_config:
      machineType: e2-standard-4
      disks:
        - boot: true
          autoDelete: true
          initializeParams:
            diskSizeGb: 20
            sourceImage: projects/ubuntu-os-cloud/global/images/ubuntu-2304-lunar-amd64-v20231030
            diskType: projects/ubuntu-os-cloud/zones/us-west1-a/diskTypes/pd-ssd
      serviceAccounts:
        - email: ray-autoscaler-sa-v1@[project_id].iam.gserviceaccount.com
    resources:
      CPU: 2
  ray_worker_default:
    min_workers: 0
    max_workers: 1
    node_config:
      machineType: e2-standard-4
      disks:
        - boot: true
          autoDelete: true
          initializeParams:
            diskSizeGb: 20
            sourceImage: projects/ubuntu-os-cloud/global/images/ubuntu-2304-lunar-amd64-v20231030
            diskType: projects/ubuntu-os-cloud/zones/us-west1-a/diskTypes/pd-ssd
      serviceAccounts:
        - email: ray-autoscaler-sa-v1@[project_id].iam.gserviceaccount.com
    resources:
      CPU: 2

head_node_type: ray_head_default
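One guess on my part (not verified): since the setup step fails with `Command 'pip' not found`, perhaps this Ubuntu image ships without pip and I need to install it myself before Ray's default setup commands run. Would something like the following `setup_commands` block be the right approach? The exact package names here are my assumption:

```yaml
# Hypothetical fix: install pip before Ray's setup commands try to use it.
setup_commands:
  - sudo apt-get update
  - sudo apt-get install -y python3-pip
  - pip3 install -U "ray[default]"
```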
Can you please tell me what I am doing wrong?