Ray head node fails to connect when running ray up

I am trying to bring up a Ray cluster on GCP with the cluster.yaml configuration shown at the bottom of this post. For GCP, I am required to specify a source image. With 50 GB images the head and worker nodes start without problems, but with 10 GB images either the SSH connection is never established or setup fails with the following error:

Cluster: minimal

2023-11-26 15:04:15,589 INFO util.py:375 -- setting max workers for head node type to 0
Checking GCP environment settings
2023-11-26 15:04:17,762 INFO config.py:556 -- _configure_key_pair: Private key not specified in config, using /Users/psr/.ssh/ray-autoscaler_gcp_us-west1_[project-id]_ubuntu_0.pem
No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]

Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Acquiring an up-to-date head node
2023-11-26 15:04:20,482 INFO node.py:321 -- wait_for_compute_zone_operation: Waiting for operation operation-1701032659383-60b148769f53e-930c3d35-1a8da44c to finish...
2023-11-26 15:04:30,905 INFO node.py:340 -- wait_for_compute_zone_operation: Operation operation-1701032659383-60b148769f53e-930c3d35-1a8da44c finished.
  Launched a new head node
  Fetching the new head node

<1/1> Setting up head node
  Prepared bootstrap config
2023-11-26 15:04:31,984 INFO node.py:321 -- wait_for_compute_zone_operation: Waiting for operation operation-1701032671517-60b1488231ad3-52fc0469-89cba5d4 to finish...
2023-11-26 15:04:37,446 INFO node.py:340 -- wait_for_compute_zone_operation: Operation operation-1701032671517-60b1488231ad3-52fc0469-89cba5d4 finished.
  New status: waiting-for-ssh
  [1/7] Waiting for SSH to become available
    Running `uptime` as a test.
    Fetched IP: 34.82.93.47
ssh: connect to host 34.82.93.47 port 22: Connection refused
    SSH still not available (SSH command failed.), retrying in 5 seconds.
Warning: Permanently added '34.82.93.47' (ED25519) to the list of known hosts.
"System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8)."
Connection closed by 34.82.93.47 port 22
    SSH still not available (SSH command failed.), retrying in 5 seconds.
Warning: Permanently added '34.82.93.47' (ED25519) to the list of known hosts.
"System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8)."
Connection closed by 34.82.93.47 port 22
    SSH still not available (SSH command failed.), retrying in 5 seconds.
Warning: Permanently added '34.82.93.47' (ED25519) to the list of known hosts.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

 21:05:03 up 0 min,  1 user,  load average: 0.47, 0.12, 0.04
Shared connection to 34.82.93.47 closed.
    Success.
  Updating cluster configuration. [hash=c5d2c90d9c700fb411193fd7adeaba4698f29fd6]
2023-11-26 15:05:04,415 INFO node.py:321 -- wait_for_compute_zone_operation: Waiting for operation operation-1701032704136-60b148a14d67e-76f6c46f-d515a79b to finish...
2023-11-26 15:05:09,741 INFO node.py:340 -- wait_for_compute_zone_operation: Operation operation-1701032704136-60b148a14d67e-76f6c46f-d515a79b finished.
  New status: syncing-files
  [2/7] Processing file mounts
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

Shared connection to 34.82.93.47 closed.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

Shared connection to 34.82.93.47 closed.
  [3/7] No worker file mounts to sync
2023-11-26 15:05:11,818 INFO node.py:321 -- wait_for_compute_zone_operation: Waiting for operation operation-1701032711498-60b148a852c8e-496dc400-5656f8c3 to finish...
2023-11-26 15:05:17,257 INFO node.py:340 -- wait_for_compute_zone_operation: Operation operation-1701032711498-60b148a852c8e-496dc400-5656f8c3 finished.
  New status: setting-up
  [4/7] No initialization commands to run.
  [5/7] Initializing command runner
  [6/7] Running setup commands
    (0/3) (stat /opt/conda/bin/ &> /dev/...
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

Shared connection to 34.82.93.47 closed.
    (1/3) which ray || pip install -U "r...
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.

Command 'pip' not found, but can be installed with:
sudo apt install python3-pip
Shared connection to 34.82.93.47 closed.
2023-11-26 15:05:18,829 INFO node.py:321 -- wait_for_compute_zone_operation: Waiting for operation operation-1701032718577-60b148af12dfd-0e1364e0-202ae734 to finish...
2023-11-26 15:05:24,431 INFO node.py:340 -- wait_for_compute_zone_operation: Operation operation-1701032718577-60b148af12dfd-0e1364e0-202ae734 finished.
  New status: update-failed
  !!!
  SSH command failed.
  !!!

  Failed to setup head node.
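
Looking at the log, the `(1/3) which ray || pip install ...` step fails because `pip` does not exist on the minimal image (the log itself suggests `sudo apt install python3-pip`), and the `(0/3) stat /opt/conda/bin/ ...` check suggests Ray's default setup expects a conda installation to be present. As a possible workaround I have been considering explicit setup commands along these lines; this is only a rough sketch that I have not yet verified on the small image:

setup_commands:
    # Sketch only: install pip on the bare Ubuntu image, since Ray's
    # default setup assumes it is already available.
    - sudo apt-get update
    - sudo apt-get install -y python3-pip
    # Supplying setup_commands replaces Ray's defaults, so Ray itself
    # has to be installed here as well.
    - pip3 install -U "ray[default]"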

The following is my cluster configuration:

# A unique identifier for the head node and workers of this cluster.
cluster_name: minimal
max_workers: 1
# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-west1
    project_id: [project_id]
    availability_zone: us-west1-a

available_node_types:
  ray_head_default:
    node_config:
      machineType: e2-standard-4
      disks:
        - boot: true
          autoDelete: true
          initializeParams:
            diskSizeGb: 20
            sourceImage: projects/ubuntu-os-cloud/global/images/ubuntu-2304-lunar-amd64-v20231030
            diskType: projects/ubuntu-os-cloud/zones/us-west1-a/diskTypes/pd-ssd
      serviceAccounts:
        - email: ray-autoscaler-sa-v1@[project_id].iam.gserviceaccount.com
    resources:
      CPU: 2

  ray_worker_default:
    min_workers: 0
    max_workers: 1
    node_config:
      machineType: e2-standard-4
      disks:
        - boot: true
          autoDelete: true
          initializeParams:
            diskSizeGb: 20
            sourceImage: projects/ubuntu-os-cloud/global/images/ubuntu-2304-lunar-amd64-v20231030
            diskType: projects/ubuntu-os-cloud/zones/us-west1-a/diskTypes/pd-ssd
      serviceAccounts:
        - email: ray-autoscaler-sa-v1@[project_id].iam.gserviceaccount.com
    resources:
      CPU: 2

head_node_type: ray_head_default
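
For completeness, I launch the cluster with the config above saved as cluster.yaml:

ray up cluster.yaml --yes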

Can you please tell me what I am doing wrong?
