pxssh does not work between compute nodes in a slurm cluster


I'm using the following script for connecting two compute nodes in a slurm cluster.

from getpass import getuser
from socket import gethostname
from pexpect import pxssh
import sys

# node_list and server_socket are defined earlier in the full program
python = sys.executable
worker_command = "%s -m worker" % python + " %i " + server_socket
pid = 0
children = []
for node, ntasks in node_list.items():
    if node == gethostname():
        continue
    pid_range = range(pid, pid + ntasks)
    pid += ntasks
    ssh = pxssh.pxssh()
    ssh.login(node, getuser())
    for worker in pid_range:
        ssh.sendline(worker_command % worker + '&')
    children.append(ssh)

node_list is a dictionary: {'cn000': 28, 'cn001': 28}. worker is a Python module (worker.py) placed in the working directory.

I expected ssh.sendline to behave like pexpect.spawn, but nothing happens after I run the script.

Although an ssh session is established by ssh.login(node, getuser()), the line ssh.sendline(worker_command % worker) seems to have no effect: the script that worker_command should launch never runs.

How can I fix this? Or should I try something else?

How can I create one socket on one compute node and connect it with a socket on another compute node?

1 Answer

A '%s' is missing from the content of worker_command. If it contains something like "/usr/bin/python3 -m worker", then worker_command % worker should result in an error.
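One way to check this is to reproduce the command construction in isolation. This is a sketch, assuming a hypothetical server_socket value (the real one is not shown in the question):

```python
# Reproduce the question's command construction with a hypothetical
# server_socket value to see what worker_command % worker yields.
python = "/usr/bin/python3"          # stands in for sys.executable
server_socket = "tcp://cn000:5555"   # hypothetical socket address

# The '%s' is consumed immediately by '% python'; only the '%i'
# placeholder survives into worker_command.
worker_command = "%s -m worker" % python + " %i " + server_socket
print(worker_command % 0)
# -> /usr/bin/python3 -m worker 0 tcp://cn000:5555

# If no conversion specifier were left in the string, the '%' operator
# would instead raise:
# TypeError: not all arguments converted during string formatting
```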

If not (which is possible, since this source looks like a short excerpt of the original program), append ">> workerprocess.log 2>&1" before the '&', run your program again, and take a look at workerprocess.log on the server. If your $HOME is writable on the server, you should find the error message(s) there.
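As a sketch, the redirect can be appended when building the remote command line. The worker_command value below is a hypothetical stand-in matching the question's format string; with a live pxssh session you would pass the result to ssh.sendline:

```python
# Build the remote command so each worker's stdout/stderr land in a
# log file before the trailing '&' backgrounds it on the remote node.
worker_command = "/usr/bin/python3 -m worker %i tcp://cn000:5555"  # hypothetical

worker = 0
cmd = worker_command % worker + " >> workerprocess.log 2>&1 &"
print(cmd)
# -> /usr/bin/python3 -m worker 0 tcp://cn000:5555 >> workerprocess.log 2>&1 &

# With a live session: ssh.sendline(cmd)
```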