MS MPI Permission errors


I have two machines, both with MS MPI 7.1 installed, one called SERVER and one called COMPUTE. The machines are on a LAN in a simple Windows workgroup (no AD), and both have an account with the same name and password.

Both are running the MSMPILaunchSvc service. Both machines can execute MPI jobs locally, which I verified with the hostname command:

SERVER> mpiexec -hosts 1 SERVER 1 hostname
SERVER
or
COMPUTE> mpiexec -hosts 1 COMPUTE 1 hostname
COMPUTE

in a terminal on the machines themselves.

I have disabled the firewall on both machines to make things easier.

My problem is that I cannot get MPI to run jobs from SERVER on a remote host:

1: SERVER with MSMPILaunchSvc -> COMPUTE with MSMPILaunchSvc

SERVER> mpiexec -hosts 1 COMPUTE 1 hostname -pwd
ERROR: Failed RpcCliCreateContext error 1722

Aborting: mpiexec on SERVER is unable to connect to the smpd service on COMPUTE:8677
Other MPI error, error stack:
connect failed - The RPC server is unavailable.  (errno 1722)

What's even more frustrating is that I am only sometimes prompted for a password. The prompt suggests SERVER\Maarten as the user for COMPUTE, which is the account I am already logged in as on SERVER and which shouldn't exist on COMPUTE (shouldn't it be COMPUTE\Maarten?). Nonetheless, it also fails:

SERVER>mpiexec -hosts 1 COMPUTE 1 hostname.exe -pwd
Enter Password for SERVER\Maarten:
Save Credentials[y|n]? n
ERROR: Failed to connect to SMPD Manager Instance error 1726

Aborting: mpiexec on SERVER is unable to connect to the smpd manager on COMPUTE:50915 error 1726

2: COMPUTE with MSMPILaunchSvc -> SERVER with MSMPILaunchSvc

COMPUTE> mpiexec -hosts 1 SERVER 1 hostname -pwd
ERROR: Failed RpcCliCreateContext error 5

Aborting: mpiexec on COMPUTE is unable to connect to the smpd service on SERVER:8677
Other MPI error, error stack:
connect failed - Access is denied.  (errno 5)

3: COMPUTE with MSMPILaunchSvc -> SERVER with smpd daemon

Aborting: mpiexec on COMPUTE is unable to connect to the smpd service on SERVER:8677
Other MPI error, error stack:
connect failed - Access is denied.  (errno 5)

4: SERVER with MSMPILaunchSvc -> COMPUTE with smpd daemon

ERROR: Failed to connect to SMPD Manager Instance error 1726

Aborting: mpiexec on SERVER is unable to connect to the smpd manager on COMPUTE:51022 error 1726
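
For reference, the numeric codes in the four attempts above are standard Windows (Win32/RPC) status codes rather than anything specific to MS MPI. The sketch below is only a small lookup table for the three codes that appear here; the helper name `describe` is made up for illustration.

```python
# Win32 / RPC status codes that appear in the mpiexec output above.
# Names and messages are the standard Windows ones; the table is
# deliberately limited to the codes seen in this question.
WIN32_ERRORS = {
    5: ("ERROR_ACCESS_DENIED", "Access is denied."),
    1722: ("RPC_S_SERVER_UNAVAILABLE", "The RPC server is unavailable."),
    1726: ("RPC_S_CALL_FAILED", "The remote procedure call failed."),
}


def describe(code):
    """Return a readable description for a code, or a fallback string."""
    name, message = WIN32_ERRORS.get(code, ("UNKNOWN", "not in this table"))
    return f"{code}: {name} ({message})"


if __name__ == "__main__":
    for code in (5, 1722, 1726):
        print(describe(code))
```

Read this way, error 5 points at authentication or authorization, while 1722 and 1726 point at the RPC connection itself.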

Update:

With the smpd daemon running on both nodes, I get this error:

[-1:9796] Authentication completed. Successfully obtained Context for Client.
[-1:9796] version check complete, using PMP version 3.
[-1:9796] create manager process (using smpd daemon credentials)
[-1:9796] smpd reading the port string from the manager
[-1:9848] Launching smpd manager instance.
[-1:9848] created set for manager listener, 376
[-1:9848] smpd manager listening on port 51149
[-1:9796] closing the pipe to the manager
[-1:9848] Authentication completed. Successfully obtained Context for Client.
[-1:9848] Authorization completed.
[-1:9848] version check complete, using PMP version 3.
[-1:9848] Received session header from parent id=1, parent=0, level=0
[01:9848] Connecting back to parent using host SERVER and endpoint 17979
[01:9848] Previous attempt failed with error 5, trying to authenticate without Kerberos
[01:9848] Failed to connect back to parent error 5.
[01:9848] ERROR: Failed to connect back to parent 'ncacn_ip_tcp:SERVER:17979' error 5
[01:9848] smpd manager successfully stopped listening.
[01:9848] SMPD exiting with error code 4294967293.

and on the host:

[-1:12264] Launching SMPD service.
[-1:12264] smpd listening on port 8677
[-1:12264] Authentication completed. Successfully obtained Context for Client.
[-1:12264] version check complete, using PMP version 3.
[-1:12264] create manager process (using smpd daemon credentials)
[-1:12264] smpd reading the port string from the manager
[-1:16668] Launching smpd manager instance.
[-1:16668] created set for manager listener, 364
[-1:16668] smpd manager listening on port 18033
[-1:12264] closing the pipe to the manager
[-1:16668] Authentication completed. Successfully obtained Context for Client.
[-1:16668] Authorization completed.
[-1:16668] version check complete, using PMP version 3.
[-1:16668] Received session header from parent id=1, parent=0, level=0
[01:16668] Connecting back to parent using host SERVER and endpoint 18031
[01:16668] Authentication completed. Successfully obtained Context for Client.
[01:16668] Authorization completed.
[01:16668] handling command SMPD_CONNECT src=0
[01:16668] now connecting to COMPUTE
[01:16668] 1 -> 2 : returning SMPD_CONTEXT_LEFT_CHILD
[01:16668] using spn msmpi/COMPUTE to contact server
[01:16668] SERVER posting a re-connect to COMPUTE:51161 in left child context.
[01:16668] ERROR: Failed to connect to SMPD Manager Instance error 1726
[01:16668] sending abort command to parent context.
[01:16668] posting command SMPD_ABORT to parent, src=1, dest=0.
[01:16668] ERROR: smpd running on SERVER is unable to connect to smpd service on COMPUTE:8677
[01:16668] Handling cmd=SMPD_ABORT result
[01:16668] cmd=SMPD_ABORT result will be handled locally
[01:16668] parent terminated unexpectedly - initiating cleaning up.
[01:16668] no child processes to kill - exiting with error code -1
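
A side note on the exit code 4294967293 in the first log: it looks like a negative value printed as an unsigned 32-bit number. A minimal sketch of the conversion, assuming the usual two's-complement reading (the helper name `to_signed32` is hypothetical, and what the resulting value means inside smpd is not documented here):

```python
def to_signed32(value):
    """Interpret an unsigned 32-bit value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    # If the sign bit is set, subtract 2**32 to get the negative value.
    return value - 0x100000000 if value & 0x80000000 else value


if __name__ == "__main__":
    print(to_signed32(4294967293))  # prints -3
```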

Answer:

After some trial and error, I found that these and other nonspecific errors come up when running MS MPI across machines with different configurations (in my case a mix of HPC Cluster 2008 and HPC Cluster 2012 with MSMPI).

The solution was to downgrade all nodes to Windows Server 2008 R2 with HPC Cluster 2008. Because I don't use AD, I had to fall back to using the SMPD daemon and add firewall rules for it (skipping the cluster management tools altogether).
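
One way to check that a firewall rule for smpd is in effect is to probe the smpd port from the other machine. This is a minimal sketch, not part of MS MPI: it assumes the port 8677 shown in the question's output and the host names SERVER and COMPUTE, and the function name `port_is_open` is made up for illustration.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Host names and port are taken from the question; adjust as needed.
    for host in ("SERVER", "COMPUTE"):
        state = "reachable" if port_is_open(host, 8677) else "NOT reachable"
        print(f"{host}:8677 is {state}")
```

This only tests the smpd listener port. The smpd manager instances in the question's logs listen on other ports that change from run to run (for example 51149 and 18033), so a successful probe here does not prove those are reachable too.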