Socket between submit and execute hosts closed unexpectedly

I am attempting to run a SAS file on a cluster. The contents of the SAS file myprogram.sas are shown below:

data a;
   input myvar1;
   myvar2 = myvar1 + 100 ;
   datalines;
       0
       1
       2
       3
       4
       5
;
proc print;
run;

I create a Condor file to execute the SAS file on the cluster. The contents of the Condor file mycondorcode.condor are shown below, except that I have altered the email address:

####################
#
# Submit SAS code to Condor cluster
#
# Submit this to run on the cluster with condor_submit THIS-FILENAME.condor
#
####################

UNIVERSE                = vanilla
NOTIFICATION            = Complete
NOTIFY_USER             = [email protected]

REQUIREMENTS            = (OpSys == "LINUX" && HAS_SAS )
GETENV                  = TRUE

EXECUTABLE              = /usr/local/bin/sas
ARGUMENTS               = -nodms -noterminal
INPUT                   = myprogram.sas
OUTPUT                  = $(INPUT).out
ERROR                   = $(INPUT).err
LOG                     = $(INPUT).log

QUEUE

I copy the SAS and Condor files to the cluster using WinSCP. My guess is that WinSCP converts the files to a format the cluster can read, much as the dos2unix command converts line endings, but I have not verified this.
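
The line-ending guess can be checked directly on the cluster rather than assumed. The following is a minimal sketch, assuming Python 3 is available on the login node; the helper names count_crlf and to_unix are hypothetical and chosen only for illustration. It counts Windows-style CRLF line endings in a file and, if wanted, rewrites the file with Unix LF endings.

```python
from pathlib import Path


def count_crlf(path):
    """Return the number of Windows-style CRLF (\\r\\n) line endings in the file."""
    data = Path(path).read_bytes()
    return data.count(b"\r\n")


def to_unix(path):
    """Rewrite the file in place, replacing every CRLF with a single LF."""
    p = Path(path)
    p.write_bytes(p.read_bytes().replace(b"\r\n", b"\n"))


if __name__ == "__main__":
    # File names taken from the question; skipped if not in the current directory.
    for name in ("myprogram.sas", "mycondorcode.condor"):
        if Path(name).exists():
            print(name, "contains", count_crlf(name), "CRLF line endings")
```

A count of zero for both files would suggest that line endings are not the cause of the problem.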

Then I submit the SAS file to the cluster using PuTTY by typing:

condor_submit mycondorcode.condor

When I type:

condor_q

I see:

 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
58683.0   markm          11/24 14:41   0+00:00:00 I  0   0.0  sas -nodms -noterm

The status (ST) remains I (idle) no matter how long I wait.

I can see a text file in my directory called myprogram.sas.log, which contains the following (except that I have altered the email address and the number that looks like it could be an IP address):

000 (58683.000.000) 11/24 14:41:55 Job submitted from host: <14.4.104.1:42259>
...
022 (58683.000.000) 11/24 14:42:56 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to [email protected] <14.4.104.23:50176>
...
024 (58683.000.000) 11/24 14:42:56 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to [email protected], rescheduling job
...
022 (58683.000.000) 11/24 14:43:56 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to [email protected] <14.4.104.23:50176>
...
024 (58683.000.000) 11/24 14:43:56 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to [email protected], rescheduling job
...
022 (58683.000.000) 11/24 14:44:56 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to [email protected] <14.4.104.23:50176>
...
024 (58683.000.000) 11/24 14:44:56 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to [email protected], rescheduling job
...
022 (58683.000.000) 11/24 14:45:57 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to [email protected] <14.4.104.23:50176>
...
024 (58683.000.000) 11/24 14:45:57 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to [email protected], rescheduling job
...
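
To make the repeating pattern in this log easier to see, here is a minimal Python sketch that tallies the event header lines. The file name myprogram.sas.log is an assumption based on the LOG line in the submit file, the function name summarize is hypothetical, and the regular expression covers only the header format visible in the excerpt above. It is not an official HTCondor log parser.

```python
import re
from collections import Counter
from pathlib import Path

# Header line of one event in the job log, for example:
# "022 (58683.000.000) 11/24 14:42:56 Job disconnected, attempting to reconnect"
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<job>\d+\.\d+\.\d+)\) "
    r"(?P<date>\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<msg>.*)$"
)


def summarize(log_text):
    """Count events by (event code, message); indented detail lines are ignored."""
    counts = Counter()
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match:
            counts[(match.group("code"), match.group("msg"))] += 1
    return counts


if __name__ == "__main__":
    log_path = Path("myprogram.sas.log")  # assumed name, from LOG = $(INPUT).log
    if log_path.exists():
        for (code, msg), n in sorted(summarize(log_path.read_text()).items()):
            print(f"{code}  x{n}  {msg}")
```

For the excerpt above, the tally is one 000 event (submitted), four 022 events (disconnected), and four 024 events (reconnection failed), all naming the same execute machine address.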

I have never successfully used this cluster, although I have run R on a different one, and I know virtually nothing more about the current cluster. Based on what I have provided above, does it appear that I am doing something incorrectly, or is there a connection problem that must be addressed by the IT department that operates the cluster?

Thank you for any suggestions on how I might resolve this problem from my Windows desktop, given that I am almost entirely unfamiliar with Unix and clusters in general. Perhaps I am doing something incorrectly with WinSCP. Should I try using dos2unix instead?
