I have some Python code that takes approximately 12 hours to run on my laptop (macOS, 16 GB 2133 MHz LPDDR3). The code loops over a few thousand iterations, doing some intensive processing at each step, so it makes sense to parallelise the problem with MPI. I have access to a Slurm cluster, where I have built mpi4py (for Python 2.7) against their OpenMPI implementation with mpicc. I then submit the following submission script with sbatch --exclusive mysub.sbatch:
#!/bin/bash
#SBATCH -p par-multi
#SBATCH -n 50
#SBATCH --mem-per-cpu=8000
#SBATCH -t 48:00:00
#SBATCH -o %j.log
#SBATCH -e %j.err
module add eb/OpenMPI/gcc/3.1.1
mpirun python ./myscript.py
which should split the tasks across 50 processes, each with an 8 GB memory allocation. My code does something like the following:
import numpy as np
import pickle
from mpi4py import MPI  # "import mpi4py" alone does not expose the MPI module

COMM = MPI.COMM_WORLD

def split(container, count):
    # distribute items round-robin so each rank gets a near-equal share
    return [container[_i::count] for _i in range(count)]

def read():
    # function which reads a series of pickle files from my home directory
    return data

def function1(inputs):
    # some process 1
    return f1

def function2(f1):
    # some process 2
    return f2

def main_function(inputs):
    # some process which also calls function1 and function2
    f1 = function1(inputs)
    f2 = function2(f1)
    result = None  # placeholder: some more processing
    return result

### define global variables and read data ###
data = read()
N = 5000
# etc...

selected_variables = range(N)

if COMM.rank == 0:
    splitted_jobs = split(selected_variables, COMM.size)
else:
    splitted_jobs = None

scattered_jobs = COMM.scatter(splitted_jobs, root=0)

results = []
for index in scattered_jobs:
    outputs = main_function(data[index])
    results.append(outputs)

results = COMM.gather(results, root=0)

if COMM.rank == 0:
    all_results = []
    for r in results:
        all_results.extend(r)  # gather returns one list per rank, so flatten
    f = open('result.pkl', 'wb')
    pickle.dump(np.array(all_results), f, protocol=2)
    f.close()
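For reference, the round-robin split means rank r processes indices r, r + size, r + 2*size, and so on, so the gathered results come back grouped by rank rather than in the original index order. A quick standalone check of this behaviour (pure Python, no MPI needed; the 4-rank size is just an illustration):

```python
# Simulate the split/scatter/gather pattern without MPI to see
# how results come back. Four "ranks" purely for illustration.
def split(container, count):
    # round-robin: rank i gets items i, i+count, i+2*count, ...
    return [container[i::count] for i in range(count)]

size = 4
jobs = split(list(range(10)), size)
# each sublist is what one rank would receive from COMM.scatter
print(jobs)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]

# COMM.gather returns one list per rank; flattening concatenates
# rank-by-rank, so the order is NOT the original 0..9 order:
gathered = [[x * x for x in chunk] for chunk in jobs]
flat = [r for chunk in gathered for r in chunk]
print(flat)  # [0, 16, 64, 1, 25, 81, 4, 36, 9, 49]
```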
The maximum run time I can allocate for a job is 48 hours, and the job still has not finished by then. Could anyone tell me if there is something in either my submission script or my code that is likely causing it to be this slow?
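For scale, here is the back-of-envelope estimate I had in mind, assuming the problem is embarrassingly parallel and scales roughly linearly across ranks:

```python
# Ideal-scaling estimate: serial runtime divided across the ranks.
serial_hours = 12.0
ranks = 50
ideal_minutes = serial_hours * 60.0 / ranks
print(ideal_minutes)  # 14.4 -> roughly a quarter of an hour per rank
```

so even with substantial overhead, 48 hours should be far more than enough.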
Thanks