mpirun running job serially with only one core


I installed MPICH 4.1 on an Ubuntu machine using the GNU compilers. At first I ran a job successfully with mpirun on 36 cores, but now when I try to run the same job, it runs serially on only one core. The output of mpirun -np 36 ./wrf.exe is now:

 starting wrf task            0  of            1
 starting wrf task            0  of            1
 starting wrf task            0  of            1
 starting wrf task            0  of            1
 starting wrf task            0  of            1
 starting wrf task            0  of            1
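
The pattern in this output can be checked mechanically. Below is a minimal sketch, assuming the log lines have exactly the format shown above; the names summarize_tasks and looks_like_singletons are hypothetical helpers, not part of WRF or MPICH. It parses each "task N of M" line and reports whether every process claims to be task 0 of a 1-process job.

```python
import re

# Matches lines such as " starting wrf task            0  of            1"
# (format assumed from the output shown in the question).
TASK_LINE = re.compile(r"starting wrf task\s+(\d+)\s+of\s+(\d+)")


def summarize_tasks(output: str):
    """Return a list of (rank, size) pairs parsed from the launcher output."""
    return [(int(m.group(1)), int(m.group(2))) for m in TASK_LINE.finditer(output)]


def looks_like_singletons(pairs):
    """True when several processes each claim to be rank 0 of a 1-process job."""
    return len(pairs) > 1 and all(rank == 0 and size == 1 for rank, size in pairs)


sample = """
 starting wrf task            0  of            1
 starting wrf task            0  of            1
 starting wrf task            0  of            1
"""

pairs = summarize_tasks(sample)
print(len(pairs), looks_like_singletons(pairs))
```

If the check is true, each process initialised MPI on its own rather than joining one 36-process job. That is often seen when the mpirun being used and the MPI library the executable was built against do not match, though this is only a possibility here and not a confirmed cause.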

Running mpivars fails with the following error:

Abort(470406415): Fatal error in internal_Init_thread: Other MPI error, error stack:
internal_Init_thread(67): MPI_Init_thread(argc=0x7fff8044f34c, argv=0x7fff8044f340, required=0, provided=0x7fff8044f350) failed
MPII_Init_thread(222)...:  gpu_init failed

But the machine does not have a GPU. The MPI version command gives:

HYDRA build details:
    Version:                                 4.1
    Release Date:                            Fri Jan 27 13:54:44 CST 2023
    CC:                              gcc      
    Configure options:                       '--disable-option-checking' '--prefix=/home/MODULES' '--cache-file=/dev/null' '--srcdir=.' 'CC=gcc' 'CFLAGS= -O2' 'LDFLAGS=' 'LIBS=' 'CPPFLAGS= -DNETMOD_INLINE=__netmod_inline_ofi__ -I/home/MODULES/mpich-4.1/src/mpl/include -I/home/MODULES/mpich-4.1/modules/json-c -D_REENTRANT -I/home/MODULES/mpich-4.1/src/mpi/romio/include -I/home/MODULES/mpich-4.1/src/pmi/include -I/home/MODULES/mpich-4.1/modules/yaksa/src/frontend/include -I/home/MODULES/mpich-4.1/modules/libfabric/include'
    Process Manager:                         pmi
    Launchers available:                     ssh rsh fork slurm ll lsf sge manual persist
    Topology libraries available:            hwloc
    Resource management kernels available:   user slurm ll lsf sge pbs cobalt
    Demux engines available:                 poll select

What could be the reason for this?

Thanks in advance.
