Slowing down with multiple concurrent MPI LAMMPS jobs

451 Views Asked by At

I'm running the LAMMPS simulation with AMD 2990WX (Ubuntu 18.04).

When I run only one LAMMPS job using mpirun like below.

    #!/bin/sh

    LAMMPS_HOME=/APP/LAMMPS/src
    MPI_HOME=/APP/LIBS/OPENMPI2

    Tf=0.30

    $MPI_HOME/bin/mpirun -np 8 --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.020 -var Tf ${Tf}

I have no problem and the simulation go as I want.

But when I run the script below. Each LAMMPS job takes almost exactly 3 times of single LAMMPS job. Thus, I have no performance gain in parallel environment (since 3 jobs running in the speed of 1/3 of single job)

    #!/bin/sh

    LAMMPS_HOME=/APP/LAMMPS/src
    MPI_HOME=/APP/LIBS/OPENMPI2

    Tf=0.30

    $MPI_HOME/bin/mpirun -np 8 --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.020 -var Tf ${Tf} &
    $MPI_HOME/bin/mpirun -np 8 --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.025 -var Tf ${Tf} &
    $MPI_HOME/bin/mpirun -np 8 --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.030 -var Tf ${Tf}

Without host file my_host, It's the same. the hostfile is as below:

    <hostname> slots=32 max-slots=32

I installed openmpi with --with-cuda, fftw with --enable-shared, and LAMMPS with a few packages.

I have tried openmpi v1.8, v3.0, v4.0 and fftw v3.3.8. RAM is enough and storage is also enough. I also have checked load average and core usage. They show the machine uses 24 cores (or the corresponding load) when I run the second script. When I run concurrently duplicates of the first script in separate terminal (i.e. sh first.sh in each terminal), the same problem happens.

Is there any problem with my use of bash script? or Is there any known issue with mpirun (or LAMMPS) + Ryzen?

Update

I have tested following script:

/bin/sh

LAMMPS_HOME=/APP/LAMMPS/src
MPI_HOME=/APP/LIBS/OPENMPI2

Tf=0.30

$MPI_HOME/bin/mpirun --cpu-set 0-7 --bind-to core -np 8 --report-bindings --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.020 -var Tf ${Tf} &
$MPI_HOME/bin/mpirun --cpu-set 8-15 --bind-to core -np 8 --report-bindings --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.025 -var Tf ${Tf} &
$MPI_HOME/bin/mpirun --cpu-set 16-23 --bind-to core -np 8 --report-bindings --hostfile my_host $LAMMPS_HOME/lmp_lmp_mpi -in $PWD/../01_Annealing/in.01_Annealing -var MaxShear 0.030 -var Tf ${Tf}

And the result shows something like:

[<hostname>:09617] MCW rank 4 bound to socket 0[core 4[hwt 0-1]]: [../../../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 5 bound to socket 0[core 5[hwt 0-1]]: [../../../../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 6 bound to socket 0[core 6[hwt 0-1]]: [../../../../../../BB/../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 7 bound to socket 0[core 7[hwt 0-1]]: [../../../../../../../BB/../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09617] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../../../../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09619] MCW rank 4 bound to socket 0[core 20[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../BB/../../../../../../../../../../..]
[<hostname>:09619] MCW rank 5 bound to socket 0[core 21[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../BB/../../../../../../../../../..]
[<hostname>:09619] MCW rank 6 bound to socket 0[core 22[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../BB/../../../../../../../../..]
[<hostname>:09619] MCW rank 7 bound to socket 0[core 23[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../../../../../BB/../../../../../../../..]
[<hostname>:09619] MCW rank 0 bound to socket 0[core 16[hwt 0-1]]: [../../../../../../../../../../../../../../../../BB/../../../../../../../../../../../../../../..]
[<hostname>:09619] MCW rank 1 bound to socket 0[core 17[hwt 0-1]]: [../../../../../../../../../../../../../../../../../BB/../../../../../../../../../../../../../..]
[<hostname>:09619] MCW rank 2 bound to socket 0[core 18[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../BB/../../../../../../../../../../../../..]
[<hostname>:09619] MCW rank 3 bound to socket 0[core 19[hwt 0-1]]: [../../../../../../../../../../../../../../../../../../../BB/../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 4 bound to socket 0[core 12[hwt 0-1]]: [../../../../../../../../../../../../BB/../../../../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 5 bound to socket 0[core 13[hwt 0-1]]: [../../../../../../../../../../../../../BB/../../../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 6 bound to socket 0[core 14[hwt 0-1]]: [../../../../../../../../../../../../../../BB/../../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 7 bound to socket 0[core 15[hwt 0-1]]: [../../../../../../../../../../../../../../../BB/../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 0 bound to socket 0[core 8[hwt 0-1]]: [../../../../../../../../BB/../../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 1 bound to socket 0[core 9[hwt 0-1]]: [../../../../../../../../../BB/../../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 2 bound to socket 0[core 10[hwt 0-1]]: [../../../../../../../../../../BB/../../../../../../../../../../../../../../../../../../../../..]
[<hostname>:09618] MCW rank 3 bound to socket 0[core 11[hwt 0-1]]: [../../../../../../../../../../../BB/../../../../../../../../../../../../../../../../../../../..]

I don't have much knowledge about MPI but, to me, it does not show anything odd. Is there any problem?

0

There are 0 best solutions below