Background:
I need to run a huge climate-simulation computation over more than 800 [GB] of data ( covering the past 50 years and the next 80 years ).
For this I'm using RegCM4, which runs on Linux ( I am using Ubuntu ). The most powerful system we have has an Intel Xeon processor with 20 logical CPUs; we also have almost 20 smaller, less powerful Intel i7 octa-core machines.
Running the simulations on the single most powerful system would take more than a month, so I've been trying to set up a compute cluster out of the available resources.
( FYI: RegCM supports parallel processing with MPI. )
Specs:

Computer   Sockets   Cores_per_socket   Threads_per_core   Logical_CPUs   RAM      Drives
node0      1         10                 2                  20             32 GB    256 GB SSD + 2 TB HDD
node1      1         4                  2                  8              8 GB     1 TB HDD
node2      1         4                  2                  8              8 GB     1 TB HDD

-> I use mpich v3 ( I don't remember the exact version number ).
And so on... ( all nodes other than node0 are the same as node1 ).
All nodes have 1 Gbps Ethernet cards.
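( The topology above can be cross-checked on each node with standard Linux tools, e.g.: )
$ lscpu | grep -E 'Socket|Core|Thread'   # sockets, cores per socket, threads per core
$ nproc                                  # logical CPUs visible to the OS
$ free -h                                # installed RAM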
For testing purposes I set up a small simulation analyzing 6 days of climate. All test simulations used the same parameters and model settings.
All nodes boot from their own HDD. node0 runs Ubuntu 16.04 LTS; the other nodes run Ubuntu 14.04 LTS.
How I started? I followed the steps given here.
- Connected node1 and node2 with a Cat 6 cable and assigned them static IPs ( leaving node0 out for now ).
- Edited /etc/hosts with the IPs and the corresponding names - node1 and node2, as given in the table above.
- Set up password-less ssh login between both - success.
- Created a folder in /home/user on node1 ( which will be the master in this test ), exported the folder ( /etc/exports ), mounted this folder over NFS on node2 and edited /etc/fstab on node2 - success. ( A sketch of these entries follows this list. )
- Used iotop, bmon and htop to monitor disk read/write, network traffic and CPU usage respectively.
- Ran my regcm over the cluster using 14 cores of both machines - success; the command used is shown below.
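( For reference, the relevant entries looked along these lines - the IP addresses and the shared-folder name here are illustrative placeholders, not the exact values used: )

# /etc/hosts ( on both nodes )
192.168.1.101   node1
192.168.1.102   node2

# /etc/exports ( on node1, the NFS server )
/home/user/regcm_share   192.168.1.0/24(rw,sync,no_subtree_check)

# /etc/fstab ( on node2, mounting the share at the same path )
node1:/home/user/regcm_share   /home/user/regcm_share   nfs   defaults   0   0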
$ mpirun -np 14 -hosts node1,node2 ./bin/regcmMPI test.in
Result of this test: faster computation than single-node processing.
Now I tried the same with node0 ( see the specs above ).
-> On node0 I am working on the SSD.
-> It works fine, but the problem is the time factor when connected as a cluster.
Here's the summary of results:
- first, using node0 only - no use of the cluster:
$ mpirun -np 20 ./bin/regcmMPI test.in

nodes   cores_used   run_time_reported_by_regcm [s]   actual_time_taken [s] (approx)
node0   20           59.58                            60
node0   16           65.35                            66
node0   14           73.33                            74

This is okay.
Now, using the cluster ( use the following legend to read the table below ):
rt          = CPU run time reported by regcm, in sec
a-rt        = actual time taken, in sec (approx)
LAN         = max LAN speed achieved ( receive/send ), in MBps
disk(0 / 1) = max disk write speed at node0 / at node1, in MBps
nodes*   cores   rt    a-rt   LAN      disk( 0 / 1 )
1,0      16      148   176    100/30   90 / 50
0,1      16      145   146    30/30     6 / 0
1,0      14      116   143    100/25   85 / 75
0,1      14      121   121    20/20     7 / 0
*note:
1,0 ( e.g. for 16 cores ) means:
$ mpirun -np 16 -hosts node1,node0 ./bin/regcmMPI test.in
0,1 ( e.g. for 16 cores ) means:
$ mpirun -np 16 -hosts node0,node1 ./bin/regcmMPI test.in
Actual run time was calculated manually from the start and end times reported by regcm.
We can see above that the LAN usage and the drive write speed were significantly different between the two options - 1. passing node1,node0 as hosts, and 2. passing node0,node1 as hosts - note the order.
Also, running on a single node is faster than running on the cluster. Why?
I also ran another set of tests, this time using a hostfile ( named hostlist ) whose contents were:
node0:16
node1:6
Then I ran the following command:
$ mpirun -np 22 -f hostlist ./bin/regcmMPI test.in
The CPU run time reported was 101 [s], the actual run time was 1 min 42 sec ( 102 [s] ), the LAN speed achieved was around 10-15 [MB/s] and the disk write speed was around 7 [MB/s].
The best result was obtained when I used the same hostfile setting but ran the code with 20 processes, thus under-subscribing the 22 available slots:
$ mpirun -np 20 -f hostlist ./bin/regcmMPI test.in
CPU runtime : 90 [s]
Actual run time : 91 [s]
LAN : 10 [MB/s]
When I reduced the core count from 20 down to 18, the run time increased to 102 [s].
I have not yet connected node2 to the system.
Questions:
- Is there a way to achieve faster computation? Am I doing something wrong?
- A single machine with 14 cores computes faster than the cluster with 22 or 20 cores. Why is that happening?
- What is the optimum number of cores to use for time efficiency?
- How can I achieve the best performance with the available resources?
- Is there a good mpich usage manual that can answer my questions? ( I could not find any such info. )
- Sometimes using fewer cores gives a faster completion time than using more cores, even though I am not using all available cores and am leaving 1 or 2 cores for the OS and other operations on each node. Why is that happening?
While the advice in the comments above - to contact a regional or national HPC centre - is fair and worth following, I can imagine how hard it can be to get any remarkable processing quota granted if both deadlines and budget are moving against you.
INTRO:
Simplified answers to questions on a yet-hidden complex system:
1: Is there a way to achieve faster computation?
Yes.
Am I doing something wrong?
Not directly.
2: A single machine with 14 cores computes faster than the cluster with 22 or 20 cores. Why is that happening?
You pay more than you get. It is that simple. NFS - a network-distributed abstraction of a filesystem - is possible, but you pay immense costs for the ease of using it once performance becomes the ultimate target. In general, if the lump sum of all the extra costs paid for ( data distribution + the high add-on overheads ) is higher than the net effect of splitting the [PARALLEL]-distributable workload over a still-low number of CPU cores, an actual slowdown instead of a speedup appears. This is the common main suspect here ( not even mentioning switching off hyper-threading in the BIOS for compute-intensive workloads ). A direct measurement of this cost is sketched below.
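( A quick way to see this cost directly is to compare raw write throughput on the local disk against the same test through the NFS-mounted folder - a sketch with illustrative paths, using the standard dd tool: )
$ dd if=/dev/zero of=/home/user/local_test.bin bs=1M count=1024 oflag=direct
$ dd if=/dev/zero of=/home/user/regcm_share/nfs_test.bin bs=1M count=1024 oflag=direct
( The difference between the two reported rates is the toll the network-distributed abstraction charges on every write. )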
3: What is the optimum number of cores to use for time efficiency?
First identify the biggest observed processing bottleneck { CPU | MEMORY | LAN | fileIO | a-poor-algorithm }, and only then seek the best step to improve speed ( keep moving iteratively forward along this { cause: remedy }-chain while performance still grows ). Never attempt to go in the reverse order. A sketch of simple bottleneck probes follows below.
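( For instance, the LAN and MEMORY candidates can be probed directly with standard tools - a sketch, assuming iperf3 is installed on both nodes: )
$ iperf3 -s                  # on node0: start a throughput server
$ iperf3 -c node0 -t 10      # on node1: measure the achievable LAN throughput towards node0
$ vmstat 1                   # during a regcm run: watch CPU wait states and memory pressure / swapping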
4: How can I achieve the best performance with the available resources?
This one is the most interesting and will require more work to be done ( ref. below ).
5: Is there a good mpich usage manual that can answer my questions?
There is no colour of LAN cable that would decide its actual speed and performance, or that would assure its suitability for some specific use - and likewise no manual alone can answer this; it is the overall system architecture that matters.
6: Sometimes using fewer cores gives a faster completion time than using more cores, even though not all available cores are used and 1 or 2 cores are left for the OS and other operations on each node. Why is that happening?
Ref. [ item 2 above ].
SOLUTION:
What can one always do about this design dilemma?
Before taking any step further, do your best to understand both the original Amdahl's Law and its overhead-strict re-formulation. Without mastering this basis, nothing else will help you decide the performance-hunting dilemma - the ( duality ) of fair accounting of both { -costs +benefits }.
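( A sketch of the idea, with p the parallelisable fraction of the work and N the number of CPU cores; the notation p_{SO}, p_{TO} for the add-on parallel setup and termination overheads, expressed as fractions of the original run time, is one common way to write the overhead-strict form:

$$ S_{classical} = \frac{1}{ ( 1 - p ) + \frac{p}{N} } $$

$$ S_{overhead\text{-}strict} = \frac{1}{ ( 1 - p ) + p_{SO} + \frac{p}{N} + p_{TO} } $$

Whenever ( p_{SO} + p_{TO} ) outweigh the ( p - p/N ) saved by going parallel, S drops below 1 and the cluster runs slower than the single node - exactly the effect measured above. )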
A narrow view:
Prefer tests to guesses ( +1 for running tests ); a minimal placement test is sketched right below.
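( For example, before timing anything, it is worth verifying where the MPI ranks actually land - a minimal sketch, reusing the hostlist file from the question: )
$ mpirun -np 22 -f hostlist hostname | sort | uniq -c
( Each line of the output counts how many ranks were placed on each node, confirming whether the node0:16 / node1:6 split really happened. )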
mpich is a tool for making your code run distributed and for all the related process management and synchronisation. While weather simulations may enjoy a well-established locality of influence ( less inter-process communication and synchronisation is mandatory, though the actual code decides how much of it really happens ), the costs of the data-related transfers will still dominate ( ref. below for the orders of magnitude ).
If you cannot modify the code, you have to live with it and can only try to change the hardware that mediates the flow ( from a 1 Gbps interface to 10 GbE or even 40 GbE fabrics, if benchmarking tests support that and the budget permits ).
If you can change the code, take a look at sample test cases, demonstrated as a methodology for the principled approach of isolating the root cause of the actual bottleneck, and keep the { cause: remedy, ... } iterations as a method to fix things that can still go better.
A broader view:
How long will it take to read ( N ) blocks from a 0.8 [TB] disk file and get ( N - 1 ) of them just sent across a LAN? ( A back-of-envelope estimate follows the reference figures below. )
For a rough estimate, let's refresh a few facts about how these things actually work:
Processors differ a lot in their principal internal ( NUMA, in effect ) architectures.
Yet, while such Intel-published details influence any and all performance planning, these figures are neither guaranteed nor constant, as Intel warns in its published remarks.
I like the table where the orders of magnitude are visible; someone might prefer a visual form, where colours "shift" the paradigm, based originally on Peter Norvig's post.
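( For reference - the canonical approximate figures, after the well-known list popularised by Peter Norvig and Jeff Dean; exact values vary with each silicon generation: )

L1 cache reference                        ~          0.5 [ns]
Branch mispredict                         ~          5   [ns]
L2 cache reference                        ~          7   [ns]
Main memory reference                     ~        100   [ns]
Send 1 [kB] over a 1 Gbps network         ~     10,000   [ns]
Read 4 [kB] randomly from SSD             ~    150,000   [ns]
Read 1 [MB] sequentially from memory      ~    250,000   [ns]
Read 1 [MB] sequentially from SSD         ~  1,000,000   [ns]
Disk seek                                 ~ 10,000,000   [ns]
Read 1 [MB] sequentially from disk        ~ 20,000,000   [ns]

( Applying these to the question above - a back-of-envelope sketch, assuming sequential HDD reads: 0.8 [TB] is ~800,000 [MB]; at ~20 [ms] per sequentially-read MB the disk alone costs ~16,000 [s] ( ~4.5 hours ), and shipping almost all of it across a 1 Gbps LAN at a ~125 [MB/s] theoretical peak adds at least another ~6,400 [s] ( ~1.8 hours ) - all before a single useful simulation step is computed. )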
If my budget, deadlines and the simulation software permitted, I would prefer ( performance-wise, due to latency masking and data-locality-based processing ) to go without the mpich layer and instead use the maximum number of CPU + memory-controller channels per CPU.
With common sense for an architecture-optimised processing design, it has been shown that the same results could be achieved in pure-[SERIAL] code even ~50x to ~100x faster than the best case of the "original" code, or of code written with silicon-architecture-"naive" programming tools and improper paradigms.
Today's systems are capable of serving vast simulation needs this way, without spending more than is received.
I hope your conditions will also allow you to go smartly in this performance-subordinated direction, to cut the simulation run time to way less than a month.