Background:
I need to run a huge climate-simulation computation over more than 800 [GB] of data ( covering the past 50 years and the next 80 years ).
For this I'm using RegCM4, which runs on Linux ( I am using Ubuntu ). The most powerful system we have has an Intel Xeon processor with 20 logical CPUs; we also have almost 20 smaller, less powerful Intel i7 octa-core machines.
Running the simulations on the single most powerful system would take more than a month, so I've been trying to set up a compute cluster out of the available resources.
( FYI: RegCM supports parallel processing with MPI. )
Specs:

Computer   Sockets   Cores_per_socket   Threads_per_core   Logical_CPUs   RAM      Drives
node0      1         10                 2                  20             32 GB    256 GB SSD + 2 TB HDD
node1      1         4                  2                  8              8 GB     1 TB HDD
node2      1         4                  2                  8              8 GB     1 TB HDD

-> I use mpich v3 ( I don't remember the exact version number ).
And so on... ( all nodes other than node0 are the same as node1 ).
All nodes have 1 Gbps Ethernet cards.
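( The topology above can be cross-checked on each node with standard Linux tools, e.g.: )
$ lscpu | grep -E 'Socket|Core|Thread'   # sockets, cores per socket, threads per core
$ nproc                                  # logical CPUs visible to the OS
$ free -h                                # installed RAM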
For testing purposes I set up a small simulation analyzing 6 days of climate. All test simulations used the same parameters and model settings.
All nodes boot from their own HDD. node0 runs Ubuntu 16.04 LTS; the other nodes run Ubuntu 14.04 LTS.
How I started? I followed the steps given here.
- Connected node1 and node2 with a Cat 6 cable and assigned them static IPs ( leaving node0 out for now ).
- Edited /etc/hosts with the IPs and the corresponding names - node1 and node2, as given in the table above.
- Set up password-less ssh login between both - success.
- Created a folder in /home/user on node1 ( which will be the master in this test ), exported the folder ( /etc/exports ), mounted this folder over NFS on node2 and edited /etc/fstab on node2 - success. ( A sketch of these entries follows this list. )
- Used iotop, bmon and htop to monitor disk read/write, network traffic and CPU usage respectively.
- Ran my regcm over the cluster using 14 cores of both machines - success; the command used is shown below.
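( For reference, the relevant entries looked along these lines - the IP addresses and the shared-folder name here are illustrative placeholders, not the exact values used: )

# /etc/hosts ( on both nodes )
192.168.1.101   node1
192.168.1.102   node2

# /etc/exports ( on node1, the NFS server )
/home/user/regcm_share   192.168.1.0/24(rw,sync,no_subtree_check)

# /etc/fstab ( on node2, mounting the share at the same path )
node1:/home/user/regcm_share   /home/user/regcm_share   nfs   defaults   0   0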
$ mpirun -np 14 -hosts node1,node2 ./bin/regcmMPI test.in
Result of this test: faster computation than single-node processing.
Now I tried the same with node0 ( see the specs above ).
-> On node0 I am working on the SSD.
-> It works fine, but the problem is the time factor when connected as a cluster.
Here's the summary of results:
- first, using node0 only - no use of the cluster:
$ mpirun -np 20 ./bin/regcmMPI test.in

nodes   cores_used   run_time_reported_by_regcm [s]   actual_time_taken [s] (approx)
node0   20           59.58                            60
node0   16           65.35                            66
node0   14           73.33                            74

This is okay.
Now, using the cluster ( use the following legend to read the table below ):
rt          = CPU run time reported by regcm, in sec
a-rt        = actual time taken, in sec (approx)
LAN         = max LAN speed achieved ( receive/send ), in MBps
disk(0 / 1) = max disk write speed at node0 / at node1, in MBps
nodes*   cores   rt    a-rt   LAN      disk( 0 / 1 )
1,0      16      148   176    100/30   90 / 50
0,1      16      145   146    30/30     6 / 0
1,0      14      116   143    100/25   85 / 75
0,1      14      121   121    20/20     7 / 0
*note:
1,0 ( e.g. for 16 cores ) means:
$ mpirun -np 16 -hosts node1,node0 ./bin/regcmMPI test.in
0,1 ( e.g. for 16 cores ) means:
$ mpirun -np 16 -hosts node0,node1 ./bin/regcmMPI test.in
Actual run time was calculated manually from the start and end times reported by regcm.
We can see above that the LAN usage and the drive write speed were significantly different between the two options - 1. passing node1,node0 as hosts, and 2. passing node0,node1 as hosts - note the order.
Also, running on a single node is faster than running on the cluster. Why?
I also ran another set of tests, this time using a hostfile ( named hostlist ) whose contents were:
node0:16
node1:6
Then I ran the following command:
$ mpirun -np 22 -f hostlist ./bin/regcmMPI test.in
The CPU run time reported was 101 [s], the actual run time was 1 min 42 sec ( 102 [s] ), the LAN speed achieved was around 10-15 [MB/s] and the disk write speed was around 7 [MB/s].
The best result was obtained when I used the same hostfile setting but ran the code with 20 processes, thus under-subscribing the 22 available slots:
$ mpirun -np 20 -f hostlist ./bin/regcmMPI test.in
CPU runtime : 90 [s]
Actual run time : 91 [s]
LAN : 10 [MB/s]
When I reduced the core count from 20 down to 18, the run time increased to 102 [s].
I have not yet connected node2 to the system.
Questions:
- Is there a way to achieve faster computation? Am I doing something wrong?
- A single machine with 14 cores computes faster than the cluster with 22 or 20 cores. Why is that happening?
- What is the optimum number of cores to use for time efficiency?
- How can I achieve the best performance with the available resources?
- Is there a good mpich usage manual that can answer my questions? ( I could not find any such info. )
- Sometimes using fewer cores gives a faster completion time than using more cores, even though I am not using all available cores and am leaving 1 or 2 cores for the OS and other operations on each node. Why is that happening?
While the advice in the comments above - to contact a regional or national HPC centre - is fair and worth following, I can imagine how hard it can be to get any remarkable processing quota granted if both deadlines and budget are moving against you.
INTRO:
Simplified answers to questions on a yet-hidden complex system:
1: Is there a way to achieve faster computation?
Yes.
Am I doing something wrong?
Not directly.
2: A single machine with 14 cores computes faster than the cluster with 22 or 20 cores. Why is that happening?
You pay more than you get. It is that simple. NFS - a network-distributed abstraction of a filesystem - is possible, but you pay immense costs for the ease of using it once performance becomes the ultimate target. In general, if the lump sum of all the extra costs paid for ( data distribution + the high add-on overheads ) is higher than the net effect of splitting the [PARALLEL]-distributable workload over a still-low number of CPU cores, an actual slowdown instead of a speedup appears. This is the common main suspect here ( not even mentioning switching off hyper-threading in the BIOS for compute-intensive workloads ). A direct measurement of this cost is sketched below.
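( A quick way to see this cost directly is to compare raw write throughput on the local disk against the same test through the NFS-mounted folder - a sketch with illustrative paths, using the standard dd tool: )
$ dd if=/dev/zero of=/home/user/local_test.bin bs=1M count=1024 oflag=direct
$ dd if=/dev/zero of=/home/user/regcm_share/nfs_test.bin bs=1M count=1024 oflag=direct
( The difference between the two reported rates is the toll the network-distributed abstraction charges on every write. )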
3: What is the optimum number of cores to use for time efficiency?
First identify the biggest observed processing bottleneck { CPU | MEMORY | LAN | fileIO | a-poor-algorithm }, and only then seek the best step to improve speed ( keep moving iteratively forward along this { cause: remedy }-chain while performance still grows ). Never attempt to go in the reverse order. A sketch of simple bottleneck probes follows below.
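( For instance, the LAN and MEMORY candidates can be probed directly with standard tools - a sketch, assuming iperf3 is installed on both nodes: )
$ iperf3 -s                  # on node0: start a throughput server
$ iperf3 -c node0 -t 10      # on node1: measure the achievable LAN throughput towards node0
$ vmstat 1                   # during a regcm run: watch CPU wait states and memory pressure / swapping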
4: How can I achieve the best performance with the available resources?
This one is the most interesting and will require more work to be done ( ref. below ).
5: Is there a good mpich usage manual that can answer my questions?
There is no colour of LAN cable that would decide its actual speed and performance, or that would assure its suitability for some specific use - and likewise no manual alone can answer this; it is the overall system architecture that matters.
6: Sometimes using fewer cores gives a faster completion time than using more cores, even though not all available cores are used and 1 or 2 cores are left for the OS and other operations on each node. Why is that happening?
Ref. [ item 2 above ].
SOLUTION:
What can one always do about this design dilemma?
Before taking any step further, do your best to understand both the original Amdahl's Law and its overhead-strict re-formulation. Without mastering this basis, nothing else will help you decide the performance-hunting dilemma - the ( duality ) of fair accounting of both { -costs +benefits }.
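( A sketch of the idea, with p the parallelisable fraction of the work and N the number of CPU cores; the notation p_{SO}, p_{TO} for the add-on parallel setup and termination overheads, expressed as fractions of the original run time, is one common way to write the overhead-strict form:

$$ S_{classical} = \frac{1}{ ( 1 - p ) + \frac{p}{N} } $$

$$ S_{overhead\text{-}strict} = \frac{1}{ ( 1 - p ) + p_{SO} + \frac{p}{N} + p_{TO} } $$

Whenever ( p_{SO} + p_{TO} ) outweigh the ( p - p/N ) saved by going parallel, S drops below 1 and the cluster runs slower than the single node - exactly the effect measured above. )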
A narrow view:
Prefer tests to guesses ( +1 for running tests ); a minimal placement test is sketched right below.
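( For example, before timing anything, it is worth verifying where the MPI ranks actually land - a minimal sketch, reusing the hostlist file from the question: )
$ mpirun -np 22 -f hostlist hostname | sort | uniq -c
( Each line of the output counts how many ranks were placed on each node, confirming whether the node0:16 / node1:6 split really happened. )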
mpich is a tool for making your code run distributed and for all the related process management and synchronisation. While weather simulations may enjoy a well-established locality of influence ( less inter-process communication and synchronisation is mandatory, though the actual code decides how much of it really happens ), the costs of the data-related transfers will still dominate ( ref. below for the orders of magnitude ).
If you cannot modify the code, you have to live with it and can only try to change the hardware that mediates the flow ( from a 1 Gbps interface to 10 GbE or even 40 GbE fabrics, if benchmarking tests support that and the budget permits ).
If you can change the code, take a look at sample test cases, demonstrated as a methodology for the principled approach of isolating the root cause of the actual bottleneck, and keep the { cause: remedy, ... } iterations as a method to fix things that can still go better.
A broader view:
How long will it take to read ( N ) blocks from a 0.8 [TB] disk file and get ( N - 1 ) of them just sent across a LAN? ( A back-of-envelope estimate follows the reference figures below. )
For a rough estimate, let's refresh a few facts about how these things actually work:
Processors differ a lot in their principal internal ( NUMA, in effect ) architectures.
Yet, while such Intel-published details influence any and all performance planning, these figures are neither guaranteed nor constant, as Intel warns in its published remarks.
I like the table where the orders of magnitude are visible; someone might prefer a visual form, where colours "shift" the paradigm, based originally on Peter Norvig's post.
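( For reference - the canonical approximate figures, after the well-known list popularised by Peter Norvig and Jeff Dean; exact values vary with each silicon generation: )

L1 cache reference                        ~          0.5 [ns]
Branch mispredict                         ~          5   [ns]
L2 cache reference                        ~          7   [ns]
Main memory reference                     ~        100   [ns]
Send 1 [kB] over a 1 Gbps network         ~     10,000   [ns]
Read 4 [kB] randomly from SSD             ~    150,000   [ns]
Read 1 [MB] sequentially from memory      ~    250,000   [ns]
Read 1 [MB] sequentially from SSD         ~  1,000,000   [ns]
Disk seek                                 ~ 10,000,000   [ns]
Read 1 [MB] sequentially from disk        ~ 20,000,000   [ns]

( Applying these to the question above - a back-of-envelope sketch, assuming sequential HDD reads: 0.8 [TB] is ~800,000 [MB]; at ~20 [ms] per sequentially-read MB the disk alone costs ~16,000 [s] ( ~4.5 hours ), and shipping almost all of it across a 1 Gbps LAN at a ~125 [MB/s] theoretical peak adds at least another ~6,400 [s] ( ~1.8 hours ) - all before a single useful simulation step is computed. )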
If my budget, deadlines and the simulation software permitted, I would prefer ( performance-wise, due to latency masking and data-locality-based processing ) to go without the mpich layer and instead use the maximum number of CPU + memory-controller channels per CPU.
With common sense for an architecture-optimised processing design, it has been shown that the same results could be achieved in pure-[SERIAL] code even ~50x to ~100x faster than the best case of the "original" code, or of code written with silicon-architecture-"naive" programming tools and improper paradigms.
Today's systems are capable of serving vast simulation needs this way, without spending more than is received.
I hope your conditions will also allow you to go smartly in this performance-subordinated direction, to cut the simulation run time to way less than a month.