I have some code
( the Floyd-Warshall algorithm for shortest paths in an NxN matrix ),
with three nested for-loops, each running over the same number of iterations.

In the innermost loop I have an assignment via a ternary operation, = <bool> ? <val1> : <val2>, based on whether a comparison is true or not.

I used OpenMP to parallelize the second loop with a #pragma omp parallel for.

I can't work out the parallel and serial percentages of the code, which I need in order to apply Amdahl's Law and obtain the theoretical speedup.

  for (k = 0; k < N; k++)
    #pragma omp parallel for  private(i,j) shared(x)  num_threads(4)
    for (i = 0; i < N; i++){
        for (j = 0; j < N; j++) {
            x[i][j] =  x[i][j] < (x[i][k] + x[k][j]) ? x[i][j] : (x[i][k] + x[k][j])  ;
        }
    }

I'm using four cores, so I expect a theoretical speedup of 4.

x[i][j] is the matrix, where every element acts as the weight of the edge connecting nodes i and j; it is the macro INF ( infinite ) if they are not connected.
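A side note on INF ( the exact value below is just an assumption, not taken from the question ): since the code evaluates x[i][k] + x[k][j] before comparing, INF is usually chosen so that the sum of two INFs still fits into an int:

  #include <limits.h>

  /* large enough to dominate any real path weight,             */
  /* small enough that INF + INF does not overflow a 32-bit int */
  #define INF ( INT_MAX / 2 )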


There are 2 answers below

BEST ANSWER

With your #pragma, the iterations of the loop over i are distributed across the threads, because that is the loop you marked to be executed in parallel; the loop over j and the body of the innermost loop are still executed serially on each core.

But:

  • I would expect far less than a speedup of 4. There is always some communication overhead.
  • The body of the innermost loop uses the same matrix on all cores and also modifies it. Therefore the changes to the matrix must be propagated to all other cores. This might lead to the following problems:

    1. CPU caches might be useless, because the cached array elements can be changed by another core that has its own cache.
    2. In the worst case, every modification of the matrix depends on the previous change ( I haven't checked this for your case ). In that case no parallel execution is possible => no speedup at all from using more than one core.

You should check whether it is possible to change your algorithm so that non-intersecting sub-matrices can be processed independently. You will get the best speedup if the cores work on separate, non-intersecting data.

Since doing so takes almost no effort, you should definitely profile the code and its variants.
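A minimal sketch of such a measurement ( the matrix size N = 1000 and the missing matrix set-up are assumptions here ): compile with -fopenmp, run once with OMP_NUM_THREADS=1 and once with OMP_NUM_THREADS=4, and the ratio of the two printed wall-clock times is the measured speedup you can put next to the Amdahl prediction:

  #include <omp.h>
  #include <stdio.h>

  #define N 1000
  static int x[N][N];                      /* fill with edge weights / INF before timing */

  int main( void ){
      int i, j, k;

      double t0 = omp_get_wtime();         /* wall-clock before the triple loop */
      for ( k = 0; k < N; k++ ){
          #pragma omp parallel for private(i,j) shared(x)
          for ( i = 0; i < N; i++ )
              for ( j = 0; j < N; j++ )
                  x[i][j] = x[i][j] < ( x[i][k] + x[k][j] )
                          ?            x[i][j]
                          :          ( x[i][k] + x[k][j] );
      }
      double t1 = omp_get_wtime();         /* wall-clock after the triple loop  */

      printf( "N = %d  threads = %d  T = %f [s]\n", N, omp_get_max_threads(), t1 - t0 );
      return 0;
  }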

ANSWER

TL;DR:

It is great that universities spend more time on practical examples of Amdahl's Law, to show how easily marketing creates false expectations about multi-core and many-core toys.


That said, let's define the test-case:

The Floyd-Warshall processing can be structured into the following sections:

  1. process launch overheads
  2. data-structures memory allocations + setup
  3. data-values initialisations
  4. Floyd-Warshall specific conditions ( Zeroised diagonal, etc. )
  5. Section timing tools
  6. Section with Floyd-Warshall O(N^3) process with a potential Improvement-Under-Test [IUT]

Amdahl's Law declares an ultimate limit for the overall "improvement" of any process: given that section [6] contains the [IUT] to be evaluated, the overall "improvement" will NEVER become better than ~ 1 / ( ( 1 - IUT ) + ( IUT / N ) ), where IUT here stands for the fraction of the total run-time spent in the improved section and N for the number of resources it is spread across.
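A worked example ( the numbers are illustrative assumptions, not measurements ): if section [6] accounts for 95 % of the serial run-time and is spread across N = 4 cores, the limit is 1 / ( 0.05 + 0.95 / 4 ) ~= 3.48, not 4. A tiny helper to evaluate the bound:

  #include <stdio.h>

  /* Amdahl's Law bound: p = fraction of run-time that gets improved, n = resources */
  double amdahl_bound( double p, int n ){ return 1.0 / ( ( 1.0 - p ) + p / n ); }

  int main( void ){
      printf( "p = 0.95, n = 4  ->  max speedup ~ %f\n", amdahl_bound( 0.95, 4 ) ); /* ~3.48 */
      return 0;
  }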

Kind readers are left to test and record the timing for the ( 1 - IUT ) part of the experiment.


How to compute the effect of the potentially parallelised [IUT] in section [6] of the code?

First, let's focus on what happens in the originally posted code, in a pure SEQ ( serial ) code-execution flow:

The initial snippet already had some room for performance improvement, even without any OpenMP-based attempt to distribute the task onto a larger resources-base:

 for        ( k = 0; k < N; k++ )
    for     ( i = 0; i < N; i++ ){
        for ( j = 0; j < N; j++ ){
            x[i][j] =  x[i][j] >  ( x[i][k] + x[k][j] )         // .TEST   <bool>
                               ?  ( x[i][k] + x[k][j] )         // .ASSIGN <val1>
                               :    x[i][j];                    // .ASSIGN <val2>
        }
    }

If this were run as a purely SEQ ( serial ) solo, or under the attempt to harness the #pragma omp as posted in the original question, in both cases Amdahl's Law will show ZERO or even a "negative" improvement.


Why? Why not a speed-up of 4? Why even no improvement at all?

Because the code was instructed to run "mechanically" repeated on all resources: exactly the same, identical scope of the task was executed in full 4 times, shoulder-to-shoulder, each run beside the others. So the 4-times larger pool of resources did not bring the expected positive effect, as the threads together still spent the same time co-running all parts of the task 4 times over, each one independent of the others' potential "help" ( if not worse, in cases where resource contention was observed while the whole task was running ).

So, let's rather use the OpenMP strengths to split the task and let each of the resources process just its adequate portion of the scope of the algorithm. The Floyd-Warshall algorithm is very forgiving in this direction and allows that, because its processing scheme, even when negative weights are allowed, is non-intervening, so no hostile barriers, syncs or critical sections are needed to propagate anything among the threads.


So, can we get any OpenMP benefit here? Oh yes, a lot:

#include "omp.h"                        // .MUST SET a gcc directive // "-fopenmp"
     // --------------------------------------------------------[1] ref. above
int main( void ){
           int i, j, k;
     const int N = 100;
           int x[100][100];
     // --------------------------------------------------------[2] ref. above
     // --------------------------------------------------------[3] ref. above
     // --------------------------------------------------------[4] ref. above
     for (          k = 0; k < N; k++ )
     {  
     // --------------------------------------------------------[5] ref. above
        //------------------------------------------------------[6]----- OMP
       // ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
      //              PARALLEL is not precise, "just"-CONCURRENT is EXACT IN THE SECTION LEVEL BELOW
          #pragma omp parallel      private(i,j) shared(x)  num_threads(4)   // .N.B: no "for" clause -- the work is split manually below by the tID-strided j-loop, so OpenMP must NOT also split the i-range
          for (     i =   0; i < N; i++ ){                                                                                  // .MUST  incl.actual k-th ROW, in case NEG weights are permitted
              int  nTHREADs = omp_get_num_threads();                  // .GET "total" number of spawned threads
              int       tID = omp_get_thread_num();                   // .GET "own"       tID# {0,1,..omp_get_num_threads()-1} .AVOID dumb repeating the same .JOB by all spawned threads
              for ( j = tID; j < N; j += nTHREADs ){                  // .FOR WITH tID#-offset start + strided .INC STEP    // .MUST  incl.actual k-th COL, in case NEG weights are permitted
                  // - - - - - - - - - - - - - - - - - - - - - - - -  
                  // SINCE HERE:                                      // .JOB WAS SPLIT 2 tID#-ed, NON-OVERLAPPING tasks
                  // ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ // .N.B:  dumb "just"-CONCURRENT processing is O.K. here
                  // ................................................ //                             1-thread  .INC STEP   +1  a sure ZERO Amdahl-Law effect ( will bear an adverse penalty from use-less omp_get_*() calls )
                  // °.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°. //                             2-threads .INC STEP   +2     may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP /   2 ) ) if enough free CPU-resources
                  // '-.'-.'-.'-.'-.'-.'-.'-.'-.'-.'-.'-.'-.'-.'-.'-. //                             3-threads .INC STEP   +3     may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP /   3 ) ) if enough free CPU-resources
                  // ^'-.^'-.^'-.^'-.^'-.^'-.^'-.^'-.^'-.^'-.^'-.^'-. //                             4-threads .INC STEP   +4     may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP /   4 ) ) if enough free CPU-resources
                  // o1234567o1234567o1234567o1234567o1234567o1234567 //                             8-threads .INC STEP   +8     may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP /   8 ) ) if enough free CPU-resources
                  // o123456789ABCDEFo123456789ABCDEFo123456789ABCDEF //                            16-threads .INC STEP  +16     may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP /  16 ) ) if enough free CPU-resources
                  // o123456789ABCDEFGHIJKLMNOPQRSTUVo123456789ABCDEF //                            32-threads .INC STEP  +32     may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP /  32 ) ) if enough free CPU-resources
                  // o123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijkl //                            64-threads .INC STEP  +64     may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP /  64 ) ) if enough free CPU-resources
                  // o123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijkl //                           128-threads .INC STEP +128     may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP / 128 ) ) if enough free CPU-resources

                  int            aPair = x[i][k] + x[k][j];           // .MUST   .CALC ADD( x_ik, x_kj ) to TEST            // .MAY  smart re-use in case .GT. and ASSIGN will have to take a due place
                  if ( x[i][j] > aPair ) x[i][j] = aPair;             // .IFF    .UPD                                       // .AVOID dumb re-ASSIGN(s) of self.value(s) to self
                  // - - - - - - - - - - - - - - - - - - - - - - - -
              }
          }
       // ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
     }// --------------------------------------------------------------- OMP
     return 0;
}
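For completeness, a usage note: the snippet above must be compiled with OpenMP enabled, e.g. gcc -O2 -fopenmp floyd_warshall.c -o floyd_warshall ( the file name is just a placeholder ); without -fopenmp the #pragma lines are ignored and the omp_get_*() calls will not be resolved at link time.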

Understanding OpenMP beyond the limit predicted by Amdahl's Law:

The proposed approach opens some potential for further exploration through a few fun experiments ( a small harness for running them is sketched after the list ):

  • set the number of threads to 1 ( to use as an experimentation baseline )
  • set the number of threads to ( nCPUcores / 2 )
  • set the number of threads to ( nCPUcores - 1 )
  • set the number of threads to ( nCPUcores ) + run a disk defragmentation/compression at the same time
  • set the number of threads to ( nCPUcores * 2 )
  • set the number of threads to ( nCPUcores * 2 ) + enforce CPU-affinity on just 2 CPU-cores
  • set the number of threads to ( nCPUcores * 20 )
  • set the number of rows/cols N ~ { 1.000 | 10.000 | 100.000 | 1.000.000 } and check the effects
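A small harness for these runs might look like the sketch below ( the kernel is the tID-strided one from above, just without the hard-coded num_threads(4) clause, so the team size requested via omp_set_num_threads() is actually honoured; N and the list of thread counts are assumptions ):

  #include <omp.h>
  #include <stdio.h>

  #define N 1000
  static int x[N][N];                              /* sections [2]..[4]: fill with weights / INF, zero the diagonal */

  /* section [6]: the manually tID-strided kernel, team size left to the caller */
  static void fw_kernel( void ){
      int i, j, k;
      for ( k = 0; k < N; k++ ){
          #pragma omp parallel private(i,j) shared(x)
          {
              int nTHREADs = omp_get_num_threads();
              int      tID = omp_get_thread_num();
              for ( i = 0; i < N; i++ )
                  for ( j = tID; j < N; j += nTHREADs ){
                      int aPair = x[i][k] + x[k][j];
                      if ( x[i][j] > aPair ) x[i][j] = aPair;
                  }
          }
      }
  }

  int main( void ){
      int trials[] = { 1, 2, 3, 4, 8, 16 };        /* thread counts to compare */
      for ( int t = 0; t < (int)( sizeof trials / sizeof trials[0] ); t++ ){
          omp_set_num_threads( trials[t] );        /* request the team size    */
          double t0 = omp_get_wtime();
          fw_kernel();                             /* time section [6] only    */
          double t1 = omp_get_wtime();
          printf( "threads = %2d   T = %f [s]\n", trials[t], t1 - t0 );
      }
      return 0;
  }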