Increasing the thread count from 2 to 3 doesn't increase the speed-up in Mandelbrot

I am using an Intel i5 processor with 4 cores and 4 threads. I am currently simulating the Mandelbrot set using pthreads and ISPC (Intel SPMD Program Compiler). When I use two threads to compute the Mandelbrot image, dividing the work by spatial decomposition of the image, I see a 1.9x speed-up; with 3 threads I see a 1.65x speed-up; and with 4 threads the speed-up saturates at 2.4x. Since the i5 has 4 hardware threads, I expected close to a 4x speed-up given ample parallelism in the program (using pthreads). Why does the speed-up decline when going from 2 to 3 threads? What are the reasons I don't see the expected speed-up? What are ways to get a higher speed-up with ample parallelism in the Mandelbrot case?

Note: I am compiling with gcc and using pthreads as the threading API. The work division is based on spatial decomposition of the image. I am not using any locks or semaphores.

Wikipedia article on the Mandelbrot set: http://en.wikipedia.org/wiki/Mandelbrot_set

ISPC documentation: http://ispc.github.io/

If you find these questions irrelevant here, please redirect me to more appropriate sources. Thank you for your time.

Spatial decomposition does not know how expensive each pixel is to compute. One pixel may require 400 iterations while its neighboring pixel finishes in just 2. Because of this, you should use a load-balancing algorithm; the simplest is to have all participating threads atomically increment a shared work-item index.

Here is my solution to load-balancing a mandelbrot-like unknown workload:

#include <atomic>
#include <thread>
#include <algorithm>

void compute_mandelbrot(int x, int y);     // per-pixel routine (not shown here)
std::atomic<int> atomic_index{0};          // shared work-item counter

void worker(int width, int height) {
    const int total = width * height;
    const int chunk = 128;                 // 100-150 pixels, or maybe bigger
    for (;;) {
        int my_index = atomic_index.fetch_add(chunk);  // grab the next chunk
        if (my_index >= total) break;                  // no work left
        int end = std::min(my_index + chunk, total);
        for (int i = my_index; i < end; ++i)
            compute_mandelbrot(i % width, i / width);  // pixel (x, y)
    }
}

// launch 3 threads:
//   std::thread t0(worker, w, h), t1(worker, w, h), t2(worker, w, h);
//   t0.join(); t1.join(); t2.join();

If you don't want to write a load-balancer, you can still fake load-balancing by simply launching 100 threads and letting the CPU/OS scheduler keep the cores busy until the mandelbrot render is finished. I tried this too, but the atomic load-balancer had ~10-15% higher performance because it does not create and destroy so many threads.
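
A rough sketch of that oversubscription approach (my own illustration, not the original code), assuming a hypothetical render_rows(row_begin, row_end, width) function that renders one horizontal band of the image; std::thread is used here in place of raw pthreads:

#include <algorithm>
#include <thread>
#include <vector>

void render_rows(int row_begin, int row_end, int width);   // hypothetical renderer for one band

int main() {
    const int width = 2000, height = 2000, n_threads = 100;
    const int rows_per_thread = (height + n_threads - 1) / n_threads;

    std::vector<std::thread> pool;
    for (int t = 0; t < n_threads; ++t) {
        int begin = t * rows_per_thread;
        int end   = std::min(begin + rows_per_thread, height);
        if (begin < end)                                    // skip empty tail slices
            pool.emplace_back(render_rows, begin, end, width);
    }
    for (auto& th : pool) th.join();                        // OS scheduler balances 100 threads on 4 cores
}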

With dedicated threads that are kept alive between frames, the performance gain of the load-balancer would be a bit higher still (perhaps ~100 microseconds per thread).
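
A minimal sketch of keeping the workers alive across frames, assuming C++20 std::barrier; the frame count, image size and chunk size are placeholders, and the actual rendering is elided:

#include <atomic>
#include <barrier>
#include <thread>
#include <vector>

std::atomic<int> next_pixel{0};
std::barrier frame_sync(4, []() noexcept { next_pixel = 0; });  // 4 workers; reset counter between frames

void persistent_worker(int total, int chunk, int frames) {
    for (int f = 0; f < frames; ++f) {
        for (;;) {
            int begin = next_pixel.fetch_add(chunk);            // same atomic load-balancing as before
            if (begin >= total) break;
            // ... render pixels [begin, begin + chunk) of frame f ...
        }
        frame_sync.arrive_and_wait();   // threads stay alive; nothing is created/destroyed per frame
    }
}

int main() {
    std::vector<std::thread> pool;
    for (int t = 0; t < 4; ++t)
        pool.emplace_back(persistent_worker, 2000 * 2000, 128, 100);
    for (auto& th : pool) th.join();
}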

Atomic load balancing adapts within a single frame. If you write an iterative load-balancer instead, you lose performance during the first few iterations (the first few rendered frames/zooms) until a fair balance is reached, whereas atomic load balancing is already balanced within the first frame. The advantage of an iterative load-balancer is that the threads do not have to communicate through an atomic variable, which can be faster for some algorithms, but mandelbrot rendering performs better with atomic load balancing.
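
As a sketch of what such an iterative balancer could look like (my own interpretation, not something from the question): each thread owns a band of rows and, after every frame, the band sizes are recomputed from the measured per-thread times of the previous frame.

#include <vector>

// rows[i]  = number of image rows currently assigned to thread i
// times[i] = milliseconds thread i spent on the previous frame
void rebalance(std::vector<int>& rows, const std::vector<double>& times, int total_rows) {
    std::vector<double> speed(rows.size());
    double total_speed = 0.0;
    for (size_t i = 0; i < rows.size(); ++i) {
        speed[i] = rows[i] / (times[i] + 1e-6);    // rows per millisecond on the last frame
        total_speed += speed[i];
    }
    int assigned = 0;
    for (size_t i = 0; i < rows.size(); ++i) {
        rows[i] = static_cast<int>(total_rows * speed[i] / total_speed);
        assigned += rows[i];
    }
    rows.back() += total_rows - assigned;          // hand any rounding leftovers to the last thread
}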

In your case, it probably worked like this:

2 threads:

"15 milliseconds run-time"
- - - - - - - - - \
- - - - - - - - -  =>
- - - - - - - - -  => 
- - - - - - - - -  => thread 1's spatial share (15 milliseconds)
- x x x - - - - -  =>
- - x x x x - - - / 
- - x x x x - - - \  
- x x x - - - - -  =>
- - - - - - - - -  =>
- - - - - - - - -  => thread 2's spatial share (15 milliseconds)
- - - - - - - - -  => 
- - - - - - - - - /
  • first thread: half empty, half mandelbrot surface
  • second thread: half mandelbrot surface, half empty

Both threads do work!

3 threads:

"25 ms run-time"
- - - - - - - - - \
- - - - - - - - -  =>
- - - - - - - - -  => thread 1's spatial share (5 milliseconds)
- - - - - - - - - /
- x x x - - - - - \
- - x x x x - - -  =>
- - x x x x - - -  => thread 2: (25 ms) Bottleneck!
- x x x - - - - - /
- - - - - - - - - \
- - - - - - - - -  =>
- - - - - - - - -  => thread 3's spatial share (5 milliseconds)
- - - - - - - - - /
  • first thread: empty pixels
  • second thread: mandelbrot surface with a lot of depth. Expensive!
  • third thread: empty pixels

Only 1 thread does work!

So when you added a fourth thread, you doubled the number of threads working on the expensive interior again:

"12 milliseconds run-time"
- - - - - - - - - \ 
- - - - - - - - -  => thread 1: 3 milliseconds
- - - - - - - - - /  
- - - - - - - - - \
- x x x - - - - -  => thread 2: 12 milliseconds
- - x x x x - - - /
- - x x x x - - - \
- x x x - - - - -  => thread 3: 12 milliseconds
- - - - - - - - - /
- - - - - - - - - \
- - - - - - - - -  => thread 4: 3 milliseconds
- - - - - - - - - /
  • first thread: empty
  • second thread: half of mandelbrot surface
  • third thread: half of mandelbrot surface
  • fourth thread: empty

The extra thread works on the other empty area, which adds only a minor speed-up.

You can also do the spatial distribution in 2D chunks, but that is still not free of slowdowns caused by the unknown work per pixel. You can partially solve this by sampling the 4 corner pixels of each 2D chunk to get an idea of how expensive the whole chunk will be. For example, if the 4 sampled corners take very different numbers of iterations, the interior of the chunk will quite possibly behave similarly, so you can compute a "cost" estimate per 2D chunk, sort all chunks by cost, and process them one by one starting from the highest cost. Once the highest-cost chunks have been handed to all cores, the cores stay busier than with simple spatial work distribution, and without any atomic messaging. But this comes at the cost of computing 4 extra pixels per chunk, multiplied by the total number of chunks.
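
A sketch of the corner-sampling estimate; mandel_iters(x, y) is a hypothetical helper returning the iteration count of a single pixel, and Chunk is just an illustrative tile descriptor:

#include <algorithm>
#include <vector>

struct Chunk { int x0, y0, x1, y1; long cost; };        // tile covering [x0,x1) x [y0,y1)

long mandel_iters(int x, int y);                        // hypothetical per-pixel iteration count

void sort_chunks_by_cost(std::vector<Chunk>& chunks) {
    for (auto& c : chunks) {
        // sample the 4 corner pixels as a rough cost estimate for the whole tile
        c.cost = mandel_iters(c.x0, c.y0)     + mandel_iters(c.x1 - 1, c.y0) +
                 mandel_iters(c.x0, c.y1 - 1) + mandel_iters(c.x1 - 1, c.y1 - 1);
    }
    // most expensive tiles first, so no core is left holding a huge tile at the end
    std::sort(chunks.begin(), chunks.end(),
              [](const Chunk& a, const Chunk& b) { return a.cost > b.cost; });
}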

On an FX8150 CPU at 3.6 GHz (8 cores), rendering 2000x2000 pixels with at most 35 iterations per pixel yielded these running times:

  • 8 threads with equal distribution: 30+ milliseconds
  • 32 threads with equal distribution: 21 milliseconds
  • 128 threads with equal distribution: ~25 milliseconds
  • 8 threads with atomic-load-balancing: 18 milliseconds

with SIMD (AVX) used through GCC's auto-vectorization. The numbers were partly shaped by the CPU having only 4 real FPUs shared by its 8 cores, which makes it behave as if only 4 threads share the rendering workload, but the load balancing still made it noticeably better.

There are also more efficient work-distribution algorithms, but they do not map well to the x86 architecture. For example, you can:

  • do the first iteration of all pixels
  • enqueue a second iteration, but only for the pixels that still need one
  • do the second iterations and enqueue third iterations the same way
  • repeat until the queue is empty

This way, each new queue can be processed with a simple equal distribution over n threads. But x86 has no hardware acceleration for this workflow and would bottleneck on the enqueue steps. It is probably much better on GPUs, since they have hardware acceleration for the synchronization primitives that x86 lacks.
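
A single-threaded sketch of that queue idea (my own illustration; the view rectangle of the complex plane and the image size are hard-coded): each pass advances only the still-active pixels by one iteration and collects the survivors for the next pass.

#include <complex>
#include <vector>

int main() {
    const int width = 2000, height = 2000, max_iter = 35;
    std::vector<std::complex<double>> z(width * height, {0.0, 0.0});
    std::vector<int> iters(width * height, 0);

    std::vector<int> queue(width * height);              // initial queue: every pixel
    for (int i = 0; i < width * height; ++i) queue[i] = i;

    for (int it = 0; it < max_iter && !queue.empty(); ++it) {
        std::vector<int> next;                           // pixels still active after this pass
        for (int i : queue) {
            double cx = -2.5 + 3.5 * (i % width) / width;    // map pixel to the complex plane
            double cy = -1.25 + 2.5 * (i / width) / height;
            z[i] = z[i] * z[i] + std::complex<double>(cx, cy);
            if (std::norm(z[i]) <= 4.0) {                // |z|^2 <= 4: not escaped yet
                ++iters[i];
                next.push_back(i);
            }
        }
        queue.swap(next);                                // survivors feed the next pass
    }
    // iters[i] now holds how many iterations pixel i survived (up to max_iter)
}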