I am using an Intel i5 processor with 4 cores and 4 threads. I am currently simulating the Mandelbrot set using pthreads and ISPC (Intel SPMD Program Compiler). The work is divided by spatial decomposition of the image. With 2 threads I see a 1.9x speedup, with 3 threads the speedup drops to 1.65x, and with 4 threads it saturates at 2.4x. Since the i5 has 4 hardware threads, I expected close to a 4x speedup given ample parallelism in the program (using pthreads). Why does the speedup decline when using 3 threads? What are the reasons I don't see the expected speedup? What are ways to get a higher speedup for the Mandelbrot set given ample parallelism?
Note: I am compiling with gcc and using pthreads as the threading API. The division is a spatial decomposition of the image. I am not using any locks or semaphores.
wiki link for mandelbrot http://en.wikipedia.org/wiki/Mandelbrot_set
github link for ISPC documents http://ispc.github.io/
In case you find the questions irrelevant, please redirect me to appropriate sources. Thank you for your time.
Spatial decomposition does not account for how expensive each pixel is to compute. One pixel may require 400 iterations while a neighboring pixel completes in just 2. Because of this, you should use a load-balancing algorithm; the easiest is to have all participating threads atomically increment a shared work-item index.
Here is my solution to load-balancing a mandelbrot-like unknown workload:
If you don't want to write a load balancer, you can still fake load balancing by simply launching, say, 100 threads and letting the CPU/OS scheduler keep the cores busy until the Mandelbrot render finishes. I tried this too, but the atomic load balancer had ~10-15% higher performance because it does not create and destroy so many threads.
With dedicated threads that are kept alive, the performance gain would be a bit higher for the load-balancer (perhaps ~100 microseconds per thread).
Atomic load balancing is performance-aware. If you write an iterative load balancer instead, you lose performance in the first few iterations (rendering the first few frames/zooms) until a fair balance is achieved; with atomic load balancing, balancing is complete within the first frame. The advantage of an iterative load balancer is that it does not require threads to communicate through an atomic variable, which can be faster for some algorithms, but Mandelbrot rendering performs better with atomic load balancing.
In your case, it probably worked like this (assuming the image is split into horizontal strips):
2 threads: the expensive interior of the set straddles both halves, so both threads do real work!
3 threads: the expensive interior lands almost entirely in the middle strip, so effectively only 1 thread does the heavy work while the other two finish their mostly-empty strips early!
So when you added a fourth thread, you doubled the number of threads working on the expensive interior again, with the extra thread working on the other mostly-empty area for only a minor speedup.
You can also do the spatial distribution in 2D chunks, but that still suffers from the unknown-work-per-pixel slowdowns. You can partially solve this by sampling the 4 corner pixels of each 2D chunk to get an idea of how expensive the whole chunk is to compute. For example, if the 4 sampled corner pixels take very different numbers of iterations, the interior of the chunk will quite possibly behave similarly, so you can compute a "cost" estimate per 2D chunk, sort all chunks by cost, and process them one by one starting from the highest cost. Once the highest-cost chunks are placed on all cores, the cores stay busier than with simple spatial work distribution, and without any atomic messaging. But this has the cost of computing 4 extra pixels per chunk, multiplied by the total number of chunks.
On an FX8150 3.6GHz CPU (8 cores), rendering 2000x2000 pixels at a maximum of 35 iterations per pixel, with SIMD (AVX) usage through GCC auto-vectorization, the load-balanced version ran noticeably faster. The limited scaling was partly due to the CPU having only 4 real FPUs shared by 8 cores, which made it behave as if only 4 threads were sharing the rendering workload, but the load balancing still made it much better.
There are also more efficient work-distribution algorithms, but they do not map well to the x86 architecture: for example, schemes where the remaining work items are repeatedly enqueued into fresh queues. Each new queue can then be split with simple equal distribution over n threads. But x86 has no hardware acceleration for this workflow and would bottleneck on the enqueue steps. It's probably much better on GPUs, as they have hardware acceleration for the synchronization primitives that x86 lacks.