Why would using 8 threads be faster than 4 threads on a 4-core hyper-threaded CPU?


I have a quad-core i7 920 CPU. It is hyper-threaded, so the operating system sees 8 logical cores.

From what I've read on the interweb, when doing parallel tasks I should use the number of physical cores, not the number of hyper-threaded (logical) cores.

So I have done some timings, and was surprised that using 8 threads in a parallel loop is faster than using 4 threads.

Why is this? My example code is too long to post here, but you can run it from https://github.com/jsphon/MTVectorizer
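
The actual code is in the repository above; a minimal, hypothetical sketch of this kind of timing comparison (assuming a NumPy-style workload, where the heavy loops run in C and release the GIL) looks something like this:

    # Hypothetical benchmark sketch, not the MTVectorizer code: time the same
    # batch of NumPy-heavy tasks with different thread-pool sizes.
    import time
    from concurrent.futures import ThreadPoolExecutor

    import numpy as np

    DATA = np.random.rand(5_000_000)   # shared, read-only input (~40 MB)

    def work(_):
        # Each task does the same independent NumPy-heavy computation;
        # the multiply/sqrt/sum loops run in C and release the GIL.
        return np.sqrt(DATA * 1.0001).sum()

    def timed_run(n_threads, n_tasks=40):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            list(pool.map(work, range(n_tasks)))
        return time.perf_counter() - start

    if __name__ == "__main__":
        for n in (1, 2, 4, 8):
            print(f"{n} threads: {timed_run(n):.2f} s")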

A chart of the performance is here:

[performance chart]


There are two answers below.


(Intel) hyper-threaded cores act like (up to) two CPUs.

The observation is that a single CPU has a set of resources that would ideally be busy continuously, but in practice sit idle surprisingly often while the CPU waits for some external event, typically memory reads or writes.

By adding a bit of additional state for another hardware thread (e.g., another copy of the registers plus some extra bookkeeping), the "single" CPU can switch its attention to executing the other thread when the first one blocks. (One can generalize this to N hardware threads, and other architectures have done so; Intel stopped at 2.)

If both hardware threads spend much of their time waiting for such events, the CPU can interleave their work and keep its resources busy. 40 nanoseconds for a memory wait is a long time at CPU speeds, so if your program fetches lots of memory, I'd expect both hardware threads to look fully effective, i.e., you should get nearly 2x.

If the two hardware threads are doing work that is highly local (e.g., intense computation purely in the registers), then internal waits become minimal and the single CPU can't service both hardware threads as fast as they generate work. In this case, performance degrades. I don't recall where I heard it, and I heard it a long time ago: under such circumstances the net effect is more like 1.3x than the idealized 2x. (I expect the SO audience to correct me on this.)
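
To make the contrast concrete, here is a rough, hypothetical sketch (mine, not from the original question) comparing how a memory-heavy NumPy kernel and a small, cache-resident kernel scale with thread count. Following the reasoning above you would expect the two to scale differently once you go past the physical core count, but the actual numbers depend heavily on the CPU and memory system:

    # Hypothetical illustration: how memory-heavy vs. cache-resident work
    # scales with thread count. Both kernels spend their time inside NumPy
    # ufuncs, which release the GIL, so the threads genuinely overlap.
    import time
    from concurrent.futures import ThreadPoolExecutor

    import numpy as np

    BIG = np.random.rand(8_000_000)    # ~64 MB: much larger than the on-chip caches
    SMALL = np.random.rand(16_384)     # ~128 KB: stays in the on-chip caches

    def memory_heavy(_):
        # Streams through a large array: mostly waiting on RAM.
        return np.sqrt(BIG * 1.0001).sum()

    def cache_resident(_):
        # Repeated arithmetic on a small array: mostly keeps the ALUs busy.
        x = SMALL * 1.0
        for _ in range(5_000):
            x = np.sqrt(x * 1.0001)
        return x.sum()

    def run(fn, n_threads, n_tasks=16):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            list(pool.map(fn, range(n_tasks)))
        return time.perf_counter() - start

    if __name__ == "__main__":
        for fn in (memory_heavy, cache_resident):
            base = run(fn, 1)
            for n in (4, 8):
                print(f"{fn.__name__}: {n} threads -> {base / run(fn, n):.2f}x speedup")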

Your application may switch back and forth in its needs depending on which part is running at the moment, so you will see some mix of these behaviours. I'm happy with any speed-up I can get.


Ira Baxter has explained your question pretty well, but I want to add one more thing (I can't comment on his answer because I don't have enough reputation yet): there is an overhead to switching from one thread to another. This process, called context switching (http://wiki.osdev.org/Context_Switching#Hardware_Context_Switching), requires at a minimum that your CPU core swap its registers to reflect the state of the new thread. The cost is significant if you are doing process-level context switching, but gets quite a bit cheaper for thread-level switching (a rough measurement sketch follows the two points below). This means two things:

1) Hyper-threading will never give you the theoretical 2x performance boost, because the cost of switching is non-trivial. This is also why the highly local threads Ira describes see degraded scaling: frequent switching multiplies that cost.

2) 8 single-threaded processes will run slower than 4 double-threaded processes doing the same work, since switching between processes costs more than switching between threads. So if you plan on doing this kind of parallel work in Python, make use of the threading module, or of the awesome greenlet library (https://greenlet.readthedocs.org/en/latest/).
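
As a rough, hypothetical way to put ballpark numbers on the switching costs mentioned above (this is only a sketch: the OS-thread figure includes Python and GIL overhead on top of the actual context switch), you can bounce control back and forth between two OS threads and between two greenlets and compare the per-switch times:

    # Hypothetical micro-benchmark of switching costs (ballpark figures only):
    # two OS threads handing control back and forth via events, versus two
    # greenlets doing the same with cooperative switches.
    import threading
    import time

    from greenlet import greenlet   # pip install greenlet

    ROUND_TRIPS = 20_000

    def os_thread_pingpong():
        ev_a, ev_b = threading.Event(), threading.Event()

        def ping():
            for _ in range(ROUND_TRIPS):
                ev_a.wait()
                ev_a.clear()
                ev_b.set()

        def pong():
            for _ in range(ROUND_TRIPS):
                ev_b.wait()
                ev_b.clear()
                ev_a.set()

        t1, t2 = threading.Thread(target=ping), threading.Thread(target=pong)
        start = time.perf_counter()
        t1.start()
        t2.start()
        ev_a.set()                  # kick off the exchange
        t1.join()
        t2.join()
        return time.perf_counter() - start

    def greenlet_pingpong():
        def ping():
            for _ in range(ROUND_TRIPS):
                g_pong.switch()

        def pong():
            for _ in range(ROUND_TRIPS):
                g_ping.switch()

        g_ping, g_pong = greenlet(ping), greenlet(pong)
        start = time.perf_counter()
        g_ping.switch()             # runs until ping's loop finishes
        return time.perf_counter() - start

    if __name__ == "__main__":
        for name, fn in (("OS threads", os_thread_pingpong),
                         ("greenlets", greenlet_pingpong)):
            elapsed = fn()
            print(f"{name}: {elapsed / (2 * ROUND_TRIPS) * 1e6:.2f} us per switch")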