Why are pthread spinlocks so much faster than mutexes?


I'm hacking on a fifo/queue for a multithreaded program using the following example as a guide. One thread writes to the fifo and another reads from it. The fifo will only be shared between these two threads. The synchronization overhead is massive; adding and removing 10 million items from the fifo takes about three seconds on my machine if I use mutexes and 1.5 seconds if I use spinlocks instead.

My blocking add function is as follows; the blocking remove function works analogously. The functions have been updated since John Bollinger pointed out that the spinlock code was racy with more than one writer or reader.

/* Busy-wait (without holding the lock) until the queue has free space.
   Used by the "uncontested" variant, which assumes a single producer
   and a single consumer. */
static void
spin_while_full(volatile synced_queue *me) {
    while (me->queue->n_elements == me->queue->capacity) {
    }
}

void
synced_queue_add(synced_queue *me, void *value) {
    synced_queue_lock_type tp = me->lock_type;
    if (tp == SYNCED_QUEUE_SPIN_LOCK) {
        /* Correct spinlock variant: test the queue state only while
           holding the lock; release and retry if the queue is full. */
        while (true) {
            pthread_spin_lock(&me->lock);
            if (me->queue->n_elements < me->queue->capacity) {
                break;
            }
            pthread_spin_unlock(&me->lock);
        }
    } else if (tp == SYNCED_QUEUE_SPIN_LOCK_UNCONTESTED) {
        /* Racy variant: busy-wait without the lock, then lock. Only
           intended for a single producer and a single consumer. */
        spin_while_full(me);
        pthread_spin_lock(&me->lock);
    } else if (tp == SYNCED_QUEUE_MUTEX) {
        /* Mutex variant: sleep on the producer condition variable until
           the consumer signals that space is available. */
        pthread_mutex_lock(&me->mutex);
        while (me->queue->n_elements == me->queue->capacity) {
            pthread_cond_wait(&me->var_prod, &me->mutex);
        }
    } else {
        assert(false);
    }
    queue_add(me->queue, value);
    if (tp == SYNCED_QUEUE_SPIN_LOCK ||
        tp == SYNCED_QUEUE_SPIN_LOCK_UNCONTESTED) {
        pthread_spin_unlock(&me->lock);
    } else {
        pthread_mutex_unlock(&me->mutex);
        pthread_cond_signal(&me->var_cons);
    }
}
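
For reference, the struct behind these functions presumably looks something like the sketch below. The field names are inferred from the code above; the actual definition may differ.

#include <pthread.h>

typedef struct queue queue;  /* underlying fifo with n_elements/capacity */

typedef enum {
    SYNCED_QUEUE_SPIN_LOCK,
    SYNCED_QUEUE_SPIN_LOCK_UNCONTESTED,
    SYNCED_QUEUE_MUTEX
} synced_queue_lock_type;

typedef struct {
    queue *queue;                      /* the unsynchronized fifo */
    synced_queue_lock_type lock_type;  /* chosen by the library's user */
    pthread_spinlock_t lock;
    pthread_mutex_t mutex;
    pthread_cond_t var_prod;           /* producers wait here while the queue is full */
    pthread_cond_t var_cons;           /* consumers wait here while the queue is empty */
} synced_queue;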

The library's user sets the lock_type field to select the synchronization method. The benchmark results for the three lock types are:

=== benchmark_synced_queue
spinlock               1024 elements  0.17 us/job
uncontested spinlock   1024 elements  0.14 us/job
mutex                  1024 elements  0.31 us/job
spinlock                512 elements  0.17 us/job
uncontested spinlock    512 elements  0.14 us/job
mutex                   512 elements  0.31 us/job
spinlock                256 elements  0.15 us/job
uncontested spinlock    256 elements  0.14 us/job
mutex                   256 elements  0.28 us/job
spinlock                128 elements  0.15 us/job
uncontested spinlock    128 elements  0.14 us/job
mutex                   128 elements  0.29 us/job
spinlock                 64 elements  0.16 us/job
uncontested spinlock     64 elements  0.14 us/job
mutex                    64 elements  0.28 us/job
spinlock                 32 elements  0.18 us/job
uncontested spinlock     32 elements  0.15 us/job
mutex                    32 elements  0.15 us/job
spinlock                 16 elements  0.21 us/job
uncontested spinlock     16 elements  0.16 us/job
mutex                    16 elements  0.30 us/job
spinlock                  8 elements  0.29 us/job
uncontested spinlock      8 elements  0.16 us/job
mutex                     8 elements  0.60 us/job
spinlock                  4 elements  0.43 us/job
uncontested spinlock      4 elements  0.17 us/job
mutex                     4 elements  1.21 us/job

So while the "correct" spinlock is a little slower than the incorrect one, the difference between the spinlock types and the mutex is much bigger.

The pthread manual pretty much discourages the use of spinlocks:

Spin locks should be employed in conjunction with real-time scheduling policies (SCHED_FIFO, or possibly SCHED_RR). Use of spin locks with nondeterministic scheduling policies such as SCHED_OTHER probably indicates a design mistake. ... User-space spin locks are not applicable as a general locking solution. They are, by definition, prone to priority inversion and unbounded spin times. A programmer using spin locks must be exceptionally careful not only in the code, but also in terms of system configuration, thread placement, and priority assignment.

My question is: why is locking so much slower with mutexes than with spinlocks? The problem with spinlocks, of course, is that they tax the CPU. So is there a way to avoid spinlocks but still get good performance?

Answer from Marco Bonelli:

I'll give you my educated guess. It needs some testing to be proven right/wrong, which you should be able to perform yourself.

First of all let me highlight the fundamental implementation difference between spinlocks and mutexes:

  1. Pthread spinlocks are implemented almost entirely in userspace (though, depending on the implementation, they may fall back to actually sleeping through syscalls under high contention). As the name suggests, they make the current CPU "spin" while waiting to acquire the lock. This is usually implemented with atomic compare-and-exchange CPU instructions (a minimal sketch follows this list).
  2. Pthread mutexes are implemented through the futex(2) syscall (at least on Linux), or at any rate through a set of syscalls. Unlike spinlocks, they avoid burning CPU while waiting for a lock by putting the current thread to sleep at least until the lock is ready to be acquired.
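
To make point 1 concrete, a userspace spinlock is conceptually just a loop around an atomic test-and-set. The following is a minimal sketch using C11 atomics; it is purely illustrative and not the actual glibc implementation.

#include <stdatomic.h>

/* Toy spinlock: conceptually what pthread_spin_lock/unlock do. */
typedef struct {
    atomic_flag flag;   /* clear = unlocked, set = locked */
} toy_spinlock;

#define TOY_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static void toy_spin_lock(toy_spinlock *l) {
    /* Atomically set the flag; if it was already set, another thread
       holds the lock, so keep retrying (burning CPU) until it is free. */
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire)) {
        /* busy-wait */
    }
}

static void toy_spin_unlock(toy_spinlock *l) {
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}

Note that the whole lock/unlock cycle happens in userspace; the kernel is never involved.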

Your implementation uses a busy loop in the case of spinlocks vs. a pthread condition variable in the case of mutexes. Again, the first case is likely a userspace-only job and the second case implies sleeping/waking threads through syscalls. This further polarizes the two cases you are testing: in the first one (spinlocks) you will have two threads that almost entirely hog the CPU for their exclusive use, while in the second one (mutexes) you will have two threads that "intelligently" sleep and wake, trying to minimize CPU usage when it is not needed.

By itself, this already introduces a big difference in overhead between the two implementations: the performance of the first one is limited more or less only by the speed of your CPUs, while the second one also has to rely on the underlying kernel for task sleep/wakeup. Performing a syscall (mutex) is a much more expensive operation than a few hundred CMPXCHG instructions (spinlock).
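
If you want to see that gap in isolation, a rough contended microbenchmark sketch follows (the names and iteration count are arbitrary): two threads hammer the same lock around a trivial critical section. Under contention the default mutex path goes through futex(2) sleeps and wakeups, while the spinlock never leaves userspace; you can confirm the syscall traffic with strace -f -e trace=futex. Exact numbers will vary wildly between machines.

#include <pthread.h>
#include <stdio.h>
#include <time.h>

/* Two threads increment a shared counter under either a spinlock or a
   mutex. Compile with -pthread. */
#define ITERS 5000000L

static pthread_spinlock_t spin;
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static int use_spin;
static long counter;

static void *worker(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) {
        if (use_spin) {
            pthread_spin_lock(&spin);
            counter++;
            pthread_spin_unlock(&spin);
        } else {
            pthread_mutex_lock(&mutex);
            counter++;
            pthread_mutex_unlock(&mutex);
        }
    }
    return NULL;
}

static double run(int spin_mode) {
    pthread_t a, b;
    struct timespec t0, t1;
    use_spin = spin_mode;
    counter = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    pthread_spin_init(&spin, PTHREAD_PROCESS_PRIVATE);
    printf("spinlock: %.2f s\n", run(1));
    printf("mutex:    %.2f s\n", run(0));
    pthread_spin_destroy(&spin);
    return 0;
}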

On top of this, modern CPUs and operating systems use dynamic CPU frequency scaling, which essentially means that the clock speed of each core is not fixed, but continuously adjusted (independently of other cores) based on its utilization, either by the kernel, the hardware itself, or a combination of the two (see e.g. Intel P-state).

Therefore, what is likely happening is that the intermittent sleeping caused by mutexes in your second implementation is further degrading performance by making the CPU constantly change clock speed like a roller coaster. I have talked about this phenomenon in more detail (specifically focusing on Linux) in this other answer of mine.
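
If you want to rule frequency scaling in or out on your own machine, the current governor and clock for each core are exposed through sysfs on Linux. The paths below assume a cpufreq driver is loaded and may be absent otherwise; this small reader is just a convenience sketch.

#include <stdio.h>

/* Print cpu0's cpufreq governor and current frequency (Linux-specific). */
static void print_file(const char *path) {
    char buf[128];
    FILE *f = fopen(path, "r");
    if (f != NULL) {
        if (fgets(buf, sizeof buf, f) != NULL) {
            printf("%s: %s", path, buf);
        }
        fclose(f);
    }
}

int main(void) {
    print_file("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor");
    print_file("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");
    return 0;
}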

Answer from John Bollinger:

Why are pthread spinlocks so much faster than mutexes?

Marco has described several good reasons for the specific performance difference you observed, but he didn't make one important observation: the relative timing of your spinlock version vs your mutex + cv version is meaningless because these are not equivalent to each other.

Your mutex + cv alternative correctly waits for the queue to have an available element / have available capacity before proceeding to dequeue or enqueue an element, without any data races.

On the other hand, your spinlock version is racy. It has data races when it spins waiting for the queue to be in a suitable state to proceed, and if there is more than one reader or more than one writer, your queue can break, trying to dequeue an element when there aren't any enqueued, or to enqueue an element when the queue is already at capacity, because of the gap between observing the queue as ready and actually acquiring the spinlock.

To avoid those races, the spinlock version must acquire the spinlock before testing the state of the queue, and if that is found unsuitable, it must release the spinlock and reacquire it before testing again. You can't substitute a spinlock for a mutex for use in conjunction with a CV, so that's going to be a tight loop. For example:

void synced_queue_remove(synced_queue *me, void *value) {
    while (1) {
        pthread_spin_lock(&me->lock);
        if (me->queue->n_elements > 0) break;
        pthread_spin_unlock(&me->lock);
    }
    queue_remove(me->queue, value);
    pthread_spin_unlock(&me->lock);
}

I predict that this will tend to perform very poorly indeed, but such a result reflects the design of the code as a whole. It is no more a direct comparison of mutex performance versus spinlock performance than the original example is.