Lockless deque with non-atomic-sized items

I'm using a worker deque as described in "Correct and Efficient Work-Stealing for Weak Memory Models". I want queue items to be 16 bytes in size, and I only care about Intel/AMD Windows x64 and VS 2019.

I understand that aligned 16-byte loads/stores (say, of an __m128) are typically atomic on modern processors, but this is not guaranteed by the specification.

The types for the deque are:

typedef struct {
    atomic_size_t size;         /* capacity of the ring buffer */
    atomic_int buffer[];        /* queue items (atomic-qualified) */
} Array;


typedef struct {
    atomic_size_t top, bottom;  /* steal end / owner end indices */
    Atomic(Array *) array;      /* current buffer, replaced on resize */
} Deque;

Importantly, the array buffer items themselves have an atomic type. If I compile this with VS2019 I can see that it bloats each buffer item with an embedded spinlock, which I don't want. Is it possible to prevent this, given that I only care about x64, which comes with certain guarantees?
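
For concreteness, here is a minimal illustration of what I mean, using std::atomic (Item is just a stand-in for my real 16-byte type):

#include <atomic>

struct Item { void* fn; void* arg; };   /* hypothetical 16-byte payload on x64 */
static_assert(sizeof(Item) == 16, "expected a 16-byte item");

/* What I observe with VS2019: this is not lock-free, and the wrapper is larger
   than 16 bytes because it embeds a spinlock. */
std::atomic<Item> slot;
/* static_assert(std::atomic<Item>::is_always_lock_free);   // fails with VS2019 */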

The actions on the deque are given by the functions:

int take(Deque* q) {
    size_t b = load_explicit(&q->bottom, relaxed) - 1;
    Array* a = load_explicit(&q->array, relaxed);
    store_explicit(&q->bottom, b, relaxed);
    thread_fence(seq_cst);
    size_t t = load_explicit(&q->top, relaxed);
    int x;
    if( t <= b ) {
        /* Non-empty queue. */
        x = load_explicit(&a->buffer[b % a->size], relaxed);
        if( t == b ) {
            /* Single last element in queue. */
            if( !compare_exchange_strong_explicit(&q->top, &t, t + 1, seq_cst, relaxed) )
                /* Failed race. */
                x = EMPTY;
            store_explicit(&q->bottom, b + 1, relaxed);
        }
    } else { /* Empty queue. */
        x = EMPTY;
        store_explicit(&q->bottom, b + 1, relaxed);
    }
    return x;
}


void push(Deque* q, int x) {
    size_t b = load_explicit(&q->bottom, relaxed);
    size_t t = load_explicit(&q->top, acquire);
    Array* a = load_explicit(&q->array, relaxed);
    if( b - t > a->size - 1 ) { /* Full queue. */
        resize(q);
        a = load_explicit(&q->array, relaxed);
    }
    store_explicit(&a->buffer[b % a->size], x, relaxed);
    thread_fence(release);
    store_explicit(&q->bottom, b + 1, relaxed);
}


int steal(Deque* q) {
    size_t t = load_explicit(&q->top, acquire);
    thread_fence(seq_cst);
    size_t b = load_explicit(&q->bottom, acquire);
    int x = EMPTY;
    if( t < b ) {
        /* Non-empty queue. */
        Array* a = load_explicit(&q->array, consume);
        x = load_explicit(&a->buffer[t % a->size], relaxed);
        if( !compare_exchange_strong_explicit(&q->top, &t, t + 1, seq_cst, relaxed) )
            /* Failed race. */
            return ABORT;
    }
    return x;
}

Much of that is redundant on x64 and should optimise out. In fact, the paper states that on x86/x64 only the memory fence at the thread_fence(seq_cst) line in take() is actually needed. However, I am not sure whether that still holds if the queue item type is 16 bytes in size.

Since take()/push() must be called from the same (owner) thread, there is no issue between them. The danger is therefore a thread calling steal() reading a partially written 16-byte item. But since push() issues its fence only after all 16 bytes are written, and updates bottom only after that, it seems that on x64 this isn't a concern?

I have run an experiment in which I removed the atomic qualifier from the buffer items and used plain assignments to and from the buffer via a volatile pointer. It appeared to work fine, but obviously that is no certainty!

If this is not possible, then perhaps using cmpxchg16b to load/store the 16 bytes is a better option in my specific case? Or I could complicate it all by making queue items indices into locklessly allocated 16-byte slots.
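
To make the cmpxchg16b idea concrete, the kind of thing I have in mind is roughly the following (just a sketch; Item, load16 and store16 are my own placeholder names, and _InterlockedCompareExchange128 is the MSVC intrinsic that compiles to lock cmpxchg16b):

#include <intrin.h>

struct alignas(16) Item { long long lo, hi; };   // stand-in for my 16-byte item

// 16-byte atomic load: whether the CAS succeeds or fails, cur[] ends up holding
// the current 16-byte contents; on success it merely writes the same value back.
static Item load16(volatile Item* p) {
    long long cur[2] = { 0, 0 };
    _InterlockedCompareExchange128(
        reinterpret_cast<volatile long long*>(p), cur[1], cur[0], cur);
    return Item{ cur[0], cur[1] };
}

// 16-byte atomic store: retry the CAS until it succeeds; on failure the
// comparand is refreshed with the current contents.
static void store16(volatile Item* p, Item v) {
    long long expected[2] = { 0, 0 };
    while (!_InterlockedCompareExchange128(
            reinterpret_cast<volatile long long*>(p), v.hi, v.lo, expected)) {
        // expected[] now holds the current value; try again
    }
}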

So the simplified version of my question is: on x64, can I simply change the Array buffer to an array of non-atomic-qualified 16-byte struct items accessed through a volatile pointer, and replace the loads and stores of those items in the functions above with plain non-atomic assignment expressions?

1 Answer

This may be only a partial answer: I attempt to address the 16-byte atomics, but I have no idea why you need the elements of the queue to be atomic, rather than only the counters.


The easiest way to accomplish this with MSVC 2019 while still writing potentially portable code is either:

  • Use std::atomic_ref<your 16-byte type>
  • #define _STD_ATOMIC_ALWAYS_USE_CMPXCHG16B before including <atomic>
  • static_assert on atomic_ref<...>::is_always_lock_free to ensure you've got a genuinely lock-free atomic (see the sketch just below this list)
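
A minimal sketch of that first option (assumptions: MSVC 2019 16.8+ with C++20 atomics enabled for std::atomic_ref; Item, load_item and store_item are placeholder names):

// Must be defined before <atomic> is included anywhere in the translation unit.
#define _STD_ATOMIC_ALWAYS_USE_CMPXCHG16B 1
#include <atomic>
#include <cstdint>

struct alignas(16) Item {    // placeholder 16-byte queue item; alignas(16)
    std::uint64_t lo, hi;    // satisfies atomic_ref's required_alignment
};

static_assert(std::atomic_ref<Item>::is_always_lock_free,
              "want genuinely lock-free 16-byte atomics (cmpxchg16b)");

Item load_item(Item& slot) {
    return std::atomic_ref<Item>(slot).load(std::memory_order_relaxed);
}

void store_item(Item& slot, Item v) {
    std::atomic_ref<Item>(slot).store(v, std::memory_order_relaxed);
}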

Or, alternatively:

  • Use boost::atomic<your 16-byte type> or boost::atomic_ref<your 16-byte type>
  • Again, static_assert on is_always_lock_free to ensure you've got a genuinely lock-free atomic (also sketched below)
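
And a similarly hedged sketch of the Boost variant (assuming a reasonably recent Boost.Atomic; again Item is a placeholder):

#include <boost/atomic.hpp>

struct alignas(16) Item { unsigned long long lo, hi; };   // placeholder 16-byte item

static_assert(boost::atomic<Item>::is_always_lock_free,
              "want genuinely lock-free 16-byte atomics");

boost::atomic<Item> slot;   // uses cmpxchg16b on x64 when lock-free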

MSVC could have implemented the same thing for std::atomic<16-byte type>, but it still hasn't, due to ABI compatibility issues. (It is expected to implement this in some future version that breaks the current ABI.)

This will give you cmpxchg16b for every operation, even for relaxed loads and relaxed stores. And that is the best you can get with a guarantee on any CPU.

There are single-instruction loads/stores for 128-bit types, but they are not guaranteed to be atomic. See SSE instructions: which CPUs can do atomic 16B memory operations?


Regarding volatile

In MSVC, the semantics of volatile reads and writes are controlled by /volatile:ms / /volatile:iso; see /volatile (volatile Keyword Interpretation). That is, with /volatile:ms, which is the default on x86-64, volatile loads are supposed to be acquire and volatile stores are supposed to be release.

However:

  • simple loads/stores are not guaranteed to be atomic for 128-bit types on any CPU (see the linked answer above)
  • even if they happen to be atomic on a given CPU, 128-bit types are not atomic as far as the compiler is concerned, so it will not necessarily generate a single-instruction access.

So if you still want to rely on 128-bit loads/stores being atomic on your target CPU, you'd better use the _mm_load_si128 / _mm_store_si128 intrinsics, use alignas(16) to ensure proper alignment, and use _ReadWriteBarrier to prevent compiler reordering. This may work: the compiler is likely to generate the 128-bit load/store you asked for, and that load/store may indeed be atomic on a given CPU.
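
For example, a sketch of that last approach (placeholder names; this deliberately relies on the unguaranteed assumption that aligned 16-byte SSE accesses are atomic on the target CPU, and exactly where the _ReadWriteBarrier calls belong depends on the surrounding deque code):

#include <emmintrin.h>   // _mm_load_si128 / _mm_store_si128
#include <intrin.h>      // _ReadWriteBarrier

struct alignas(16) Item { long long lo, hi; };   // placeholder 16-byte item

// NOT guaranteed atomic by any specification; relies on the CPU performing
// aligned 16-byte SSE loads/stores as single, indivisible accesses.
static Item load16_sse(const Item* p) {
    __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(p));
    _ReadWriteBarrier();   // compiler-only barrier, emits no fence instruction
    Item out;
    _mm_store_si128(reinterpret_cast<__m128i*>(&out), v);
    return out;
}

static void store16_sse(Item* p, Item v) {
    __m128i x = _mm_load_si128(reinterpret_cast<const __m128i*>(&v));
    _ReadWriteBarrier();   // keep the compiler from moving the 16-byte store
    _mm_store_si128(reinterpret_cast<__m128i*>(p), x);
}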