My understanding is that:
- CAS, FAA, and similar operations usually operate on the machine word size (e.g. 64 bits).
- The hard parts of implementing these operations in hardware have to do with communication/synchronization/caches/etc. across cores.
- All cross-core communication is already in units of cache lines, which are a multiple of the machine word size (e.g. 64 bytes vs 64 bits).
- There is such a thing as double-width CAS (e.g. CMPXCHG16B on x86-64), so there is clearly some need for bigger-than-machine-word atomics, and the ability to implement them.
- However, there is no quad-width or octo-width (e.g. 32 bytes, 64 bytes) CAS (I think?)
- Similarly there is no vectorized (e.g. SSE or AVX) CAS (I think?)
Why do larger CAS sizes not exist? Is it purely because they are unusual and nobody has found a use for them, or are there implementation difficulties? To me it seems like the implementation comes "for free" for widths up to the cache line size, so the hardware might as well provide them.
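
To make the sizes in the list above concrete, here is a minimal C++17 sketch of what I mean. The helper names (`cas_word`, `faa_word`, `always_lock_free`, `demo`) and the `Wide16`/`Wide32`/`Wide64` payload types are made up for illustration, not taken from any library. The first two helpers are ordinary word-sized CAS and FAA; the template only asks the toolchain at compile time whether it promises a lock-free atomic for a given type.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Word-sized CAS: store `desired` only if `target` still holds `expected`.
// Returns true when the swap happened.
bool cas_word(std::atomic<std::uint64_t>& target,
              std::uint64_t expected, std::uint64_t desired) {
    return target.compare_exchange_strong(expected, desired);
}

// Word-sized FAA: add `delta` and return the value held before the add.
std::uint64_t faa_word(std::atomic<std::uint64_t>& target,
                       std::uint64_t delta) {
    return target.fetch_add(delta);
}

// Hypothetical payloads at the wider sizes I am asking about.
struct Wide16 { std::uint64_t w[2]; };              // double width (CMPXCHG16B-sized)
struct Wide32 { std::uint64_t w[4]; };              // "quad width"
struct alignas(64) Wide64 { std::uint64_t w[8]; };  // one full 64-byte cache line

static_assert(sizeof(Wide16) == 16, "unexpected padding");
static_assert(sizeof(Wide32) == 32, "unexpected padding");
static_assert(sizeof(Wide64) == 64, "unexpected padding");

// Compile-time query: does the toolchain guarantee that std::atomic<T>
// is lock-free on this target? (is_always_lock_free needs C++17.)
template <typename T>
constexpr bool always_lock_free() {
    return std::atomic<T>::is_always_lock_free;
}

// Small self-check of the word-sized helpers.
void demo() {
    std::atomic<std::uint64_t> counter{5};
    assert(cas_word(counter, 5, 7));    // 5 -> 7 succeeds
    assert(!cas_word(counter, 5, 9));   // stale expected value, no change
    assert(faa_word(counter, 3) == 7);  // returns the old value
    assert(counter.load() == 10);
}
```

`std::atomic<Wide32>` and `std::atomic<Wide64>` are still valid types, but as far as I can tell mainstream toolchains implement them with a lock rather than a single instruction, so I would expect `always_lock_free<Wide32>()` and `always_lock_free<Wide64>()` to be false. The answer for `Wide16` seems to depend on the compiler and target flags, and actually using a 16-byte atomic may require linking a support library such as libatomic.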