Can I use LDREX/STREX to implement a spin lock without enabling SCU in a multicore ARM Cortex-A9 SoC?


I know this might be a strange usage. I just want to know if I can use LDREX/STREX with SCU disabled.

I am using a dual-core Cortex-A9 SoC. The two cores run in AMP mode: each core has its own OS. Although the memory controller is a shared resource, each core has its own memory space and cannot access the other's. Because no cache coherency is required, the SCU isn't enabled. I also have a shared memory region that both cores can access. This shared region is non-cached to avoid cache coherency issues.

I define a spin lock in this shared memory region. The spin lock protects access to shared resources. Right now, it is implemented simply like this:

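#include <stdint.h>

void spin_lock(volatile uint32_t *lock)
{
    /* Racy: the other core can take the lock between this read... */
    while (*lock);
    /* ...and this write, so both cores may believe they hold it. */
    *lock = 1;
}
void spin_unlock(volatile uint32_t *lock)
{
    *lock = 0;
}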

where lock is a variable in shared memory, so both cores can access it.

The problem with this implementation is that access to lock is not exclusive: both cores can read 0 and then both write 1. That's why I want to use LDREX/STREX to implement the spin lock. Please allow me to restate my question:

Can I use LDREX/STREX without SCU enabled?

Thank you!


There are 3 best solutions below

2
On

So ... the direct answer to your question is that, yes, it is possible, so long as something else out in the memory system implements an exclusive monitor for the shared memory region. If it does not, then your STREXs will always receive an OKAY bus response (rather than EXOKAY), observable as a failure (a value of 1) in the result register.
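For illustration, here is a minimal sketch of the kind of lock the question is after. It assumes a GCC or Clang toolchain and uses the __atomic builtins instead of hand-written assembly; when targeting ARMv7-A these are typically compiled to an LDREX/STREX retry loop with DMB barriers. The function names mirror the question's. If no exclusive monitor covers the shared region, the STREX inside the exchange would be expected to keep failing, so spin_lock would hang rather than ever acquire the lock.

```c
#include <stdint.h>

/* Sketch only: relies on GCC/Clang __atomic builtins (an assumption,
 * the question does not name a toolchain). */
static inline void spin_lock(volatile uint32_t *lock)
{
    /* Atomically write 1 and fetch the old value; 0 means we now own it. */
    while (__atomic_exchange_n(lock, 1u, __ATOMIC_ACQUIRE) != 0u) {
        /* Lock is held: wait on plain loads until it looks free again. */
        while (__atomic_load_n(lock, __ATOMIC_RELAXED) != 0u)
            ;
    }
}

/* Single attempt: returns 1 if the lock was taken, 0 if it was held. */
static inline int spin_trylock(volatile uint32_t *lock)
{
    return __atomic_exchange_n(lock, 1u, __ATOMIC_ACQUIRE) == 0u;
}

static inline void spin_unlock(volatile uint32_t *lock)
{
    /* Release ordering makes earlier writes visible before the unlock. */
    __atomic_store_n(lock, 0u, __ATOMIC_RELEASE);
}
```

Calling spin_trylock in a bounded loop during bring-up does not help detect a missing monitor, because the retry loop lives inside the builtin; a watchdog or a hand-written LDREX/STREX probe that inspects the status register would be needed for that.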

However, why would you not enable the SCU? Clearly, what you are trying to do requires a coherent view of memory between the two operating systems for at least that region. And with PIPT data caches, you are not going to see any aliasing of cache lines depending on how they are mapped in each image.

3
On

Overall, the answer is no. There are two issues here:

1) You cannot use load/store exclusive on uncached memory. The exclusive operations operate only on "normal" idempotent memory.

2) The ARM manual doesn't specify how exclusive monitors work in conjunction with memory coherence, but any sane implementation is essentially going to put the monitor in the cache line acquisition mechanism. If you disabled cache line snooping, you have most likely rendered the monitors non-functional on your chip.

4
On

Your only (poorly formed) question,

Can I use LDREX/STREX without SCU enabled?

In an ideal ARM universe, yes, it is possible. I.e., it is possible that somewhere, some day, you might be able to do this. I think you mean,

Can I use LDREX/STREX without SCU enabled in my system?

Unfortunately, the ARM ARM is a bit of a political/bureaucratic document. You must take extreme care when reading "strongly advised", "UNPREDICTABLE", "UNKNOWN" and "can". All programmers would like ldrex/strex to apply to all memory. In fact, if the bus controller (typically an AXI NIC) implemented a monitor, then there would be no trouble supporting the much loved swp instruction. There are various posts on StackOverflow where people want to replace swp with ldrex/strex.

After you read and re-read the doublespeak of the ARM ARM (it is written for the programmer, but also for the silicon implementer), it becomes pretty clear that the monitor logic is probably implemented in the cache. A cache controller must implement dirty line broadcasts. Dirty line broadcasts are very similar to a 'monitor', and your 'reservation granule' is most likely a cache line in size (what a coincidence).

The ARM ARM is written as a generic document for people who may wish to implement a Cortex-A CPU. It is written so that their hands (creativity) are not tied to implementing the monitor within the cache.
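If the reservation granule really is a cache line, two locks that share a line also share a reservation, so activity on one can make a STREX on the other fail. A minimal sketch of padding a lock to one granule follows. The type name padded_lock_t and the constant LOCK_GRANULE are hypothetical, and the value 32 is an assumption based on the Cortex-A9 L1 line size; the actual granule has to come from your SoC's documentation.

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed granule size in bytes (Cortex-A9 L1 line size); verify
 * against the SoC documentation before relying on it. */
#define LOCK_GRANULE 32

/* Hypothetical lock type padded so each lock owns a whole granule. */
typedef struct {
    _Alignas(LOCK_GRANULE) volatile uint32_t value;
    uint8_t pad[LOCK_GRANULE - sizeof(uint32_t)];
} padded_lock_t;

/* Compile-time checks that size and alignment match the granule. */
_Static_assert(sizeof(padded_lock_t) == LOCK_GRANULE, "one lock per granule");
_Static_assert(_Alignof(padded_lock_t) == LOCK_GRANULE, "granule aligned");
```

An array of padded_lock_t then places consecutive locks exactly one granule apart.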

So you need to read the specific documentation on your particular Cortex-A9 SoC. It will probably only support ldrex/strex with cached memory. In fact, it is advisable to issue a pld to ensure the memory is in cache before doing the ldrex, and this will mean you need to activate the SCU in your system. I guess you are concerned about some additional cycle(s) that the SCU will add to latency?

I think some of this information has confused many extremely intelligent people. Beware the difference between possible and is. Every person on StackOverflow probably desires the case where the monitor is implemented in the bus controller (or core memory chip). However, for most real chips, this is not the case.

For certain, if you want to future-proof your code/OS to port to newer or different Cortex-A CPUs, you should not make this assumption even if your chipset does support a 'global monitor' outside the cache sub-systems.
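As a rough sketch of the pld-before-ldrex idea, assuming a GCC or Clang toolchain: __builtin_prefetch is usually emitted as a PLD-family instruction on ARM, and the atomic exchange as an LDREX/STREX loop. The helper name spin_trylock_prefetch is hypothetical. The prefetch is only a hint and is not expected to do anything useful on a non-cached region, so this only makes sense once the lock lives in cached, coherent memory.

```c
#include <stdint.h>

/* Hypothetical helper: prefetch the lock's line, then try once to take it.
 * Returns 1 on success, 0 if the lock was already held. */
static inline int spin_trylock_prefetch(volatile uint32_t *lock)
{
    /* rw = 1 hints an upcoming write; locality = 3 asks to keep the
     * line cached. The compiler and CPU are free to ignore the hint. */
    __builtin_prefetch((const void *)lock, 1, 3);

    /* Atomically write 1 and fetch the old value; 0 means it was free. */
    return __atomic_exchange_n(lock, 1u, __ATOMIC_ACQUIRE) == 0u;
}
```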