atomic linked-list LIFO in AArch64 assembly, using load or store between ldxr / stxr

457 Views Asked by At

I had implemented a LIFO for shared memory context using assembly for ARMv8 64bit.

The LIFO inserts a node in beginning and each node structure's first attribute must be next pointer.

Is this correct assembly for implementing atomic insert and delete for the LIFO?

It uses LL/SC with an extra load or store between the LDXR and STXR to read the head->next or store a pointer into a new node.

typedef union {
   void * head[1];
}lifo;
int atomic_lifo_init(lifo * h) {
if (h) {
   h->head[0]=NULL;
  }
}
inline void *
 atomic_lifo_delete (lifo *h)
 {
         void    *ret = NULL;
         /*sa_ignore UNUSED_VAR*/
         void *  tmp = NULL;
 
         asm volatile ("\n"
                 "2:  ldxr      %0,[%2] \n" //load the head from h, which points the 1st node of list to local ret pointer.
                 "    cbz     %0, 3f \n" //check if the lifo is empty.
                 "    ldr      %1,[%0] \n" //store in tmp the 2nd node by derefencing the ret (h->head points 1st node. value of each node is next node as 1st attribute of node structure is pointing next.)
                 "    stxr     %w1, %1,[%2] \n" //update h->head with tmp.
                 "    cbnz     %w1, 2b \n" //continue if failed
                 "3:              \n"
                 : "=&r" (ret), "=&r" (tmp)
                 : "r" (h)
                 : "memory"
                 );
         return(ret);
 }
 
 /*
  * atomic_lifo_insert()
  *      Put an element on a list, protected against SMP
  */
 void
 atomic_lifo_insert (lifo *h, void *__new)
 {
         /*sa_ignore UNUSED_VAR*/
         void * next = NULL;
         void * flag = NULL;
         asm volatile (" \n"
                 "1: ldxr      %1,[%2] \n" //load head[0] from h,which points 1st node to local next pointer
                 "   str      %1,[%3] \n" //store the local next pointer to value of __new, as 1st attribute of the any node is next (convention used here). so __new's next is pointing current 1st node.
                 "   stxr    %w0, %3,[%2] \n" //update the h->head with 
   __next.
                 "   cbnz    %w0, 1b \n" //if stxr is failure try again.
                 : "=&r" (flag), "=&r" (next)
                 : "r" (h), "r" (__new)
                 : "memory"
                 );
 }

I am really new to ARM assembly, so any help is really appreciated.

1

There are 1 best solutions below

8
On

Your inline asm constraints look correct, this should compile the way you intended. You probably could use "+m" (*h) to let the compiler pick an addressing mode instead of hard-coding it with "r"(h) and [%2].

As far as ordering, the ldr is dependency-ordered after the ldxr (like C11 memory_order_consume), so that works.

But the str between the LL/SC in insert might not become visible until after stxr publishes the address to other threads. So I think you need stlxr (a release-store) in insert.


Doing an extra load or store between LDXR/STXR is not safe. Wikipedia's LL/SC article mentions that some implementations can spuriously fail if you do any load or store between the LL and SC. Wiki says PowerPC specifically does allow this. But AArch64 in general explicitly does not: per the ARM ref manual (see @James Greenhalgh's comment):

LoadExcl/StoreExcl loops are guaranteed to make forward progress only if [...] Between the Load-Exclusive and the Store-Exclusive, there are no explicit memory accesses.

There can be some AArch64 CPUs where you'd create an infinite loop of stxr failure, but there might be others where it does work. If it works in practice on a CPU you test, probably a good idea to see if there's any documentation to back that up.

This is most likely to be a problem if the new_ node happens to be in the same cache line (or LL/SC exclusivity block) as the head node. Make sure you test that case, or make it impossible somehow, if this works at all on the microarchitecture(s) you care about.


Other than that I think your overall algorithm looks correct, so if you've tested it and found it works then that's probably good.

However, I haven't really carefully thought through your algorithm; I also don't have any experience designing or using stuff around raw LL/SC. I know in principle how they work, but I'm not prepared to say this is definitely correct. I don't see any problems from the little bit of thinking about it I've done, but that doesn't mean there aren't any.