Atomic operations in an OpenCL kernel


While trying to find more details about atomic operations in kernels, I found something strange. As I understand it, when atomic operations are applied to a single memory location, all such operations from all threads are serialized on that location to preserve its integrity. The following is a piece of my kernel code:

    if(atomic_cmpxchg(&A[ptr],0,-1) == -1)
        ptr = A[ptr + 3];

    //To delay
    uint k = 1000000;
    while(k--);

    A[ptr + 3] = newValue;

For the above code, suppose there are only two threads, T1 and T2. As I understand it, both T1 and T2 execute the snippet, but when they reach the atomic_cmpxchg, T2 has to wait for T1 to finish (assume T1 runs first). As designed, when T1 reads A[ptr], its old value is 0, so it is atomically changed to -1. Because the returned old value (0) does not satisfy the condition, T1 skips the if-body and goes straight into the delay loop, where it is held up. Now it is T2's turn on A[ptr]. Since A[ptr] has already been set to -1, the condition is satisfied for T2, so T2 executes "ptr = A[ptr + 3];".

Here is my problem: T2 executes "ptr = A[ptr + 3];" immediately after the comparison, while T1 is still stuck in the delay (k is so big that the delay should be long) and has not yet updated A[ptr + 3]. Therefore T2 should not read the up-to-date value of A[ptr + 3], which is supposed to be newValue. Yet my experiment shows that no matter how big I make k, the result is always correct: T2 always reads newValue, regardless of how long T1 is delayed. Can anyone help me understand this case? Many thanks.
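For reference, a minimal self-contained kernel built around the snippet above might look like the following (the kernel name, the starting value of ptr, and the two-work-item launch are simplifying assumptions, not my exact setup):

    // Sketch only. Assumptions: ptr starts at 0, A[0] holds the flag
    // (initialized to 0 by the host), and exactly two work-items are launched.
    __kernel void handoff(__global int *A, int newValue)
    {
        int ptr = 0;

        // The first work-item to reach the atomic sees 0, swaps it to -1 and
        // skips the body; a later work-item sees -1 and follows the pointer
        // stored at A[ptr + 3].
        if (atomic_cmpxchg(&A[ptr], 0, -1) == -1)
            ptr = A[ptr + 3];

        // Intended as a busy-wait delay before the final store.
        uint k = 1000000;
        while (k--);

        A[ptr + 3] = newValue;
    }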

1 Answer

  1. The compiler is probably smart enough to figure out that your "delay" loop has no side effects and to optimize it away completely (a sketch of a delay loop with an observable side effect follows after this list).

  2. On GPUs, OpenCL work items from the same work group typically run in lock-step (at least to a certain degree, depending on the exact hardware). This means that both threads execute the same instruction at the same time; they essentially share an instruction pointer. In case of divergent control flow, each thread remembers whether it is currently active and only executes the current instruction if it is. Atomic operations are still serialized (a small host-side illustration of this masking follows below).
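
Regarding point 1: a delay loop along the following lines is much harder for the compiler to discard, because every iteration performs a real memory access and the loop's result is written out. This is only a sketch; the kernel name, the extra `sink` buffer, and the iteration count are assumptions, not part of the original question:

    // Sketch of a delay loop with observable side effects (the names and the
    // extra `sink` buffer are hypothetical, not from the original question).
    __kernel void delayed_update(__global int *A, __global uint *sink, int newValue)
    {
        volatile uint k = 1000000;     // volatile: every k-- is a real load/store
        uint acc = 0;
        while (k--)
            acc += k;                  // the result depends on every iteration

        sink[get_global_id(0)] = acc;  // writing acc out keeps the loop live

        A[3] = newValue;               // the store the other work-item is waiting for
    }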
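To make the masking described in point 2 more concrete, here is a small host-side C sketch (purely illustrative, not OpenCL and not tied to any particular GPU) that mimics two lock-step threads sharing one instruction stream, where a per-thread active flag decides which of them actually applies the divergent if-body:

    /* Illustrative host-side simulation of lock-step execution with an
       execution mask; this is not how a driver implements it internally. */
    #include <stdbool.h>
    #include <stdio.h>

    int main(void)
    {
        int  flag = 0;     /* plays the role of A[ptr] */
        int  old[2];       /* value each thread gets back from the compare-exchange */
        bool active[2];    /* per-thread execution mask */

        /* The atomic itself is serialized: thread 0 swaps 0 -> -1,
           thread 1 then reads back -1. */
        for (int t = 0; t < 2; t++) {
            old[t] = flag;
            if (flag == 0)
                flag = -1;
        }

        /* Divergent branch: the shared if-body is issued once, and each
           thread applies it only if its mask bit is set. */
        for (int t = 0; t < 2; t++) {
            active[t] = (old[t] == -1);
            printf("thread %d %s the if-body\n",
                   t, active[t] ? "executes" : "skips");
        }
        return 0;
    }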