Atomic, scalable, monotonic counter with boundary

651 Views Asked by At

I have a critical code path where threads use an atomic increment on an integer to count the number of events that have happened globally. This is reasonably fast, but still requires that the cache line holding the integer bounces between cores. In a NUMA system, this creates a lot of MESI traffic.

The pseudo code of the hot pat is that all threads do this:

const int CHECK_VALUE = 42;

int counterNew = counter++;
if (counterNew == CHECK_VALUE) {
  Do Final work
}

The counter is monotonically increasing and the value it must reach is known in advance.

At least one thread must conclude that the global counter has reached CHECK_VALUE after it has incremented counter. It is acceptable that more than one thread draws that conclusion (I can always synchronise them at that point - as that is no longer the hot path).

Is it possible to do better than using atomic increment to track the value of counter if I know it is monotonic and the final value is known?

2

There are 2 best solutions below

0
On

Without synchronization, it is possible that the counter will stay stuck at 0. In reality it will not have that race condition nearly so often, so the counter will be roughly accurate. I think you can prove that no value will be skipped in the counter sequence: it is not possible to change the counter to 2 if it was not previously 1, which applies to every value the counter might hold. So a global counter using ++ instead of atomic increment would work if it were ok to miss a few events. However, even unsynchronized, this is still going to cause some of the memory issues you want to avoid (resynchronizing cache lines accross CPU's).

Another way to do it is to poll. Each thread could count it's events in its own private data. Another thread could poll once a minute to see if the # of events is > threshold.

Another way you could do it is to bump an internal counter in the thread data, and when that hits 10, bump the global counter. This would cut the # of global increments by 10.

Another way would be to bump an internal counter in a thread. Do the synchronization whenever an individual thread has reached cEvents/threadcount.

Another way would be to bump an internal counter in a thread. Whenever an individual thread has reached some limit, check the other thread counts to see if together they > threadcount. This is about the same as using a polling thread, but without using another thread.

There are lots of ways to do stuff like this with private counters. It all depends on the accuracy you need.

1
On

You can do it with atomic CAS operation (compare ans swap). On i386 architecture, this is instruction CMPXCHG. If needed, you can use small assembly function, implements CAS on your platform, or ask me here about Intel implementation. Your code must be following:

int local_cnt;
// Atomic increment counter 
do {
  local_cnt = counter;
} while(cas(&counter, local_cnt, local_cnt + 1) != local_cnt);

// check old counter value
if(local_cnt == CHECK_VALUE) { 
  // do something
}