I have a critical code path where threads use an atomic increment on an integer to count the number of events that have happened globally. This is reasonably fast, but it still requires the cache line holding the integer to bounce between cores. On a NUMA system this creates a lot of MESI coherence traffic.
The pseudocode of the hot path, which every thread executes, is:
    std::atomic<int> counter{0};
    const int CHECK_VALUE = 42;

    int counterNew = ++counter;       // atomic increment; yields the new value
    if (counterNew == CHECK_VALUE) {
        // do final work
    }
The counter is monotonically increasing and the value it must reach is known in advance.
At least one thread must conclude that the global counter has reached CHECK_VALUE after it has incremented counter. It is acceptable that more than one thread draws that conclusion (I can always synchronise them at that point, as that is no longer the hot path).
Is it possible to do better than using an atomic increment to track the value of counter, given that it is monotonic and its final value is known in advance?
Without synchronization it is possible, in the worst case, for the counter to stay stuck at 0 because increments can be lost. In practice that race will be rare, so the counter will be roughly accurate. I think you can also prove that no value in the sequence is skipped: the counter cannot become 2 unless it was previously 1, and the same applies to every value it might hold. So a global counter using a plain ++ instead of an atomic increment would work if it were acceptable to miss a few events. However, even unsynchronized, this still causes much of the memory traffic you want to avoid (resynchronizing cache lines across CPUs).
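To make that concrete, here is a minimal sketch of the "plain ++" idea. I've written it as a separate relaxed load and store on a std::atomic (my own choice, so it stays legal C++ rather than a formal data race), which has the same lost-update behaviour as an unsynchronized increment; recordEvent is an illustrative name.

    #include <atomic>

    constexpr int CHECK_VALUE = 42;
    std::atomic<int> counter{0};

    void recordEvent() {
        // Not an atomic read-modify-write: two threads can read the same value
        // and both store value + 1, so one of the increments is lost.
        int value = counter.load(std::memory_order_relaxed) + 1;
        counter.store(value, std::memory_order_relaxed);
        if (value == CHECK_VALUE) {
            // do final work (may fire late, or never, if enough increments are lost)
        }
    }

Note that the shared cache line still migrates between cores on every call, which is the memory traffic issue mentioned above.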
Another way to do it is to poll. Each thread could count its events in its own private data, and another thread could poll once a minute to see whether the total number of events has crossed the threshold.
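A minimal sketch of that polling variant, assuming a fixed number of worker threads known up front (NUM_THREADS, LocalCounter, recordEvent and pollerLoop are illustrative names, not anything from the question):

    #include <atomic>
    #include <chrono>
    #include <thread>

    constexpr int CHECK_VALUE = 42;
    constexpr int NUM_THREADS = 8;            // assumption: worker count known up front

    struct alignas(64) LocalCounter {         // pad so each thread owns a full cache line
        std::atomic<int> count{0};
    };
    LocalCounter localCounts[NUM_THREADS];

    // Hot path: each worker only touches its own cache line.
    void recordEvent(int threadIndex) {
        localCounts[threadIndex].count.fetch_add(1, std::memory_order_relaxed);
    }

    // Cold path: a dedicated thread wakes up periodically and sums the counters.
    void pollerLoop(std::atomic<bool>& stop) {
        while (!stop.load(std::memory_order_relaxed)) {
            int total = 0;
            for (auto& c : localCounts)
                total += c.count.load(std::memory_order_relaxed);
            if (total >= CHECK_VALUE) {
                // do final work; detection lags by at most one polling interval
                break;
            }
            std::this_thread::sleep_for(std::chrono::minutes(1));
        }
    }

The trade-off is detection latency: the hot path never touches shared state, but the threshold is noticed only on the next poll.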
Another way you could do it is to bump an internal counter in the thread's data and, when it hits 10, add those 10 to the global counter. This cuts the number of global increments by a factor of 10.
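A minimal sketch of that batching idea (BATCH, globalCount and recordEvent are illustrative names; the batch size of 10 is taken from the text):

    #include <atomic>

    constexpr int CHECK_VALUE = 42;
    constexpr int BATCH = 10;              // flush to the global counter every 10 events
    std::atomic<int> globalCount{0};
    thread_local int localCount = 0;       // private, so no cache-line bouncing

    void recordEvent() {
        if (++localCount == BATCH) {
            localCount = 0;
            // One atomic RMW per BATCH events instead of one per event. Relaxed is
            // enough for counting; the final-work path can synchronise separately,
            // as the question already plans to do.
            int newValue = globalCount.fetch_add(BATCH, std::memory_order_relaxed) + BATCH;
            // Use >= because batched adds can jump over CHECK_VALUE exactly.
            if (newValue >= CHECK_VALUE) {
                // do final work
            }
        }
    }

The cost is accuracy: up to BATCH - 1 events per thread can sit unflushed, so the global count can lag behind the true total.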
Another way would be to bump an internal counter per thread and only do the global synchronization once an individual thread has reached cEvents/threadcount, i.e. its share of the target (a sketch covering this and the next variant follows below).
Another way would be to bump an internal counter per thread and, whenever an individual thread reaches some limit, sum the other threads' counts to see whether together they exceed the target. This is about the same as using a polling thread, but without dedicating another thread to it.
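A minimal sketch covering those last two variants, assuming a known thread count; NUM_THREADS, QUOTA, LocalCounter, localCounts and recordEvent are illustrative names, and QUOTA stands in for the cEvents/threadcount limit:

    #include <atomic>

    constexpr int CHECK_VALUE = 42;
    constexpr int NUM_THREADS = 8;                     // assumption: known thread count
    constexpr int QUOTA = CHECK_VALUE / NUM_THREADS;   // each thread's share of the target

    struct alignas(64) LocalCounter {                  // padded to its own cache line
        std::atomic<int> count{0};
    };
    LocalCounter localCounts[NUM_THREADS];

    void recordEvent(int threadIndex) {
        int mine = localCounts[threadIndex].count.fetch_add(1, std::memory_order_relaxed) + 1;
        if (mine >= QUOTA) {
            // Slow path: only entered once this thread has seen its share of the events.
            int total = 0;
            for (auto& c : localCounts)
                total += c.count.load(std::memory_order_relaxed);
            if (total >= CHECK_VALUE) {
                // do final work (more than one thread may get here, which the question allows)
            }
        }
    }

If events are spread unevenly across threads, detection can lag until the busier threads see more events, which is exactly the accuracy trade-off mentioned below.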
There are lots of ways to do stuff like this with private counters. It all depends on the accuracy you need.