Would false sharing happen in the following program?
Memory
- 1 array divided into 4 equal regions:
[A1, A2, B1, B2]
- The whole array can fit into L1 cache in the actual program.
- Each region is padded to be a multiple of 64 bytes.
Steps
1. thread 1 writes to regions A1 and A2 while thread 2 writes to regions B1 and B2.
2. barrier
3. thread 1 reads B1 and writes to A1 while thread 2 reads B2 and writes to A2.
4. barrier
5. Go to step 1.
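For reference, the index ranges these regions correspond to in the test program below (derived from its indexing, with N = 64 and 4-byte elements):
A1 = x[0..15], A2 = x[16..31], B1 = x[32..47], B2 = x[48..63]
Each region is 16 * 4 = 64 bytes, i.e. one cache line on typical x86 hardware.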
Test
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    int N = 64;
    std::vector<std::int32_t> x(N, 0);
    #pragma omp parallel
    {
        for (int i = 0; i < 1000; ++i) {
            // Step 1: the j = 0 chunk writes A1 and A2 (x[0..31]),
            // the j = 1 chunk writes B1 and B2 (x[32..63]).
            #pragma omp for
            for (int j = 0; j < 2; ++j) {
                for (int k = 0; k < (N / 2); ++k) {
                    x[j * N / 2 + k] += 1;
                }
            }
            // Step 3: the j = 0 chunk reads B1 and writes A1,
            // the j = 1 chunk reads B2 and writes A2.
            #pragma omp for
            for (int j = 0; j < 2; ++j) {
                for (int k = 0; k < (N / 4); ++k) {
                    x[j * N / 4 + k] += x[N / 2 + j * N / 4 + k] - 1;
                }
            }
        }
    }
    for (auto i : x) std::cout << i << " ";
    std::cout << "\n";
}
Result
First 32 elements: 500500 (= 1000 * 1001 / 2)
Last 32 elements: 1000
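These values follow from the steps: in iteration i the back half only gains 1 in step 1 (so 1000 after 1000 iterations), while the front half gains 1 in step 1 and then i - 1 in step 3 (the back half's value minus 1), i.e. i per iteration, which sums to 1 + 2 + ... + 1000 = 500500.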
There is some false sharing in your code, since x is not guaranteed to be aligned to a cache line. Padding is not necessarily enough.
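One way to rule the alignment question out, as a minimal sketch (the 64-byte value is an assumption about the cache-line size, and a std::array stands in for the vector here):

#include <array>
#include <cstdint>
#include <iostream>

int main() {
    // Over-aligned buffer: with 64-byte alignment, each 16-element
    // quarter occupies exactly one (assumed 64-byte) cache line.
    alignas(64) std::array<std::int32_t, 64> x{};
    std::cout << (reinterpret_cast<std::uintptr_t>(x.data()) % 64) << "\n";  // prints 0
}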
In your example N is really small, which may be a problem: at this size, the biggest overhead is probably worksharing and thread management. If N is sufficiently large, i.e. array-size / number-of-threads >> cache-line-size, false sharing is not a relevant problem.
Alternating writes to A2 from different threads is also not optimal in terms of cache usage, but that is not a false sharing issue.
Note that you do not need to split the loops: if a loop accesses memory contiguously, a single workshared loop is just fine.
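A minimal sketch of what such a single loop could look like, replacing the first split loop inside the parallel region of the test program above:

// one contiguous workshared loop over all N elements (same effect as step 1)
#pragma omp for
for (int j = 0; j < N; ++j) {
    x[j] += 1;
}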
If you are really careful, you may add schedule(static); then you have a guarantee of an even, contiguous work distribution.
Remember that false sharing is a performance issue, not a correctness problem, and it is only relevant if it occurs frequently. A typical bad pattern is each thread writing to vector[my_thread_index].
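For illustration, a minimal sketch of that bad pattern (the variable names here are made up): one tightly packed 4-byte counter per thread, all sharing a cache line and written in a hot loop.

#include <cstdint>
#include <vector>
#include <omp.h>

int main() {
    // One counter per thread, packed back to back: all counters live in
    // the same cache line, so every write invalidates the other threads'
    // cached copies of that line.
    std::vector<std::int32_t> counter(omp_get_max_threads(), 0);
    #pragma omp parallel
    {
        int my_thread_index = omp_get_thread_num();
        for (int i = 0; i < 10000000; ++i) {
            counter[my_thread_index] += 1;  // false sharing hot spot
        }
    }
}

Padding each counter to its own cache line, or accumulating into a thread-local variable and writing the result once at the end, removes the effect.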