I was wondering how to interpret the following OpenMP constructs:
#pragma omp target teams distribute parallel for
for(int i = 0; i < N; ++i) {
// compute
}
#pragma omp target teams distribute parallel for simd
for(int i = 0; i < N; ++i) {
// compute
}
Note the simd clause added on the second loop. According to the OpenMP 5.1 specification, this clause declares that "multiple iterations of the loop can be executed concurrently by using SIMD instructions".
I can very well conceive how simd is implemented and behaves on a CPU, but on a GPU (more precisely, on AMD GPUs) there is no such thing as an exposed SIMD instruction, in the sense that a HIP thread is in fact already a SIMD instruction lane.
According to the OpenMP specification, if there is a loop-carried dependency, or if the compiler cannot prove that there is none, then when OpenMP maps the teams to thread blocks/workgroups and the threads to SIMD lanes, it is forced to use thread blocks of only one thread.
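To make the dependency point concrete, here is my own minimal example (not taken from the specification or from the code linked below): with an explicit loop-carried dependency, the safelen clause is the usual way to tell the compiler how many iterations may safely run concurrently; without such a guarantee it has to assume the worst case.

// Minimal illustration (my own example): a[i] depends on a[i - 16],
// i.e. a loop-carried dependency of distance 16. safelen(8) promises the
// compiler that concurrently executed iterations are never more than
// 8 apart, so SIMD execution remains correct for this loop.
void shift_add(float *a, int n) {
  #pragma omp simd safelen(8)
  for (int i = 16; i < n; ++i)
    a[i] = a[i - 16] + 1.0f;
}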
How do you interpret the target teams distribute parallel for simd construct?
- Does it mean that, in this context, simd cannot be translated for a GPU?
- Or is each thread handled as if it had a single SIMD lane?
There is at least one similar but old and unanswered question: How is omp simd for loop executed on GPUs?
According to the test case below, the assembly generated for an AMD MI250 (gfx90a) is the same with or without simd. However, if you look at the CPU code, you will see a significant change with the simd clause, which in this case enables an optimization similar to the one obtained with an explicit use of the restrict keyword.

TLDR: Currently, the simd clause is irrelevant for the GPU target and only leads to this warning, even for extremely trivial cases:

loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]

Code available at: https://godbolt.org/z/sMY48s8jz
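For reference, the kind of kernel being compared is roughly the following (a simplified sketch, not the exact code behind the godbolt link):

// Simplified sketch of the test case (not the exact godbolt code).
// On the CPU, the simd clause lets the compiler assume the iterations are
// independent and vectorize even though x and y are not declared restrict;
// for gfx90a the generated assembly is identical with or without simd.
void axpy(const float *x, float *y, int n) {
  #pragma omp target teams distribute parallel for simd map(to: x[0:n]) map(tofrom: y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] += 2.0f * x[i];
}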