Before I elaborate on the specifics, here is the function I have.
Let _e and _w be arrays of equal size. Let _stepSize be of float type.
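void GradientDescent::backUpWeights(FLOAT tdError) {
    AI::FLOAT multiplier = _stepSize * tdError;
    // Number of elements in _w (and _e); the loop bound was undefined before.
    const size_t n = getSize();
    for (UINT i = 0; i < n; i++) {
        _w[i] += _e[i] * multiplier;
    }
    // Assumes that the tilecode ensures that _w.size() (equivalently _e.size()) is even.
}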
This function works fine, but if the CPU supports SIMD intrinsics (SSE4 in this example), the function below shaves seconds off the run time for the same input. Both versions are compiled with gcc -O3; the second one additionally gets -msse4a.
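void GradientDescent::backUpWeights(FLOAT tdError) {
    AI::FLOAT multiplier = _stepSize * tdError;
    // Broadcast the multiplier into both lanes of a 128-bit register.
    __m128d multSSE = _mm_set_pd(multiplier, multiplier);
    // Reinterpret the arrays as packed pairs of doubles
    // (dereferencing __m128d* requires 16-byte aligned storage).
    __m128d* eSSE = (__m128d*)_e;
    __m128d* wSSE = (__m128d*)_w;
    // Each iteration handles two doubles, so halve the element count.
    size_t n = getSize() >> 1;
    for (UINT i = 0; i < n; i++) {
        wSSE[i] = _mm_add_pd(wSSE[i], _mm_mul_pd(multSSE, eSSE[i]));
    }
    // Assumes that the tilecode ensures that _w.size() (equivalently _e.size()) is even.
}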
Problem:
What I want now is something like this:
void GradientDescent::backUpWeights(FLOAT tdError) {
    AI::FLOAT multiplier = _stepSize * tdError;
#ifdef _mssa4a_defined_
    // Intrinsic path: two doubles per iteration.
    __m128d multSSE = _mm_set_pd(multiplier, multiplier);
    __m128d* eSSE = (__m128d*)_e;
    __m128d* wSSE = (__m128d*)_w;
    size_t n = getSize() >> 1;
    for (UINT i = 0; i < n; i++) {
        wSSE[i] = _mm_add_pd(wSSE[i], _mm_mul_pd(multSSE, eSSE[i]));
    }
#else // No intrinsic
    // Scalar path: one element per iteration.
    size_t n = getSize();
    for (UINT i = 0; i < n; i++) {
        _w[i] += _e[i] * multiplier;
    }
#endif
    // Assumes that the tilecode ensures that _w.size() (equivalently _e.size()) is even.
}
Thus, if I pass -msse4a to gcc, the code in the #ifdef branch is compiled; otherwise the plain loop is compiled. And of course, my plan is to implement this for all intrinsic sets, not just for SSE4A as above.
GCC, ICC (on Linux), and Clang have the following compile options with corresponding defines
Options and defines in GCC and Clang but not in ICC:
AVX512 options which are defined in recent versions of GCC, Clang, and ICC
AVX512 options which will likely be in GCC, Clang, and ICC soon (if not already):
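One way to check which defines a given set of flags actually turns on is a small probe. The sketch below is only illustrative: the function names enabledSimdMacros and hasMacro are hypothetical, and the macro list is a subset of the predefined macros that GCC and Clang use for x86 instruction sets.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: collects the names of the SIMD feature macros that
// the compiler predefined for the current set of command-line flags.
std::vector<std::string> enabledSimdMacros() {
    std::vector<std::string> names;
#ifdef __SSE__
    names.push_back("__SSE__");
#endif
#ifdef __SSE2__
    names.push_back("__SSE2__");
#endif
#ifdef __SSE3__
    names.push_back("__SSE3__");
#endif
#ifdef __SSSE3__
    names.push_back("__SSSE3__");
#endif
#ifdef __SSE4_1__
    names.push_back("__SSE4_1__");
#endif
#ifdef __SSE4_2__
    names.push_back("__SSE4_2__");
#endif
#ifdef __SSE4A__
    names.push_back("__SSE4A__");
#endif
#ifdef __AVX__
    names.push_back("__AVX__");
#endif
#ifdef __AVX2__
    names.push_back("__AVX2__");
#endif
#ifdef __FMA__
    names.push_back("__FMA__");
#endif
#ifdef __AVX512F__
    names.push_back("__AVX512F__");
#endif
    return names;
}

// Hypothetical helper: true if `name` is among the collected macros.
bool hasMacro(const std::vector<std::string>& names, const std::string& name) {
    return std::find(names.begin(), names.end(), name) != names.end();
}
```

Printing the returned names after compiling with, say, -mavx2 and again without it shows which defines each switch brings along.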
Note that many of these switches enable several more, e.g. -mfma also enables and defines AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE (in GCC and Clang it does not by itself enable AVX2, which needs -mavx2).
I'm not 100% sure what the ICC compiler option for AVX512 is. It could be -xMIC-AVX512 instead of -mavx512f.
MSVC only appears to define __AVX__ and __AVX2__.
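As a sketch of how such a define can drive conditional compilation for the loop in the question: the version below tests __SSE2__, because the intrinsics used there (_mm_add_pd, _mm_mul_pd) need only SSE2. It is a free function on std::vector<double> rather than the original class member, the name backUpWeightsSketch is hypothetical, and FLOAT is assumed to be double. Unaligned loads and a scalar tail are used so that no alignment or even-size assumption is needed.

```cpp
#include <cstddef>
#include <vector>

#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical stand-in for GradientDescent::backUpWeights:
// computes w[i] += e[i] * multiplier for every element.
// Assumes w and e have the same size.
void backUpWeightsSketch(std::vector<double>& w,
                         const std::vector<double>& e,
                         double multiplier) {
    const std::size_t n = w.size();
    std::size_t i = 0;
#if defined(__SSE2__)
    // Intrinsic path: process two doubles per iteration.
    const __m128d mult = _mm_set1_pd(multiplier);
    for (; i + 2 <= n; i += 2) {
        const __m128d wv = _mm_loadu_pd(&w[i]);  // unaligned load is safe here
        const __m128d ev = _mm_loadu_pd(&e[i]);
        _mm_storeu_pd(&w[i], _mm_add_pd(wv, _mm_mul_pd(mult, ev)));
    }
#endif
    // Scalar path: the whole array without SSE2, or the odd tail with it.
    for (; i < n; ++i) {
        w[i] += e[i] * multiplier;
    }
}
```

The same pattern extends to other instruction sets by testing their defines first (for example __AVX__ before __SSE2__) and falling through to the plain loop when none is set.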
In your case your code appears to use only SSE2, so if you compile in 64-bit mode (which is the default in a 64-bit user space, or explicitly with -m64) then __SSE2__ is defined. But since you used -msse4a, __SSE4A__ will be defined as well.
Note that enabling an instruction set is not the same as determining whether an instruction set is available. If you want your code to work on multiple instruction sets then I suggest a CPU dispatcher.
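A minimal sketch of what a CPU dispatcher could look like, assuming GCC or Clang on x86: the kernel is compiled once per instruction set and a function pointer is picked at run time. The names Kernel, kernelScalar, kernelAvx, and chooseKernel are hypothetical; __builtin_cpu_init, __builtin_cpu_supports, and the target attribute are GCC/Clang extensions, so other compilers and architectures fall back to the scalar kernel here.

```cpp
#include <cstddef>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SKETCH_X86_DISPATCH 1  // hypothetical macro local to this sketch
#endif

// Signature shared by all kernels: w[i] += e[i] * m for i in [0, n).
using Kernel = void (*)(double* w, const double* e, double m, std::size_t n);

// Portable fallback, always available.
static void kernelScalar(double* w, const double* e, double m, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        w[i] += e[i] * m;
    }
}

#ifdef SKETCH_X86_DISPATCH
// Compiled with AVX enabled for this function only, so the rest of the
// program does not need -mavx.
__attribute__((target("avx")))
static void kernelAvx(double* w, const double* e, double m, std::size_t n) {
    const __m256d mult = _mm256_set1_pd(m);
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {  // four doubles per iteration
        const __m256d wv = _mm256_loadu_pd(w + i);
        const __m256d ev = _mm256_loadu_pd(e + i);
        _mm256_storeu_pd(w + i, _mm256_add_pd(wv, _mm256_mul_pd(mult, ev)));
    }
    for (; i < n; ++i) {  // scalar tail
        w[i] += e[i] * m;
    }
}
#endif

// Picks the best kernel the running CPU supports.
Kernel chooseKernel() {
#ifdef SKETCH_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx")) {
        return kernelAvx;
    }
#endif
    return kernelScalar;
}
```

Calling chooseKernel() once at start-up and storing the returned pointer keeps the feature check out of the hot loop.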