I am using this GPU for my thesis at university.
I am running a lot of different kernels on it, and the execution time is stuck at 12666.6689 ms, even though I have a loop with 88 instructions * 100m iterations.
__kernel void scalar_mult_add(__global int * list)
{
    unsigned int x = 38;
    unsigned int y = 38;
    for (int i = 0; i < 1000000; i++) {
        y = x * y;
        x = x + y;
    }
}
The only thing that makes the execution time increase is adding x != 0 to the for loop's condition:
__kernel void scalar_mult_add(__global int * list)
{
    unsigned int x = 38;
    unsigned int y = 38;
    for (int i = 0; i < 1000000 && x != 0; i++) {
        y = x * y;
        x = x + y;
    }
}
Why does this happen every time? I can't understand it: e.g. 88 million instructions take the same execution time as 1 million instructions, even though the GPU doesn't have anywhere near enough execution units to run an 88-million-instruction kernel in the same time as a 1-million-instruction one.
Why does adding a single x != 0 check to the loop increase the execution time that much, while a couple of extra additions inside the for loop do not?
In the original case the loop doesn't do anything: its output isn't kept and the loop result isn't used in any further computation. As noted in the comments above, the compiler is probably just optimizing the loop away.
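One way to defeat this dead-code elimination is to make the loop result observable by writing it to the output buffer. A minimal sketch based on your kernel (the choice of storing x + y at list[get_global_id(0)] is an illustrative assumption; any store of the result will do):

__kernel void scalar_mult_add(__global int * list)
{
    unsigned int x = 38;
    unsigned int y = 38;
    for (int i = 0; i < 1000000; i++) {
        y = x * y;
        x = x + y;
    }
    // Storing the result makes it observable, so the
    // compiler can no longer discard the loop as dead code.
    list[get_global_id(0)] = (int)(x + y);
}

With the store in place, the measured time should scale with the iteration count instead of staying constant.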
Adding x != 0 to the loop condition means that the loop result is "used": you need the previous iteration's result to determine whether you keep iterating. The code is still pointless (no output), but the compiler no longer sees it as dead code, so the loop stays in.

However, note that the Mali-T624 has a SIMD vector unit. Writing a dependent scalar loop like this effectively kills any auto-vectorization in the compiler. I highly recommend using vec4 data types for the computation.
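As a rough sketch of what a vectorized variant could look like (the lane constants, iteration count, and final store are all illustrative assumptions, not a tuned implementation):

__kernel void scalar_mult_add_vec4(__global int4 * list)
{
    // Four independent lanes per work-item: the T624's SIMD
    // unit can process all four lanes in one vector operation.
    uint4 x = (uint4)(38, 39, 40, 41);
    uint4 y = (uint4)(38, 38, 38, 38);
    for (int i = 0; i < 250000; i++) {  // 1/4 the iterations for the same total work
        y = x * y;
        x = x + y;
    }
    // Store the result so the loop isn't optimized away.
    list[get_global_id(0)] = convert_int4(x + y);
}

The key point is that each of the four lanes is an independent dependency chain, so the compiler can map them onto the vector unit instead of serializing a single scalar chain.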
If you want some basic static analysis, have a look at the Mali Offline Compiler, which is a free download as part of Arm Mobile Studio. Note that compiling OpenCL kernels requires macOS or Linux, but if you are on Windows you can run the Linux binary under WSL.