OpenCL vector explicit conversion to scalar sequence


I have been working on an OpenCL program for quite some time and I am stuck on an intriguing point. I should say that, although I am an enthusiast, I don't have a deep background in coding/programming.

The program's bottleneck, as reported by CodeXL, is heavy VGPR usage. Thus, I am rewriting the code to rely on scalar operations as much as possible:

[Image: CodeXL statistics printout]

I have successfully converted a few lines of code from vector-based operations to scalar ones. For example:

    ((uint8 *)VectorArray)[0] = ((__global uint8 *)InputData)[0];

became, in scalar form:

    s0 = ((__global uint *)InputData)[0];
    s1 = ((__global uint *)InputData)[1];
    s2 = ((__global uint *)InputData)[2];
    s3 = ((__global uint *)InputData)[3];
    s4 = ((__global uint *)InputData)[4];
    s5 = ((__global uint *)InputData)[5];
    s6 = ((__global uint *)InputData)[6];
    s7 = ((__global uint *)InputData)[7];

s0-s7 are then used in a number of further operations. That worked seamlessly and the project's performance improved by a notch. I know I could have accessed VectorArray.s0 directly, but "s0" is a local variable with multiple read/write accesses.

Finally, the part I am stuck at is:

    ((uint4*)VectorArray)[4] = ((__global uint4*)InputData)[4];

Following the same logic as above, the scalar load operations would be:

    s4 = ((__global uint *)InputData)[4];
    s5 = ((__global uint *)InputData)[5];
    s6 = ((__global uint *)InputData)[6];
    s7 = ((__global uint *)InputData)[7];

This, however, fails miserably. I have tried more than 40 combinations and it seems I am missing something. Radeon GPU Analyzer shows that the ISA is an s_load_dwordx4 operation with operands s[4:7], s[6:7], 0x40. I am assuming 0x40 refers to an offset of 64 bits, and thus the position should be offset by the same amount. However, one of the trials I have already done was to use s5 = ((__global uint *)InputData)[4]; which was one of the failures.

The ISA documentation is very sparse on this matter and I am pretty much in the dark.

Any hints or comments? Much appreciated.

Thank you. Ed.

There is 1 best solution below


As far as I remember, the "scalar" registers in AMD GPUs are used for data that's shared between threads. So that's typically things like loop counters, etc., where the compiler can guarantee that the values will be identical in lockstep across threads.

If you're seeing too much vector register pressure, this usually means you are using too much private "memory". For example, you may have arrays declared as private, or long-lived variables whose "memory" (vector registers) can't be reused for other variables.

I recommend reading AMD's optimisation guide; from what I remember, it goes into some detail on the differences between vector and scalar registers, and what they're used for. The type of code transformation you're suggesting is typically not productive, as you've found. You don't explain what your code does or what all those vector registers are being used for, but depending on the use case, you may want to consider moving larger, longer-lived arrays from private to local memory (bearing in mind that local memory is shared between the work-items of a work-group).
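
Separately from register pressure, one detail in the question's indexing may be worth checking. An index applied through a `uint4` pointer counts in units of four `uint`s, so `((__global uint4*)InputData)[4]` would cover `uint` elements 16 to 19 rather than 4 to 7. That is byte offset 64, which would match the 0x40 in the ISA dump if that operand is a byte offset. The host-side C sketch below only illustrates this arithmetic; the helper names `vec_first_scalar`, `vec_byte_offset` and `load_vec_as_scalars` are hypothetical and not part of OpenCL.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers (not part of OpenCL): map an index into a buffer
 * viewed as N-wide uint vectors (uint4 -> width 4, uint8 -> width 8)
 * onto the equivalent index or byte offset in a plain uint view. */

/* First scalar uint index covered by vector element `vec_index`. */
static size_t vec_first_scalar(size_t vec_index, size_t width)
{
    return vec_index * width;
}

/* Byte offset of vector element `vec_index` from the start of the buffer. */
static size_t vec_byte_offset(size_t vec_index, size_t width)
{
    return vec_index * width * sizeof(uint32_t);
}

/* Copy one vector element into `out` using scalar loads only. */
static void load_vec_as_scalars(const uint32_t *input, size_t vec_index,
                                size_t width, uint32_t *out)
{
    size_t base = vec_first_scalar(vec_index, width);
    for (size_t i = 0; i < width; ++i)
        out[i] = input[base + i];
}
```

Under that reading, the scalar form of the second load would run from `((__global uint *)InputData)[16]` to `((__global uint *)InputData)[19]`. Whether that resolves the failure depends on the rest of the kernel, which the question does not show.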