I defined a custom 128-bit struct like this:
typedef struct dtype {
    int val;
    int temp2;
    int temp3;
    int temp4;
} dtype;
Then I performed an assignment:
dtype temp = h_a[i]; // where h_a is a dtype *
I was expecting a 128-bit load, but the PTX instead shows what appears to be a 32-bit load:
mul.wide.s32 %rd4, %r18, 16;
add.s64 %rd5, %rd1, %rd4;
ld.global.u32 %r17, [%rd5];
Shouldn't it appear as something like ld.global.v4.u32 %r17, [%rd5]; instead?
Where am I going wrong?
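The compiler will only emit vectorized load or store instructions if the memory is guaranteed to be aligned to the size of the type, and all the elements of the type are used (otherwise the vector instruction will be optimized away to a scalar instruction to save bandwidth). A plain struct of four int members is only guaranteed 4-byte alignment, so the compiler cannot assume that a dtype * points to a 16-byte-aligned address.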
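The alignment condition can be checked on the host side. The following is a minimal sketch and not part of the original answer: `adtype` and `is_16_byte_aligned` are hypothetical names, and standard C++ `alignas(16)` is assumed here in place of CUDA's `__align__(16)` so that the snippet also compiles without nvcc.

```cpp
#include <cstdint>

// The struct from the question: four ints, so its alignment is that of int.
struct dtype {
    int val;
    int temp2;
    int temp3;
    int temp4;
};

// Hypothetical aligned variant: same members, but every object
// is required to start on a 16-byte boundary.
struct alignas(16) adtype {
    int val;
    int temp2;
    int temp3;
    int temp4;
};

static_assert(sizeof(dtype) == sizeof(adtype), "both types hold the same four ints");
static_assert(alignof(dtype) == alignof(int), "plain struct inherits the alignment of int");
static_assert(alignof(adtype) == 16, "aligned struct is guaranteed 16-byte alignment");

// Runtime check that an address is suitable for a 16-byte (128-bit) access.
inline bool is_16_byte_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

The static assertions show the difference the compiler sees: both types are the same size, but only the second carries an alignment guarantee that matches a 128-bit access.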
If you do this:
you should get something like this:
Here you can clearly see the vectorized load for the aligned type, and the non-vectorized store for the non-aligned type. If the kernel is changed so that the store is to the aligned version:
you will get this:
Now the aligned type is stored with a vectorized instruction.
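A minimal sketch of a kernel of the kind described above, assuming the hypothetical names `adtype`, `copy_shuffled`, and `n`; it is an illustration and not the original listing. Every member of the loaded value is used, so the compiler is free to vectorize the load from the aligned type. Whether it actually emits `ld.global.v4.u32` depends on the toolchain and target, so it is worth inspecting the generated PTX (for example with `nvcc -ptx`) rather than relying on this sketch.

```cuda
// Plain struct from the question: only 4-byte alignment is guaranteed.
struct dtype {
    int val;
    int temp2;
    int temp3;
    int temp4;
};

// Hypothetical 16-byte-aligned variant (name chosen for illustration).
struct __align__(16) adtype {
    int val;
    int temp2;
    int temp3;
    int temp4;
};

// Loads one aligned element, uses all four members, and stores the
// result to the non-aligned type.
__global__ void copy_shuffled(const adtype* in, dtype* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    adtype a = in[i];   // 16-byte alignment known: candidate for one 128-bit load

    dtype d;
    d.val   = a.temp4;  // every member of `a` is read, so none of the
    d.temp2 = a.temp3;  // loaded words can be optimized away
    d.temp3 = a.val;
    d.temp4 = a.temp2;

    out[i] = d;         // only 4-byte alignment known: expect scalar stores
}
```

Changing the type of `out` to `adtype*` (and `d` to `adtype`) gives the compiler the same alignment guarantee on the store side.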
[ All code compiled for sm_53 using the default Godbolt toolchain (10.2) ]