CUDA SIMD instruction for per-byte multiplication with unsigned saturation

CUDA has a nice set of SIMD instructions for integers that allow efficient SIMD computations. Among them are some that compute addition and subtraction per byte or per half-word (like __vadd4 and __vadd2, respectively). However, I couldn't find a similar function that computes per-byte multiplication for a 32-bit register. I would appreciate it if someone could help me find a proper solution.
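For context, here is how the existing byte-wise intrinsics behave (a minimal sketch; __vadd4 adds corresponding byte lanes with wraparound, and __vaddus4 is its saturating counterpart):

__device__ void vadd4_demo (unsigned int a, unsigned int b,
                            unsigned int *wrap, unsigned int *sat)
{
    *wrap = __vadd4 (a, b);   // per-byte add, truncated modulo 256
    *sat  = __vaddus4 (a, b); // per-byte add, clamped to 255
}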
There are 2 answers below.
There is no existing intrinsic __vmulus4() in CUDA. However, it can be emulated using existing intrinsics: compute the four 16-bit products of the four pairs of bytes, holding them in two 32-bit variables (two products each), then clamp each product to 255 and extract the least significant byte of each product into the final result with the help of the permute operation. The code generated by CUDA 11 for compute capabilities >= 7.0 looks reasonable. Whether the performance is sufficient will depend on the use case; if this operation occurs in the middle of a processing pipeline computing with packed bytes, it should be.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
/* byte-wise multiply with unsigned saturation */
__device__ unsigned int vmulus4 (unsigned int a, unsigned int b)
{
    unsigned int plo, phi, res;
    // compute the four 16-bit products: plo holds the products of bytes 0
    // and 1, phi holds the products of bytes 2 and 3 (taken from the upper
    // half of the 32x32-bit multiplies via __umulhi)
    plo = ((a & 0x000000ff) * (b & 0x000000ff) +
           (a & 0x0000ff00) * (b & 0x0000ff00));
    phi = (__umulhi (a & 0x00ff0000, b & 0x00ff0000) +
           __umulhi (a & 0xff000000, b & 0xff000000));
    // clamp products to 255: where the high byte of a 16-bit product is
    // nonzero, __vcmpne2 yields 0xffff for that halfword, so the OR forces
    // the low byte to 0xff
    plo |= __vcmpne2 (plo & 0xff00ff00, 0x00000000);
    phi |= __vcmpne2 (phi & 0xff00ff00, 0x00000000);
    // gather the least significant byte of each 16-bit product
    res = __byte_perm (plo, phi, 0x6420);
    return res;
}
__global__ void kernel (unsigned int a, unsigned int b, unsigned int *res)
{
    *res = vmulus4 (a, b);
}
// straightforward scalar reference implementation for checking vmulus4()
unsigned int vmulus4_ref (unsigned int a, unsigned int b)
{
    unsigned char a0, a1, a2, a3, b0, b1, b2, b3;
    unsigned int p0, p1, p2, p3;
    a0 = (a >>  0) & 0xff;
    a1 = (a >>  8) & 0xff;
    a2 = (a >> 16) & 0xff;
    a3 = (a >> 24) & 0xff;
    b0 = (b >>  0) & 0xff;
    b1 = (b >>  8) & 0xff;
    b2 = (b >> 16) & 0xff;
    b3 = (b >> 24) & 0xff;
    p0 = (unsigned int)a0 * (unsigned int)b0;
    p1 = (unsigned int)a1 * (unsigned int)b1;
    p2 = (unsigned int)a2 * (unsigned int)b2;
    p3 = (unsigned int)a3 * (unsigned int)b3;
    if (p0 > 255) p0 = 255;
    if (p1 > 255) p1 = 255;
    if (p2 > 255) p2 = 255;
    if (p3 > 255) p3 = 255;
    return (p0 << 0) + (p1 << 8) + (p2 << 16) + (p3 << 24);
}
// George Marsaglia's KISS PRNG, period 2**123. Newsgroup sci.math, 21 Jan 1999
// Bug fix: Greg Rose, "KISS: A Bit Too Simple" http://eprint.iacr.org/2011/007
static uint32_t kiss_z=362436069, kiss_w=521288629;
static uint32_t kiss_jsr=123456789, kiss_jcong=380116160;
#define znew (kiss_z=36969*(kiss_z&65535)+(kiss_z>>16))
#define wnew (kiss_w=18000*(kiss_w&65535)+(kiss_w>>16))
#define MWC ((znew<<16)+wnew )
#define SHR3 (kiss_jsr^=(kiss_jsr<<13),kiss_jsr^=(kiss_jsr>>17), \
kiss_jsr^=(kiss_jsr<<5))
#define CONG (kiss_jcong=69069*kiss_jcong+1234567)
#define KISS ((MWC^CONG)+SHR3)
int main (void)
{
    unsigned int *resD = 0;
    unsigned int a, b, res, ref;
    cudaMalloc ((void**)&resD, sizeof resD[0]);
    for (int i = 0; i < 1000000; i++) {
        a = KISS;
        b = KISS;
        kernel<<<1,1>>>(a, b, resD);
        cudaMemcpy (&res, resD, sizeof res, cudaMemcpyDeviceToHost);
        ref = vmulus4_ref (a, b);
        if (res != ref) {
            printf ("error: a=%08x b=%08x res=%08x ref=%08x\n", a, b, res, ref);
            return EXIT_FAILURE;
        }
    }
    cudaFree (resD);
    return EXIT_SUCCESS;
}
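For reference, the test program can be built and run with something like nvcc -arch=sm_70 -o vmulus4_test vmulus4_test.cu (the file name vmulus4_test.cu is just an example).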
There isn't one that returns the 4 individual products. The closest is the __dp4a() intrinsic, which returns the sum of the 4 products in a 32-bit integer (with unsigned operands, __dp4a(a, b, 0) computes a0*b0 + a1*b1 + a2*b2 + a3*b3). You could write an 8-bit packed unsigned multiply with saturation like this:
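A sketch of that idea (the helper name pmulus4 is just illustrative, not an existing intrinsic): extract each byte lane, form the 16-bit product, clamp it to 255, and repack. Since it uses only plain integer arithmetic, it can also be decorated with __host__:

__host__ __device__ unsigned int pmulus4 (unsigned int a, unsigned int b)
{
    unsigned int res = 0;
    for (int i = 0; i < 4; i++) {
        // extract byte lane i of each operand and form the 16-bit product
        unsigned int p = ((a >> (8 * i)) & 0xff) * ((b >> (8 * i)) & 0xff);
        // unsigned saturation: clamp the product to 255
        if (p > 255) p = 255;
        // place the clamped byte back into lane i
        res |= p << (8 * i);
    }
    return res;
}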
The SASS code appears to be about as I would expect, roughly the same length as the C++ code, ignoring the LDC and STG instructions. FWIW, on Tesla V100 with CUDA 11.4, njuffa's implementation and mine are pretty close in terms of register usage (njuffa: 16, mine: 17) and performance (njuffa's is about 1% faster).
Later: here is a slightly faster routine (by a few percent, on sm_70) compared to my previous one. It has the disadvantage that it uses CUDA half-precision intrinsics, so it is "less portable" than the previous version, and likewise cannot be decorated with __host__.
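The gist of such a half-precision variant is sketched below (an illustration of the approach, not the exact routine measured above; note that __hmin2, used here for the clamp, requires compute capability 8.0, so on sm_70 the clamp would have to be built differently):

#include <cuda_fp16.h>

__device__ unsigned int vmulus4_hp (unsigned int a, unsigned int b)
{
    // widen bytes (0,2) and (1,3) of each operand into half2 lanes;
    // every byte value 0..255 is exactly representable as a half
    __half2 a02 = __halves2half2 (__ushort2half_rn (a & 0xff),
                                  __ushort2half_rn ((a >> 16) & 0xff));
    __half2 a13 = __halves2half2 (__ushort2half_rn ((a >> 8) & 0xff),
                                  __ushort2half_rn ((a >> 24) & 0xff));
    __half2 b02 = __halves2half2 (__ushort2half_rn (b & 0xff),
                                  __ushort2half_rn ((b >> 16) & 0xff));
    __half2 b13 = __halves2half2 (__ushort2half_rn ((b >> 8) & 0xff),
                                  __ushort2half_rn ((b >> 24) & 0xff));
    // two half2 multiplies yield all four products; products <= 255 are
    // exact in half precision, larger ones may round but never drop below
    // 256, so clamping against 255 still saturates correctly
    __half2 c255 = __float2half2_rn (255.0f);
    __half2 p02 = __hmin2 (__hmul2 (a02, b02), c255); // __hmin2: sm_80+
    __half2 p13 = __hmin2 (__hmul2 (a13, b13), c255);
    // convert the clamped lanes back to integers and repack the bytes
    unsigned int r0 = __half2ushort_rz (__low2half  (p02));
    unsigned int r2 = __half2ushort_rz (__high2half (p02));
    unsigned int r1 = __half2ushort_rz (__low2half  (p13));
    unsigned int r3 = __half2ushort_rz (__high2half (p13));
    return r0 | (r1 << 8) | (r2 << 16) | (r3 << 24);
}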