cuBLAS element-wise multiplication

Suppose I have two integer arrays in device memory (CUDA C code).

Example:

x = [1, 2, 4, 8, 16, 32]
y = [2, 5, 10, 20, 40, 50]

I want to do element-wise multiplication using cuBLAS.

I tried this and it works, but I don't think this is the intended way to use cuBLAS:

// one cuBLAS call per element: scale the single element y[i] by alpha = x[i]
// (x is in device memory, so &x[i] as alpha requires CUBLAS_POINTER_MODE_DEVICE)
for (int i = 0; i < n; i++) {
    cublasSscal(handle, 1, &x[i], &y[i], 1);
}

The result is then saved in y: y = [2, 10, 40, 160, 640, 1600]

Can I do the above multiplication in cuBLAS without using a for loop?

Thanks.

I expect to avoid the for loop.

1 Answer

Robert Crovella:

Suppose I have two integer arrays

Note that cublas doesn't have any options for handling integer data in most cases (except for certain gemm operations that tap into tensor cores, but those only support 8-bit integers or smaller). If you must use integer data, I would recommend the other approaches below, such as writing your own kernel or using thrust.

(I'm just copying my answer from here.)

For floating point data, it’s possible to use the CUBLAS dgmm function to do a vector elementwise multiply:

$ cat t2268.cu
#include <cublas_v2.h>
#include <iostream>

int main(){

  const int ds = 32;

  float *d_a, *d_b, *d_c;
  cudaMalloc(&d_a, sizeof(d_a[0])*ds);
  cudaMalloc(&d_b, sizeof(d_b[0])*ds);
  cudaMalloc(&d_c, sizeof(d_c[0])*ds);
  float *h = new float[ds];
  for (int i = 0; i < ds; i++) h[i] = i+1;
  cudaMemcpy(d_a, h, sizeof(d_a[0])*ds, cudaMemcpyHostToDevice);
  for (int i = 0; i < ds; i++) h[i] = 2;
  cudaMemcpy(d_b, h, sizeof(d_b[0])*ds, cudaMemcpyHostToDevice);
  cublasHandle_t hd;
  cublasStatus_t stat = cublasCreate(&hd);
  // with CUBLAS_SIDE_LEFT, dgmm computes C = diag(x) * A; here A = d_a is
  // treated as an m x 1 matrix and x = d_b is the diagonal vector, so
  // d_c[i] = d_b[i] * d_a[i]
  cublasSideMode_t mode = CUBLAS_SIDE_LEFT;
  int m = ds;    // rows of A and C
  int n = 1;     // columns of A and C
  int lda = ds;  // leading dimension of A
  int incx = 1;  // stride through the diagonal vector d_b
  int ldc = ds;  // leading dimension of C
  stat = cublasSdgmm(hd, mode, m, n, d_a, lda, d_b, incx, d_c, ldc);
  std::cout << (int)stat << std::endl;
  cudaError_t err = cudaMemcpy(h, d_c, sizeof(d_c[0])*ds, cudaMemcpyDeviceToHost);
  std::cout << cudaGetErrorString(err) << std::endl;
  for (int i = 0; i < ds; i++) std::cout << h[i] << std::endl;
}
$ nvcc -o t2268 t2268.cu -lcublas
$ ./t2268
0
no error
2
4
6
8
10
12
14
16
18
20
22
24
26
28
30
32
34
36
38
40
42
44
46
48
50
52
54
56
58
60
62
64
$

However, it's trivial to write a CUDA kernel to perform this task (it would be a trivial modification to the CUDA vectorAdd sample code), and I expect it would be faster than the above approach.
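
As a rough illustration (my own sketch, not part of the original answer; the kernel name and launch configuration are arbitrary choices), such a kernel applied to the integer data from the question might look like this:

#include <cstdio>

// z[i] = x[i] * y[i]; note this handles int directly, unlike the cuBLAS route
__global__ void elementwiseMul(const int *x, const int *y, int *z, int n){
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) z[i] = x[i] * y[i];
}

int main(){
  const int n = 6;
  int hx[n] = {1, 2, 4, 8, 16, 32};
  int hy[n] = {2, 5, 10, 20, 40, 50};
  int *dx, *dy, *dz;
  cudaMalloc(&dx, n*sizeof(int));
  cudaMalloc(&dy, n*sizeof(int));
  cudaMalloc(&dz, n*sizeof(int));
  cudaMemcpy(dx, hx, n*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(dy, hy, n*sizeof(int), cudaMemcpyHostToDevice);
  elementwiseMul<<<(n+255)/256, 256>>>(dx, dy, dz, n);
  int hz[n];
  cudaMemcpy(hz, dz, n*sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++) printf("%d\n", hz[i]); // expect 2 10 40 160 640 1600
}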

Also see here for a thrust (and dgmm) suggestion.
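
For the thrust route, a minimal sketch (my own, not copied from the linked answer) would use thrust::transform with the thrust::multiplies functor, writing the product back into y as in the question:

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <iostream>

int main(){
  int hx[] = {1, 2, 4, 8, 16, 32};
  int hy[] = {2, 5, 10, 20, 40, 50};
  thrust::device_vector<int> x(hx, hx+6);
  thrust::device_vector<int> y(hy, hy+6);
  // y[i] = x[i] * y[i], computed on the device
  thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                    thrust::multiplies<int>());
  for (int i = 0; i < 6; i++) std::cout << y[i] << std::endl;
}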

It looks like it could probably be done with sbmv also.
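
Since the answer only speculates about sbmv, here is one way it might look (my own untested sketch): with bandwidth k = 0 the symmetric banded matrix degenerates to a diagonal matrix, and its banded storage (lda = k + 1 = 1) is just the diagonal vector itself. Reusing the handle and device pointers from the dgmm example above:

  float alpha = 1.0f, beta = 0.0f;
  // d_c = alpha * diag(d_b) * d_a + beta * d_c, i.e. d_c[i] = d_b[i] * d_a[i]
  cublasSsbmv(hd, CUBLAS_FILL_MODE_LOWER, ds, 0, &alpha,
              d_b, 1,         // banded storage of diag(d_b): lda = k + 1 = 1
              d_a, 1,         // input vector x, stride 1
              &beta, d_c, 1); // output vector y, stride 1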

This operation (regardless of the approach used above) can be directly extended to a matrix-matrix elementwise product, simply by treating the matrices as vectors, and may in some settings be referred to as a Hadamard product.
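
As a hypothetical fragment (the matrix names and the dimensions R and C are my own, not from the answer), the dgmm call above extends to two contiguous column-major R x C float matrices d_A and d_B like this:

  int len = R * C;
  // d_C[i] = d_B[i] * d_A[i] over all R*C elements: the Hadamard product
  cublasSdgmm(handle, CUBLAS_SIDE_LEFT, len, 1, d_A, len, d_B, 1, d_C, len);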