How to change sub-matrix of a sparse matrix on CUDA device

256 Views Asked by At

I have a sparse matrix structure that I am using in conjunction with CUBLAS to implement a linear solver class. I anticipate that the dimensions of the sparse matrices I will be solving will be fairly large (on the order of 10^7 by 10^7). I will also anticipate that the solver will need to be used many times and that a portion of this matrix will need be updated several times (between computing solutions) as well.

Copying the entire matrix sturcture from system memory to device memory could become quite a performance bottle neck since only a fraction of the matrix entries will ever need to be changed at a given time.

What I would like to be able to do is to have a way to update only a particular sub-set / sub-matrix rather than recopy the entire matrix structure from system memory to device memory each time I need to change the matrix.

The matrix data structure would reside on the CUDA device in arrays: d_col, d_row, and d_val

On the system side I would have corresponding arrays I, J, and val.

So ideally, I would only want to change the subsets of d_val that correspond to the values in the system array, val, that changed.

Note that I do not anticipate that any entries will be added to or removed from the matrix, only that existing entries will change in value.

Naively I would think that to implement this, I would have an integer array or vector on the host side, e.g. updateInds , that would track the indices of entries in val that have changed, but I'm not sure how to efficiently tell the CUDA device to update the corresponding values of d_val.

In essence: how do I change the entries in a CUDA device side array (d_val) at indicies updateInds[1],updateInds[2],...,updateInds[n] to a new set of values val[updatInds[1]], val[updateInds[2]], ..., val[updateInds[3]], with out recopying the entire val array from system memory into CUDA device memory array d_val?

1

There are 1 best solutions below

0
On

As long as you only want to change the numerical values of the value array associated with CSR (or CSC, or COO) sparse matrix representation, the process is not complicated.

Suppose I have code like this (excerpted from the CUDA conjugate gradient sample):

checkCudaErrors(cudaMalloc((void **)&d_val, nz*sizeof(float)));
...
cudaMemcpy(d_val, val, nz*sizeof(float), cudaMemcpyHostToDevice);

Now, subsequent to this point in the code, let's suppose I need to change some values in the d_val array, corresponding to changes I have made in val:

for (int i = 10; i < 25; i++)
  val[i] = 4.0f;

The process to move these particular changes is conceptually the same as if you were updating an array using memcpy, but we will use cudaMemcpy to update the d_val array on the device:

cudaMemcpy(d_val+10, val+10, 15*sizeof(float), cudaMempcyHostToDevice);

Since these values were all contiguous, I can use a single cudaMemcpy call to effect the transfer.

If I have several disjoint regions similar to above, it will require several calls to cudaMemcpy, one for each region. If, by chance, the regions are equally spaced and of equal length:

for (int i = 10; i < 5; i++)
  val[i] = 1.0f;
for (int i = 20; i < 5; i++)
  val[i] = 2.0f;
for (int i = 30; i < 5; i++)
  val[i] = 4.0f;

then it would also be possible to perform this transfer using a single call to cudaMemcpy2D. The method is outlined here.

Notes:

  1. cudaMemcpy2D is slower than you might expect compared to a cudaMemcpy operation on the same number of elements.
  2. CUDA API calls have some inherent overhead. If a large part of the matrix is to be updated in a scattered fashion, it may still be actually quicker to just transfer the whole d_val array, taking advantage of the fact that this can be done using a single cudaMemcpy operation.
  3. The method described here cannot be used if non-zero values change their location in the sparse matrix. In that case, I cannot provide a general answer for how to surgically update a CSR sparse matrix on the device. And certain relatively simple changes could necessitate updating most of the array data (3 vectors) anyway.