I have a sparse matrix structure that I am using in conjunction with CUBLAS to implement a linear solver class. I anticipate that the dimensions of the sparse matrices I will be solving will be fairly large (on the order of 10^7 by 10^7). I also anticipate that the solver will need to be used many times, and that a portion of this matrix will need to be updated several times (between computing solutions) as well.
Copying the entire matrix structure from system memory to device memory could become quite a performance bottleneck, since only a fraction of the matrix entries will ever need to be changed at a given time.
What I would like is a way to update only a particular subset / sub-matrix, rather than recopying the entire matrix structure from system memory to device memory each time I need to change the matrix.
The matrix data structure would reside on the CUDA device in the arrays d_col, d_row, and d_val.
On the system side I would have corresponding arrays I, J, and val.
So ideally, I would only want to change the subsets of d_val that correspond to the values in the system array, val, that changed.
Note that I do not anticipate that any entries will be added to or removed from the matrix, only that existing entries will change in value.
Naively, I would think that to implement this I would keep an integer array or vector on the host side, e.g. updateInds, that tracks the indices of entries in val that have changed, but I'm not sure how to efficiently tell the CUDA device to update the corresponding values of d_val.
In essence: how do I change the entries in a CUDA device-side array (d_val) at indices updateInds[1], updateInds[2], ..., updateInds[n] to a new set of values val[updateInds[1]], val[updateInds[2]], ..., val[updateInds[n]], without recopying the entire val array from system memory into the CUDA device memory array d_val?
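For concreteness, the best I can come up with is something like the following (a hypothetical sketch of my naive approach), which issues one tiny transfer per changed entry and seems like it would be dominated by per-call overhead:

```
// Naive sketch (my own illustration): one cudaMemcpy per changed index.
// n is the number of changed entries recorded in updateInds.
for (int k = 0; k < n; k++) {
    int idx = updateInds[k];
    cudaMemcpy(d_val + idx, val + idx, sizeof(float),
               cudaMemcpyHostToDevice);
}
```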
As long as you only want to change the numerical values of the value array associated with a CSR (or CSC, or COO) sparse matrix representation, the process is not complicated.
Suppose I have setup code along the following lines (a sketch of what the CUDA conjugate gradient sample does; N is the matrix dimension and nz the number of nonzeros, with error checking omitted):
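```
// Device-side CSR arrays, matching the names in the question.
int   *d_col, *d_row;
float *d_val;

cudaMalloc((void **)&d_col, nz * sizeof(int));
cudaMalloc((void **)&d_row, (N + 1) * sizeof(int));
cudaMalloc((void **)&d_val, nz * sizeof(float));

// One-time full copy of the host CSR data. Here I is assumed to hold
// the row offsets and J the column indices, as in the sample.
cudaMemcpy(d_row, I,   (N + 1) * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_col, J,   nz * sizeof(int),      cudaMemcpyHostToDevice);
cudaMemcpy(d_val, val, nz * sizeof(float),    cudaMemcpyHostToDevice);
```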
Now, subsequent to this point in the code, let's suppose I need to change some values in the d_val array, corresponding to changes I have made in val; say, one contiguous run of entries (the specific indices below are arbitrary, for illustration):
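```
// Host-side change: entries 10 through 24 of val get new values.
for (int i = 10; i < 25; i++)
    val[i] = 4.0f;
```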
The process to move these particular changes is conceptually the same as if you were updating an array using memcpy, but we will use cudaMemcpy to update the d_val array on the device. For the contiguous run above, that might look like:
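```
// Copy only the 15 changed elements (indices 10..24) to the device.
// The pointer arithmetic skips the first 10 elements on both sides.
cudaMemcpy(d_val + 10, val + 10, 15 * sizeof(float),
           cudaMemcpyHostToDevice);
```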
Since these values were all contiguous, I can use a single cudaMemcpy call to effect the transfer.

If I have several disjoint regions similar to the above, it will require several calls to cudaMemcpy, one for each region. If, by chance, the regions are equally spaced and of equal length, for example:
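```
// Illustrative pattern: every 10th group of 5 entries changes.
for (int i = 10; i < 15; i++) val[i] = 1.0f;
for (int i = 20; i < 25; i++) val[i] = 2.0f;
for (int i = 30; i < 35; i++) val[i] = 3.0f;
```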
then it would also be possible to perform this transfer using a single call to cudaMemcpy2D. The method is outlined here.
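The trick is to treat each region as one row of a pitched 2D copy. A sketch for the pattern above, assuming 3 regions of 5 elements spaced 10 elements apart, starting at index 10:

```
// Each 5-element region is one "row"; the pitch is the 10-element
// spacing between region starts. Width and pitch are given in bytes.
cudaMemcpy2D(d_val + 10, 10 * sizeof(float),   // dst, dst pitch
             val + 10,   10 * sizeof(float),   // src, src pitch
             5 * sizeof(float),                // width of each region
             3,                                // number of regions
             cudaMemcpyHostToDevice);
```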
Notes:

1. cudaMemcpy2D is slower than you might expect compared to a cudaMemcpy operation on the same number of elements.
2. If a large fraction of the values has changed, it may be simpler and faster to update the entire d_val array, taking advantage of the fact that this can be done using a single cudaMemcpy operation.
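To connect this back to the updateInds bookkeeping in the question: one workable scheme is to keep updateInds sorted, coalesce consecutive indices into contiguous runs on the host, and issue one cudaMemcpy per run. A minimal sketch (updateDeviceVals is a hypothetical helper, not a library function):

```
#include <cuda_runtime.h>

// Hypothetical helper: pushes changed entries of val to d_val, one
// cudaMemcpy per contiguous run. Assumes updateInds is sorted
// ascending with no duplicates; n is the number of changed entries.
void updateDeviceVals(float *d_val, const float *val,
                      const int *updateInds, int n)
{
    int k = 0;
    while (k < n) {
        int start = updateInds[k];
        int len = 1;
        // Extend the run while the indices remain consecutive.
        while (k + len < n && updateInds[k + len] == start + len)
            len++;
        cudaMemcpy(d_val + start, val + start, len * sizeof(float),
                   cudaMemcpyHostToDevice);
        k += len;
    }
}
```

Per note 2, if the runs turn out to be short and numerous, the per-call overhead can dominate, and re-sending the whole d_val array in one cudaMemcpy may win.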