Is there a way to use "unified memory" (MAGMA) with 2 GPU cards with NVLink and 1TB RAM


At work, on Debian 10, I have 2 RTX A6000 GPU cards connected by an NVLink hardware component and 1 TB of RAM, and I would like to benefit from the combined power of both cards and the 1 TB of RAM.

Currently, I have the following magma.make, invoked by a Makefile:

CXX = nvcc -std=c++17 -O3
LAPACK = /opt/intel/oneapi/mkl/latest
LAPACK_ANOTHER=/opt/intel/mkl/lib/intel64
MAGMA = /usr/local/magma
INCLUDE_CUDA=/usr/local/cuda/include
LIBCUDA=/usr/local/cuda/lib64

SEARCH_DIRS_INCL=-I${MAGMA}/include -I${INCLUDE_CUDA} -I${LAPACK}/include
SEARCH_DIRS_LINK=-L${LAPACK}/lib/intel64 -L${LAPACK_ANOTHER} -L${LIBCUDA} -L${MAGMA}/lib

CXXFLAGS = -c -DMAGMA_ILP64 -DMKL_ILP64 -m64 ${SEARCH_DIRS_INCL}

LDFLAGS = ${SEARCH_DIRS_LINK} -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lcuda -lcudart -lcublas -lmagma -lpthread -lm -ldl 

SOURCES = main_magma.cpp XSAF_C_magma.cpp
EXECUTABLE = main_magma.exe

When I execute my code, I get memory errors, since in this code I try to invert matrices of size 120k x 120k.

If we look closer, a 120k x 120k matrix in double precision requires 120k x 120k x 8 bytes, i.e. about 115 GB (roughly 107 GiB).

The functions involved cannot work in single precision.

Unfortunately, each of my 2 NVIDIA GPU cards has only 48 GB of memory.

Question:

Is there a way, from a computational or coding point of view, to merge the memory of the 2 GPU cards (which would give 96 GB) in order to invert these large matrices?

I am building with MAGMA and using it for the inversion routine, like this:

#include <vector>
#include "magma_v2.h"   // MAGMA v2 interface (queues)
using std::vector;

// ROUTINE MAGMA IMPLEMENTED
void matrix_inverse_magma(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {

  magma_int_t m = F_matrix.size();
  if (m) {
    magma_init();                              // initialize MAGMA
    magma_queue_t queue = NULL;
    magma_int_t dev = 0;
    magma_queue_create(dev, &queue);

    double *dwork;                             // dwork - device workspace for dgetri
    magma_int_t ldwork;                        // size of dwork
    magma_int_t *piv, info;                    // piv - pivot indices from the LU factorization
    magma_int_t mm = m*m;                      // number of entries of a
    double *a;                                 // a   - m x m matrix on the host
    double *d_a;                               // d_a - m x m matrix on the device
    magma_int_t err;

    ldwork = m * magma_get_dgetri_nb(m);       // optimal block size

    // allocate matrices
    err = magma_dmalloc_cpu(&a, mm);           // host memory for a
    for (magma_int_t i = 0; i < m; i++) {
      for (magma_int_t j = 0; j < m; j++) {
        a[i*m + j] = F_matrix[i][j];
      }
    }
    err = magma_dmalloc(&d_a, mm);             // device memory for a
    err = magma_dmalloc(&dwork, ldwork);       // device memory for the workspace
    piv = (magma_int_t *) malloc(m * sizeof(magma_int_t));   // host memory for the pivots

    magma_dsetmatrix(m, m, a, m, d_a, m, queue);   // copy a -> d_a

    magma_dgetrf_gpu(m, m, d_a, m, piv, &info);               // LU factorization on the GPU
    magma_dgetri_gpu(m, d_a, m, piv, dwork, ldwork, &info);   // inversion from the LU factors

    magma_dgetmatrix(m, m, d_a, m, a, m, queue);   // copy d_a -> a

    for (magma_int_t i = 0; i < m; i++) {
      for (magma_int_t j = 0; j < m; j++) {
        F_output[i][j] = a[i*m + j];
      }
    }

    magma_free_cpu(a);            // free host memory
    free(piv);                    // free host memory
    magma_free(dwork);            // free device memory
    magma_free(d_a);              // free device memory
    magma_queue_destroy(queue);   // destroy queue
    magma_finalize();
  }
}

If this is not possible to do directly with the NVLink hardware component between the two GPU cards, what workaround could we find to allow this matrix inversion?

Edit:

I was told by an HPC engineer:

"The easiest way will be to use the Makefiles until we figure out how cmake can support that. If you do that, you can just replace LAPACKE_dgetrf by magma_dgetrf. MAGMA will use internally one GPU with out-of-memory algorithm that fill factor the matrix, even if it is large and does not fir into the memory of the GPU."

Does it mean that I have to find the appropriate Makefile flags to be able to use magma_dgetrf instead of LAPACKE_dgetrf?
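If I understand correctly, on the code side the swap itself would look roughly like the sketch below (my own untested guess, assuming magma_dgetrf keeps the LAPACK-style argument order and works on a matrix kept in host memory):

// Untested sketch of the suggested swap: the matrix A stays in host (CPU)
// memory and MAGMA's CPU-interface getrf moves panels to the GPU internally.
#include <vector>
#include "magma_v2.h"

void factor_on_gpu(std::vector<double> &A, magma_int_t m) {
    magma_init();

    std::vector<magma_int_t> piv(m);
    magma_int_t info = 0;

    // Before (pure CPU):
    //   LAPACKE_dgetrf(LAPACK_COL_MAJOR, m, m, A.data(), m, piv.data());

    // After (GPU, out-of-memory algorithm according to the engineer):
    magma_dgetrf(m, m, A.data(), m, piv.data(), &info);

    magma_finalize();
}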

And regarding the second sentence, which says that

"MAGMA will internally use one GPU with an out-of-memory algorithm that will factor the matrix"

does it mean that, if my matrix is over 48 GB, MAGMA will be able to spill the rest onto the second A6000 GPU or into the RAM and perform the inversion of the full matrix?

Please let me know which flags to use to build MAGMA correctly in my case.

Currently, I do:

$ mkdir build && cd build
$ cmake -DUSE_FORTRAN=ON  \
-DGPU_TARGET=Ampere \
-DLAPACK_LIBRARIES="/opt/intel/oneapi/intelpython/latest/lib/liblapack.so" \
-DMAGMA_ENABLE_CUDA=ON ..
$ cmake --build . --config Release
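If I follow the engineer's advice and use the plain Makefile build instead, I suppose the make.inc would look something like this (my own untested guess, adapted from the make.inc-examples shipped with MAGMA; exact variable names can vary between MAGMA versions):

# Untested guess at a make.inc, adapted from make.inc-examples/make.inc.mkl-gcc
GPU_TARGET = Ampere
CC   = gcc
CXX  = g++
FORT = gfortran
NVCC = nvcc

MKLROOT ?= /opt/intel/oneapi/mkl/latest
CUDADIR ?= /usr/local/cuda

LIB    = -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lstdc++ -lm
LIB   += -lcublas -lcusparse -lcudart
LIBDIR = -L$(MKLROOT)/lib/intel64 -L$(CUDADIR)/lib64
INC    = -I$(MKLROOT)/include -I$(CUDADIR)/include

followed by make and make install prefix=/usr/local/magma (again, only a guess on my side).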

There is 1 answer below:


I am not an expert in GPGPU computation, but I would be very surprised if you could combine two compute devices into a single device. At least, I don't think it's possible using a standard library. If you think about it, it sort of defeats the purpose of using a GPU in the first place.

However, I would say that once you use very large matrices you hit many problems which make a textbook inverse operation numerically unstable. The usual way around this is to never store an inverse matrix at all. Often you only need the inverse matrix in order to solve

Ax = b (solve for x)
Ax - b = 0 (homogeneous form)

which can be solved without computing the inverse of A.
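For example, with a plain CPU LAPACK call (a minimal sketch using LAPACKE_dgesv, shown only to illustrate the idea; MAGMA and other GPU libraries offer analogous solvers), you factor A and solve for x in one step without ever forming the inverse:

// Minimal sketch: solve A x = b directly instead of computing inv(A) * b.
#include <vector>
#include <lapacke.h>

std::vector<double> solve(std::vector<double> A,   // n x n, row-major, overwritten by the LU factors
                          std::vector<double> b,   // right-hand side, overwritten by the solution x
                          lapack_int n) {
    std::vector<lapack_int> piv(n);
    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, 1,
                                    A.data(), n, piv.data(), b.data(), 1);
    if (info != 0) {
        // info > 0: A is singular; info < 0: invalid argument
    }
    return b;
}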

I would suggest that you start by reading the inverse-matrix chapter of Numerical Recipes in C/C++. This is a standard text, with example code, and is widely available from Amazon, etc. These texts assume a CPU implementation, but...

Once you understand these algorithms, you may (or may not) find that being able to issue two parallel non-inverse matrix operations is useful to you. However, the algorithms described in this and other texts are orders of magnitude faster than any brute-force operation anyway.