How do I perform a threaded sparse matrix - vector multiplication using MKL?

1.1k Views Asked by At

I need to perform a matrix vector multiplication, where the matrix is complex, symmetric and has four off-diagonal non-zero bands. So far I am using the sparse BLAS routine mkl_zdiasymv to perform the multiplication and it works fine on one core. I would like to try if I can gain a performance boost by using multi-threading (e.g. openMP). As far as I have understood some (many?) of the MKL routines are threaded. However if I use mkl_set_num_threads(4) my program still runs on one single thread.

To give a specific example here is a little test program which I compile (using icc 14.01) with:

icc mkl_test_mp.cpp -mkl -std=c++0x -openmp

mkl_test_mp.cpp:

#include <complex>
#include <vector>
#include <iostream>
#include <chrono>

typedef std::complex<double> complex;
using std::vector;
using namespace std::chrono;

#define MKL_Complex16 std::complex<double>
#include "mkl.h"

int vector_dimension = 10000000; 
int number_of_multiplications = 100;

vector<complex> initialize_matrix() {

    complex value_main_diagonal          = complex(1, 2);
    complex value_sub_and_super_diagonal = complex(3, 4);
    complex value_far_off_diagonal       = complex(5, 6);

    std::vector<complex> matrix;
    matrix.resize(1 * vector_dimension, value_main_diagonal);
    matrix.resize(2 * vector_dimension, value_sub_and_super_diagonal);
    matrix.resize(3 * vector_dimension, value_far_off_diagonal);

    return matrix;
}

vector<complex> perform_matrix_vector_calculation(vector<complex>& matrix, const vector<complex>& x) {

    mkl_set_num_threads(4);

    vector<complex> result(vector_dimension);

    char uplo = 'L';   // since the matrix is symmetric we only need to declare one triangular part of the matrix (here the lower one)
    int number_of_nonzero_diagonals = 3;
    vector<int> matrix_diagonal_offsets = {0, -1, -int(sqrt(vector_dimension))};

    complex *x_data = const_cast<complex* >(x.data()); // I do not like this, but mkl expects non const pointer (??)

    mkl_zdiasymv (
            &uplo,
            &vector_dimension,
        matrix.data(),
        &vector_dimension,
        matrix_diagonal_offsets.data(),
        &number_of_nonzero_diagonals,
        x_data,
        result.data()
    );
    return result;
}

void print(vector<complex>& x) {
  for(complex z : x)
    std::cerr << z;
  std::cerr << std::endl;
}

void run() {
  vector<complex> matrix = initialize_matrix();
  vector<complex> current_vector(vector_dimension, 1);

  for(int i = 0; i < number_of_multiplications; ++i) {
      current_vector = perform_matrix_vector_calculation(matrix, current_vector);
  }
  std::cerr << current_vector[0] << std::endl;
}

int main() {

  auto start = steady_clock::now();

  run();

  auto end = steady_clock::now();
  std::cerr << "runtime = " << duration<double, std::milli> (end - start).count() << " ms" << std::endl;
  std::cerr << "runtime per multiplication = " << duration<double, std::milli> (end -     start).count()/number_of_multiplications << " ms" << std::endl;
  }

Is it even possible to parallelize this way ? What am I doing wrong ? Are there other suggestions to speed up the multiplication ?

1

There are 1 best solutions below

0
On

Since you are not showing how you compile the code, could you check that you are linking against the multi threaded Intel MKL libs and e.g. pthreads?

For example (this is for an older version of MKL):

THREADING_LIB="$(MKL_PATH)/libmkl_$(IFACE_THREADING_PART)_thread.$(EXT)"
OMP_LIB = -L"$(CMPLR_PATH)" -liomp5

There should be an examples directory in your MKL distribution e.g. intel/composer_xe_2011_sp1.10.319/mkl/examples. In there you can check the contents of spblasc/makefile to see how to correctly link against the multithreaded libs for you particular version of MKL.

Another suggestion that should speed things up is adding compiler optimisation flags e.g.

OPT_FLAGS = -xHost -O3

to allow icc to generate optimised code for your architecture so your line would end up as:

icc mkl_test_mp.cpp -mkl -std=c++0x -openmp -xHost -O3