Is there a BLAS implementation using Cilk Plus array notation?


To my surprise, I have not been able to find any implementation of BLAS based on Cilk Plus array notation. This is strange, because Cilk Plus should deliver (more than) decent performance on today's multicore workstation CPUs, combined with a very expressive and compact representation of the BLAS algorithms. It is even stranger considering that BLAS/LAPACK is the de facto standard for dense matrix computations (at least as a specification).
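
To make the compactness claim concrete: in array notation, a level-1 routine such as daxpy collapses to a single array-section statement. The sketch below uses the reference BLAS signature but assumes positive increments, which the reference does not require:

void daxpy(int n, double a, const double *x, int incx, double *y, int incy) {
    // y := a*x + y in one Cilk Plus array section: ptr[start : length : stride]
    y[0:n:incy] += a * x[0:n:incx];
}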

I understand that there are other, more recent and sophisticated libraries that try to improve on or extend BLAS/LAPACK (for example, I have looked at Eigen and FLENS), but it would still be nice to have a Cilk Plus version of the "standard" BLAS implementation.

Is this due to the very limited adoption of Cilk Plus?


There are 2 answers below.

Answer by Arch D. Robison:

http://parallelbook.com/downloads has Cilk Plus code (see "CODE EXAMPLES FROM BOOK") for a few BLAS operations in a Cholesky decomposition example: gemm, potrf, syrk, and trsm. The routines are templates, so they work for any precision.

On the plus side, the Cilk Plus versions give you good composition properties, i.e. you can use them in separate parts of a spawn tree without worry. On the negative side, if you don't need the clean composition, it's hard to compete with highly tuned parallel BLAS libraries, because the Cilk Plus algorithms tend to be cache-oblivious, whereas the highly tuned libraries can exploit cache awareness. For example, a cache-aware algorithm can carefully schedule multiple threads on the same core to work on the same blocks, and thus save memory-fetch overhead. It's a lot of work to get the cache awareness right for each machine, but BLAS authors are willing to do that work.
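
To make the cache-oblivious shape concrete, a divide-and-conquer gemm in Cilk Plus might look like the sketch below. It is only a sketch, not the book's code: it assumes square matrices whose size n is a power of two, stored row-major with leading dimension ld, and the base-case cutoff of 64 is an arbitrary untuned constant.

#include <cilk/cilk.h>

// Sketch: C += A * B by recursive quartering (cache-oblivious).
void mm(const double *A, const double *B, double *C, int n, int ld) {
    if (n <= 64) {                         // small serial base case
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                C[i*ld : n] += A[i*ld + k] * B[k*ld : n];  // array notation
        return;
    }
    int h = n / 2;                         // quadrant offsets
    const double *A00 = A, *A01 = A + h, *A10 = A + h*ld, *A11 = A + h*ld + h;
    const double *B00 = B, *B01 = B + h, *B10 = B + h*ld, *B11 = B + h*ld + h;
    double       *C00 = C, *C01 = C + h, *C10 = C + h*ld, *C11 = C + h*ld + h;
    // First four quadrant products run in parallel.
    cilk_spawn mm(A00, B00, C00, h, ld);
    cilk_spawn mm(A00, B01, C01, h, ld);
    cilk_spawn mm(A10, B00, C10, h, ld);
               mm(A10, B01, C11, h, ld);
    cilk_sync;
    // The second four accumulate into the same C quadrants,
    // so they must wait for the first round to finish.
    cilk_spawn mm(A01, B10, C00, h, ld);
    cilk_spawn mm(A01, B11, C01, h, ld);
    cilk_spawn mm(A11, B10, C10, h, ld);
               mm(A11, B11, C11, h, ld);
    cilk_sync;
}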

It's exactly the cache awareness ("I own the whole machine" programming) that thwarts clean composition, so you can't have both.

For some BLAS operations, the fork-join structure of Cilk Plus also seems to limit performance compared to less structured parallelism. See slide 2 of http://www.netlib.org/utk/people/JackDongarra/WEB-PAGES/cscads-libtune-09/talk17-knobe.pdf for some examples.

Answer by maurizio nardò:

Taking gemm as the example: in the end, the parallel routine in that code just calls a BLAS routine (sgemm, dgemm, etc.). That might be the Netlib reference, ATLAS, OpenBLAS, or MKL, but which one is opaque in the suggested citation. I was asking whether a Cilk Plus implementation of the reference routines exists, e.g. something like

// Sketch of C += A * B: cilk_for parallelizes over the rows of C,
// and ROW(M, i) is assumed to expand to a Cilk Plus array section,
// so each row update is a single vector operation in array notation.
void dgemm(MATRIX &A, MATRIX &B, MATRIX &C) {
    #pragma cilk grainsize = 64
    cilk_for (int i = 1; i <= A.rows; i++) {
        const double *x = &A(i, 1);        // walk across row i of A
        for (int j = 1; j <= A.cols; j++, x += A.colstride)
            ROW(C, i) += (*x) * ROW(B, j); // C(i, :) += A(i, j) * B(j, :)
    }
}
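
MATRIX and ROW are left abstract above; one plausible set of definitions consistent with the sketch (my own assumption, not taken from any existing library) would be:

struct MATRIX {
    double *data;              // 1-based indexing in the sketch above
    int rows, cols;
    int rowstride, colstride;  // element distance between rows / columns
    double &operator()(int i, int j) {
        return data[(i - 1) * rowstride + (j - 1) * colstride];
    }
};

// Row i of M as a Cilk Plus array section of length M.cols:
#define ROW(M, i) ((M).data)[(i - 1) * (M).rowstride : (M).cols : (M).colstride]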