Ways to accelerate reduce operation on Xeon CPU, GPU and Xeon Phi

232 Views Asked by hrs At 22 July 2014 at 16:21

I have an application in which reduce operations (such as sum and max) on a large matrix are the bottleneck. I need to make them as fast as possible. Does MKL offer vectorized routines for this?

Is there a special hardware unit for this on Xeon CPUs, GPUs, or MIC?

How are reduce operations generally implemented on this hardware?


There are 3 best solutions below

hrs On 23 July 2014 at 09:54

It turns out that none of this hardware has a built-in reduction circuit. I had imagined sixteen 17-bit adders attached to a 128-bit vector register for a reduce-sum operation. Perhaps this is because no one has hit a significant bottleneck with reduce operations. The best solution I have found so far is OpenMP's #pragma omp parallel for reduction, though I have yet to test its performance.
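For reference, a minimal sketch of what that OpenMP approach could look like in C. The function names matrix_sum and matrix_max are made up for illustration, the matrix is assumed to be stored contiguously in row-major order, and the max reduction assumes a compiler supporting OpenMP 3.1 or later. Performance is untested here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sum of a rows x cols matrix stored contiguously.
   Each thread accumulates a private partial sum; OpenMP combines the
   partial sums when the loop ends. */
double matrix_sum(const double *a, size_t rows, size_t cols)
{
    assert(a != NULL || rows * cols == 0);
    long n = (long)(rows * cols);
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

/* Hypothetical helper: maximum element of a non-empty matrix.
   reduction(max:...) for C needs OpenMP 3.1 or later. */
double matrix_max(const double *a, size_t rows, size_t cols)
{
    assert(a != NULL && rows * cols > 0);
    long n = (long)(rows * cols);
    double best = a[0];
    #pragma omp parallel for reduction(max:best)
    for (long i = 1; i < n; ++i)
        if (a[i] > best)
            best = a[i];
    return best;
}
```

Built without an OpenMP flag (for example without -fopenmp on GCC), the pragmas are ignored and the loops simply run serially, so the results stay the same.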

amckinley On 24 July 2014 at 12:46

You can implement your own simple reductions using the KNC vpermd and vpermf32x4 instructions, together with the swizzle modifiers, to do cross-lane operations inside the vector units.

The equivalent C intrinsics are the _mm512_{mask_}permute* and _mm512_{mask_}swizzle* families.
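To illustrate the idea without tying it to a specific instruction set, here is a scalar model in plain C of a cross-lane sum over 16 single-precision lanes. It is only a sketch: the names horizontal_sum16 and array_sum are invented for this example, and the inner loops stand in for what a permute or swizzle followed by a vector add would do on real hardware. No KNC intrinsics are used.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 16  /* a 512-bit register holds 16 single-precision floats */

/* Scalar model of a horizontal sum. At each step, lane i adds the value
   held by lane i + stride, which is the role a cross-lane permute plus a
   vector add would play. After log2(16) = 4 steps, lane 0 holds the total. */
float horizontal_sum16(const float v[LANES])
{
    float lane[LANES];
    for (int i = 0; i < LANES; ++i)
        lane[i] = v[i];
    for (int stride = LANES / 2; stride > 0; stride /= 2)
        for (int i = 0; i < stride; ++i)
            lane[i] += lane[i + stride];
    return lane[0];
}

/* Reduce a long array: keep 16 running partial sums (one per lane),
   do a single horizontal step at the end, then add the leftover tail. */
float array_sum(const float *a, size_t n)
{
    assert(a != NULL || n == 0);
    float acc[LANES] = {0};
    size_t i = 0;
    for (; i + LANES <= n; i += LANES)
        for (int l = 0; l < LANES; ++l)
            acc[l] += a[i + l];
    float total = horizontal_sum16(acc);
    for (; i < n; ++i)
        total += a[i];
    return total;
}
```

The point of the structure is that the expensive cross-lane step happens once per array rather than once per vector; the main loop only does lane-wise adds.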

However, I recommend that you first look at the array notation reduce operations, which already have high-performance implementations on the MIC.

Look at the reduction operations available here, and also check out this video by Taylor Kidd from Intel, who talks about array notation reductions on the Xeon Phi starting at 20 min 30 s.

EDIT: I noticed you are also looking for CPU-based solutions. The array notation reductions will work very well on the Xeon too.

Jeff Hammond On 01 September 2014 at 00:02

This operation is going to be bandwidth-limited, so vectorization almost certainly doesn't matter: a reduction does only about one arithmetic operation per element loaded from memory. You want the hardware with the most memory bandwidth. An Intel Xeon Phi processor has more aggregate bandwidth (but not more bandwidth per core) than a Xeon processor.
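A rough way to see the bandwidth bound is to compute the minimum time needed just to read the data once. The sketch below does that arithmetic in C; the function name min_reduce_seconds is hypothetical, and the bandwidth you pass in is whatever sustained figure you assume or measure for your own machine, not a number taken from this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: lower bound on the time to reduce n elements when
   every element has to be read once from main memory.
   time >= bytes_read / sustained_bandwidth */
double min_reduce_seconds(size_t n, size_t elem_bytes,
                          double bandwidth_bytes_per_s)
{
    assert(bandwidth_bytes_per_s > 0.0);
    return (double)n * (double)elem_bytes / bandwidth_bytes_per_s;
}
```

As an illustrative figure only: with an assumed 100 GB/s of sustained bandwidth, summing 10^9 doubles (8 GB) cannot take less than about 0.08 s, however well the arithmetic is vectorized.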
