I have an application where reduce operations (like sum and max) on a large matrix are the bottleneck. I need to make them as fast as possible. Are there vector instructions in MKL to do that?
Is there a special hardware unit to deal with this on a Xeon CPU, a GPU, or a MIC?
How are reduce operations implemented on this hardware in general?
It turns out that, as far as I could find, none of this hardware has a built-in reduce operation circuit. I had imagined sixteen 17-bit adders attached to a 128-bit vector register for a reduce-sum operation. Maybe this is because no one has encountered a significant bottleneck with reduce operations. Well, the best solution I found is
#pragma omp parallel for reduction
in OpenMP. I have yet to test its performance, though.
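
For reference, here is a minimal sketch of what that pragma could look like for sum and max. It is untested for performance. The function names matrix_sum and matrix_max are made up for illustration, and it assumes the matrix is stored contiguously in row-major order as doubles. The max reduction needs OpenMP 3.1 or later.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sum of all elements of a contiguous,
   row-major rows x cols matrix of doubles. */
double matrix_sum(const double *a, size_t rows, size_t cols)
{
    ptrdiff_t n = (ptrdiff_t)(rows * cols);
    double total = 0.0;

    /* Each thread accumulates a private partial sum; OpenMP combines
       the partial sums into total when the loop finishes. */
    #pragma omp parallel for reduction(+:total)
    for (ptrdiff_t i = 0; i < n; i++) {
        total += a[i];
    }
    return total;
}

/* Hypothetical helper: largest element of the same kind of matrix.
   Assumes the matrix has at least one element. */
double matrix_max(const double *a, size_t rows, size_t cols)
{
    ptrdiff_t n = (ptrdiff_t)(rows * cols);
    assert(n > 0);
    double best = a[0];

    /* Each thread tracks a private maximum; OpenMP merges them with
       the initial value of best at the end (OpenMP 3.1 or later). */
    #pragma omp parallel for reduction(max:best)
    for (ptrdiff_t i = 1; i < n; i++) {
        if (a[i] > best) {
            best = a[i];
        }
    }
    return best;
}
```

With GCC or Clang this needs the -fopenmp flag; without it the pragmas are ignored and the loops simply run serially. Note that a parallel floating-point sum adds the elements in a different order than a serial loop, so the last few digits of the result can differ between runs or thread counts.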