The AVX512CD instruction families are: VPCONFLICT, VPLZCNT and VPBROADCASTM.
The Wikipedia section about these instruction says:
The instructions in AVX-512 conflict detection (AVX-512CD) are designed to help efficiently calculate conflict-free subsets of elements in loops that normally could not be safely vectorized.
What are some examples that show these instruction being useful in vectorizing loops? It would be helpful if answers will include scalar loops and their vectorized counterparts.
Thanks!
One example where the CD instructions might be useful is histogramming. For scalar code histogramming is just a simple loop like this:
Normally you can't vectorize histogramming because you might have the same bin index more than once in a vector - you might naïvely try something like this:
but if any of the indices within a vector are the same then you get a conflict, and the resulting bin update will be incorrect.
So, CD instructions to the rescue:
In practice this example is quite inefficient and no better than scalar code, but there are other more compute-intensive examples where using the CD instructions seems to be worthwhile. Typically these will be simulations where the data elements are going to be updated in a non-deterministic fashion. One example (from the LAMMPS Molecular Dynamics Simulator) is referred to in the KNL book by Jeffers et al.