I have the following logic:
struct Range {
int start;
int end;
};
bool prev = false;
Range range;
std::vector<Range> result;
for (int i = 0; i < n; i++) {
bool curr = ...; // this is random and causes prediction misses
if (curr && !prev)
range.start = i;
if (!curr && prev){
range.end = i-1;
result.push_back(range);
}
prev = curr;
}
// handles edge case at the end
How can I remove branching in this code? The branching pattern is like run-length encoding, but we're storing start/end of runs of true1.
I tried storing the boolean results as 1s and 0s so that std::adjacent_difference could give me an array of ones and zeros, where a one corresponds to a start and a zero corresponds to an end, but how do I construct results from here?
Footnote 1: classic RLE would store narrow (e.g. 8-bit) lengths for both false and true, so position is implicit. But also an explicit value so long runs could be encoded as multiple runs of 255, or 127 if we used 1-byte entries with one bit holding the true/false value and the other 7 holding a length.
A technique you can use is to unconditionally store a
Rangeto the result in every iteration, but only conditionally increment the index at which you store it. That conditional increment can be performed with a conditional move or plain arithmetic (of course there is no guarantee that it will be, but it's at least possible, check the resulting assembly code), unlike a conditionalpush_back. Ranges that wouldn't be stored in the original code, are now stored but overwritten in the next iteration. Ranges that would be stored, are kept by incrementing the index so that the next store doesn't overwrite it.In order to do this, the vector needs sufficient size first (it doesn't need to be quite as high as
nI think, but you can figure out those details), otherwise the writes are out of bounds. You can set the size to the actual size in the end.For example,
But is it branchless?. Apparently yes. Clang is a bit more forgiving in this respect than GCC and also made branchless code if we don't use manual arithmetic tricks but the manual trickery was needed for GCC.
This also may not be more efficient, depending on how predictable the branches that are avoided this way were.
Clang 15 apparently turned the conditional update of the
startinto this:So using conditional moves instead of branches.
And the conditional increment of the index into:
Depending on where the values of
currcome from it may even be possible to use SIMD for this. Here's a sketch:The start indices and end indices can be filtered out separately and "compressed" (keeping only the indices for which
curr && !prevor!curr && prevrespectively) separately, then they will naturally "pair up". The locations with a start or end could be found by simple mask manipulation, usevpcompressd(with AVX512) or suitable emulation based onpshufb(pre-AVX512) or whatever your SIMD ISA of choice exposes to "compress" (aka left pack) the indices, do an unaligned store, and increase the destination pointer bystd::popcount(mask).