How to avoid branching when finding runs of the same value and storing as a range (like run-length encoding)?

134 Views Asked by At

I have the following logic:

struct Range {
  int start;
  int end;
};

bool prev = false;
Range range;
std::vector<Range> result;
for (int i = 0; i < n; i++) {
  bool curr = ...; // this is random and causes prediction misses
  if (curr && !prev)
    range.start = i;
  if (!curr && prev){
    range.end = i-1;
    result.push_back(range);
  }
  prev = curr;
}
// handles edge case at the end

How can I remove branching in this code? The branching pattern is like run-length encoding, but we're storing start/end of runs of true1.

I tried storing the boolean results as 1s and 0s so that std::adjacent_difference could give me an array of ones and zeros, where a one corresponds to a start and a zero corresponds to an end, but how do I construct results from here?

Footnote 1: classic RLE would store narrow (e.g. 8-bit) lengths for both false and true, so position is implicit. But also an explicit value so long runs could be encoded as multiple runs of 255, or 127 if we used 1-byte entries with one bit holding the true/false value and the other 7 holding a length.

3

There are 3 best solutions below

6
harold On

A technique you can use is to unconditionally store a Range to the result in every iteration, but only conditionally increment the index at which you store it. That conditional increment can be performed with a conditional move or plain arithmetic (of course there is no guarantee that it will be, but it's at least possible, check the resulting assembly code), unlike a conditional push_back. Ranges that wouldn't be stored in the original code, are now stored but overwritten in the next iteration. Ranges that would be stored, are kept by incrementing the index so that the next store doesn't overwrite it.

In order to do this, the vector needs sufficient size first (it doesn't need to be quite as high as n I think, but you can figure out those details), otherwise the writes are out of bounds. You can set the size to the actual size in the end.

For example,

#include <vector>
#include <stddef.h>

struct Range {
  int start;
  int end;
};

extern bool get_bool();

std::vector<Range> do_thing(size_t n)
{
    bool prev = false;
    Range range;
    std::vector<Range> result;
    result.resize(n);
    size_t index = 0;
    for (size_t i = 0; i < n; i++) {
        bool curr = get_bool();
        int mask = -(curr && !prev);
        range.start ^= (range.start ^ i) & mask;
        //range.start = (range.start & ~mask) | (i & mask); // alternative
        range.end = i-1;
        result[index] = range; // always store a range
        index += !curr && prev;
        prev = curr;
    }
    result.resize(index);
    return result;
}

But is it branchless?. Apparently yes. Clang is a bit more forgiving in this respect than GCC and also made branchless code if we don't use manual arithmetic tricks but the manual trickery was needed for GCC.

This also may not be more efficient, depending on how predictable the branches that are avoided this way were.

Clang 15 apparently turned the conditional update of the start into this:

    test    al, al
    mov     ecx, ebp
    cmovne  ecx, ebx
    lea     edx, [rbx - 1]  ; the result of this lea is used later, not relevant for this snippet
    test    r13b, r13b
    cmove   ebp, ecx

So using conditional moves instead of branches.

And the conditional increment of the index into:

    setne   cl
    mov     edx, eax
    xor     dl, 1
    and     dl, cl
    movzx   ecx, dl
    add     r12, rcx

Depending on where the values of curr come from it may even be possible to use SIMD for this. Here's a sketch:

The start indices and end indices can be filtered out separately and "compressed" (keeping only the indices for which curr && !prev or !curr && prev respectively) separately, then they will naturally "pair up". The locations with a start or end could be found by simple mask manipulation, use vpcompressd (with AVX512) or suitable emulation based on pshufb (pre-AVX512) or whatever your SIMD ISA of choice exposes to "compress" (aka left pack) the indices, do an unaligned store, and increase the destination pointer by std::popcount(mask).

0
Mooing Duck On

If the data is really random, then this isn't really optimizable. If the data tends to have runs, then it might be optimized like so:

result.reserve(lots);
int start = 0;
int i = 0;
while(true) { //this loop is unconditional. no branch
  while(i<n && !getCur(i)) //find next start. only branch miss should be when it finds a start
    i++;
  start = i;
  while(i<n && getCur(i)) //find next end. only brach miss should be when it finds an end.
    i++;
  if (i >= n) break;
  result.emplace_back({start, i-i});
}

https://godbolt.org/z/qrevovzna Both clang and GCC implemented each loop with 7-8 ops. They each have two branches (One for i<n and one for the result of getCur to either continue or exit the loop) but if your data has runs, then those would have a very high prediction rate.

This variant may incur function call overhead, depending on the complexity of getCur. If getCur is inlined, then it may require ~2x as much space in the instruction cache, so might not get the expected gains.


I tried storing the boolean results as 1s and 0s so that std::adjacent_difference

That just moves the branching elsewhere, and doesn't actually improve anything.

1
Remy Lebeau On

Not sure if this is branchless the way you want, but it would eliminate the need for using any ifs while iterating:

struct Range {
  int start;
  int end;
};

std::vector<Range> result;
if (n > 0) {
  result.reserve(n);
  Range range{0,0};

  // using std::function with capturing lambdas would be cleaner,
  // but since the purpose of this exercise seems to be about
  // optimizing, using plain function pointers instead...
  using vec = std::vector<Range>;
  using func = void(*)(vec&, Range&, int);
  func actions[2] = {
    +[](vec&, Range &r, int) {},
    +[](vec &v, Range &r, int curr) {
      r.end = curr - 1;
      v.push_back(r);
      r.start = curr;
    }
  };

  bool curr = ...; // get 1st value
  for (int i = 1; i < n; ++i) {
    bool val = ...; // get next value
    actions[val != curr](result, range, i);
    curr = val;
  }

  range.end = n - 1;
  result.push_back(range);
}

// use result as needed...

Online Demo

But really, that's just an ugly way to avoid using 1 if in the loop, which honestly probably isn't worth trying to avoid, IMHO:

struct Range {
  int start;
  int end;
};

std::vector<Range> result;
if (n > 0) {
  result.reserve(n);
  Range range{0,0};

  bool curr = ...; // get 1st value
  for (int i = 1; i < n; ++i) {
    bool val = ...; // get next value
    if (val != curr) {
      range.end = i - 1;
      result.push_back(range);
      range.start = i;
    }
    curr = val;
  }

  range.end = n - 1;
  result.push_back(range);
}

Online Demo

Don't over-complicate your code if you don't have to.