OpenCL boolean expression unwanted lazy evaluation

387 Views Asked by At

From OpenCL 2.0 specification, chapter "6.3 Operators", page 29:

g. The logical operators and (&&), or (||) operate on all scalar and vector built-in types. For scalar built-in types only, and (&&) will only evaluate the right hand operand if the left hand operand compares unequal to 0. For scalar built-in types only, or (||) will only evaluate the right hand operand if the left hand operand compares equal to 0. For built-in vector types, both operands are evaluated and the operators are applied component-wise. If one operand is a scalar and the other is a vector, the scalar may be subject to the usual arithmetic conversion to the element type used by the vector operand. The scalar type is then widened to a vector that has the same number of components as the vector operand. The operation is done component-wise resulting in the same size vector.

This means that using expressions with logical operators will result in branching and thread divergence, which in turn results in a loss of performance on some parallel platforms. Example:

int min_nonzero(int a, int b)
{
    return (a < b && a != 0)? a : b; // branch
}

This can be partially fixed, as:

int min_nonzero(int a, int b)
{
    return select(b, a, a < b && a != 0); // branch because of &&
}

Where the built-in function select will likely be implemented using arithmetics to avoid branching (e.g. as linear interpolation). But there is still the branch in &&. A possibly better way:

int min_nonzero(int a, int b)
{
    return select(b, a, (int)(a < b) & (int)(a != 0)); // branch free
}

But that quickly becomes unreadable.

So my question is: is there a better way of convincing the OpenCL compiler to abandon the lazy evaluation of the boolean expressions (not globally but in selected cases)?


Below is my practical experimentation in this matter, not really a question anymore. Lazy evaluation is still required in some cases, such as in:

if(i < N && array[i] == x) // will go OOB without lazy evaluation

So it is quite unlikely that the optimizer would disable it entirely or at least in all applicable cases.

I'm looking at some PTX generated by NVIDIA 320.49 driver and it will only optimize cases with no array access on the right:

if(p[i] == n_end && i)
    return;

compiles to a single branch:

    setp.ne.s32     %p2, %r17, %r5; // p[i] != n_end
    setp.eq.s32     %p3, %r28, 0; // !i
    or.pred     %p4, %p2, %p3; // (p[i] != n_end || !i) = !(p[i] == n_end && i)
    @%p4 bra    BB2_3; // branch
    ret;

BB2_3:

Yet this:

int n_increment = 1;
for(++ i; i < n_cols_B && p[i + 1] == n_end; ++ i)
    ++ n_increment;

compiles to:

    mov.u32     %r29, 1;

BB2_4:
    mov.u32     %r6, %r28;
    add.s32     %r8, %r6, 1;
    ld.param.u32    %r24, [Fill_ColsTailFlags_v0_const_param_0];
    setp.ge.u32     %p5, %r8, %r24;
    @%p5 bra    BB2_6; // branch if i < n_cols_B

    shl.b32     %r19, %r6, 2;
    ld.param.u32    %r25, [Fill_ColsTailFlags_v0_const_param_1];
    add.s32     %r20, %r19, %r25;
    ld.const.u32    %r21, [%r20+8];
    setp.eq.s32     %p6, %r21, %r5;
    @%p6 bra    BB2_7; // branch if p[i + 1] == n_end

BB2_6:
    shl.b32     %r22, %r5, 2;
    ld.param.u32    %r27, [Fill_ColsTailFlags_v0_const_param_2];
    add.s32     %r23, %r27, %r22;
    st.global.u32   [%r23], %r29;
    ret;

BB2_7:
    add.s32     %r29, %r29, 1;
    mov.u32     %r28, %r8;
    bra.uni     BB2_4;

Seems like it is shy of the array access as it doesn't know how to analyze correctness of the array access with respect to the condition on the left. Switching order of the conditions to p[i + 1] == n_end && i < n_cols_B gets rid of the branch in this case. Changing the index to a constant i < n_cols_B && B_p[j] == n_end (where j = get_global_id(0) is initialized at the beginning) does not get rid of the branch.

0

There are 0 best solutions below