Branchless selections

Some CPU instructions sets offer a way of selecting bits from either of two registers based on the bits of a third register. For example, the SPUs of the CELL in the PS3 have the selb instruction:

selb rt,ra,rb,rc

The result of selb is placed in rt and each bit of rt is either the corresponding bit from ra (if the bit in rc is 0) or from rb (if the bit in rc is 1).

Expressed in a familiar C-style notation, for all bits b in parallel, we are executing the statement:

if (rcb)
    rtb = rbb;
else
    rtb = rab;

It is hopefully pretty evident that if no selb-like instruction is present, we can still express this selection to perform all if-statements in parallel, without using branches, through the expression:

rt = (rb & rc) | (ra & ~rc);

Not as obvious (at least not at first glance) is the old-school bit trick for implementing the same operation:

rt = ((ra ^ rb) & rc) ^ ra;

Chris Crawford points out that asking what that last line of code does, used to be an interview question at Atari back in 1979 (except, of course, phrased in 6502 assembly language). Would you have been hired?

Similar Posts:

Share and Enjoy:
  • Digg
  • del.icio.us
  • Facebook
  • Reddit
  • Slashdot
  • StumbleUpon
  • Technorati
  • LinkedIn

18 Comments »

  1. Rim said,

    November 11, 2008 @ 12:44 am

    I’ll thankfully hide behind the fact that I wasn’t even born back then, but I wouldn’t have been hired. Bitwise operations don’t seem to get the attention they deserve anymore though. At least when I was taught Java back in the day, I don’t recall hearing a single word on these operators.

    Out of cursiosity, why did that last trick get thought up? Is it really the single bit operation less that makes the difference? If so, I don’t know if I regret not being born earlier :)

  2. hcpizzi said,

    November 11, 2008 @ 6:03 am

    Actually, if ra and rb are constants (as is the case in Chris Crawford’s use) you can also precompute ra ^ rb, saving yet another instruction.

  3. Kalms said,

    November 11, 2008 @ 1:13 pm

    Also, the second form allows you to compute the result without needing any temporary registers, like:

    rt = ra;
    rt ^= rb;
    rt &= rc;
    rt ^= ra;

    This was probably more important “back then” (when CPUs had approx. 5 registers and most ALU ops required src1/dest to be the accumulator).

  4. SolomonS said,

    November 11, 2008 @ 2:06 pm

    Here’s a fun related problem:

    Given 2 unsigned 32 bit integers a and b, how would you compute a branchless version of max(a,b), without using a conditional move instruction?

  5. supzi said,

    November 11, 2008 @ 3:54 pm

    I think I have a solution to your problem Solomon, here is the pseudocode :

    diff = (a - b);
    diff_sign = diff >> (size_of_int-1);
    max = b * diff_sign + a * (1-diff_sign);

    the last line can be replaced by a branchless selection

  6. SolomonS said,

    November 11, 2008 @ 6:15 pm

    supzi, this works for a limited range, but not for the entire 32 bits. For example, let a = 0xFFFFFFFF, and b = 1.

  7. christer said,

    November 12, 2008 @ 1:02 am

    “Out of curiosity, why did that last trick get thought up?”

    Rim, in the good ol’ days you would see it in, amongst other things, sprite code on the Apple II. It took some google-fu to find an example of this use, but here’s some 6502 code from Bill Budge et al’s December 1984 BYTE magazine article “Preshift-Table graphics on Your Apple” (from retyped code listing posted to comp.sys.apple2.programmer):

            ...
    ; ACTUAL DRAW CODE STARTS HERE.  NOTE THAT THE
    ; RIGHTHAND AND LEFTHAND EDGE BYTES ARE TREATED
    ; DIFFERENTLY BECAUSE THEY MAY BE PARTIAL-BYTE
    ; WRITE OPERATIONS
    ;
            LDY SBWIDTH   ; SHIFTED DATA BYTE WIDTH
            LDX TEMP      ; GET BACK THE ROW #
    ;
    ; DO THE RIGHTHAND IMAGE BYTE
    ;
            LDA (BASE1),Y ; GET A SCREEN BYTE
            EOR BUFFER,Y  ; XOR WITH IMAGE DATA
            AND RMASK     ; MASK THE UNWRITTEN AREA
            EOR (BASE1),Y ; RESTORE SCRN AND IMAGE
            JMP DRAW2
    ;
    ; FAST LOOP TO DRAW IN-BETWEEN BYTES
    ;
    DRAW1   LDA BUFFER,Y  ; GET AN IMAGE BYTE
    DRAW2   STA (BASE1),Y ; WRITE BYTE TO SCREEN
            DEY           ; DECREMENT COUNT
            BNE DRAW1     ; LOOP BACK IF NOT DONE
    ;
    ; FINISH UP WITH THE LEFTHAND BYTE
    ;
    DRAW3   LDA (BASE1),Y ; GET A SCREEN BYTE
            EOR BUFFER,Y  ; XOR WITH IMAGE BYTE
            AND LMASK     ; MASK UNWANTED PART
            EOR (BASE1),Y ; RESTORE SCRN AND IMAGE
            STA (BASE1),Y ; AND WRITE TO SCREEN
            ...
    

    “Is it really the single bit operation less that makes the difference? If so, I don’t know if I regret not being born earlier :)”

    Oh, single cycles made all the difference back then!

    The above isn’t necessarily the best way of doing sprites, btw. I would probably have opted for this approach myself:

            LDA (BASE1),Y ; GET A SCREEN BYTE
            AND RMASK     ; STENCIL OUT A HOLE
            OR BUFFER,Y   ; OR WITH IMAGE DATA
            ...
    
  8. hcpizzi said,

    November 12, 2008 @ 7:46 am

    Right, no temporaries, that’s probably the best advantage! Now that is really impressive, love this tricks and all the though behind them!

  9. supzi said,

    December 1, 2008 @ 3:18 pm

    Sorry for the very late reply Solomon,

    I don’t see why it is limited to a specific range, the solution itself is just pseudocode, you can always adapt it to your needs

    Here’s a c++ solution that is compatible with 32 bits unsigned int :

    typedef long long int64;
    typedef unsigned long long uint64;
    typedef unsigned int uint;

    int64 diff( uint a, uint b ) { return (int64( a ) - int64( b )); }
    uint sign( int64 val ) { return uint( uint64( val ) >> ( (sizeof( uint64 ) * 8) -1 )); }
    uint branchless_sel( uint r1, uint r2, uint val ) { return r1*(1-val) + r2*val; }
    uint max_branchless( uint r1, uint r2 ) { return branchless_sel(r1, r2, sign(diff(r1, r2)) ); }

  10. SolomonS said,

    December 2, 2008 @ 1:31 am

    supzi, can you think of a solution that doesn’t require 64 bit ints? or 33 bit ints for that matter.

  11. guardian said,

    December 9, 2008 @ 5:04 am

    Hello,

    In today’s applications, where would this trick find a use (apart from sprite code)?

  12. supzi said,

    December 10, 2008 @ 4:17 pm

    Solomon,

    If I can’t use 64 bits ints or 33 bits, in other words there’s no way to know if a subtraction did an overflow then I would consider splitting the 32 bits, check the first bit, then check the remaining bits, something like this :

    const int uint_bit_shift = sizeof( uint )*8-1;
    const unsigned int uint_bit_mask = ~(1 > uint_bit_shift, r2 >> uint_bit_shift ) );
    uint last_bits_sign = sign( diff( r1 & uint_bit_mask, r2 & uint_bit_mask ) );

    return branchless_sel(r1, r2, first_bit_sign | last_bits_sign );

    May be not the best solution ever but it is still branchless!

    I presume this is not the solution you are waiting for, perhaps you have a cool trick that you want to share?

  13. supzi said,

    December 10, 2008 @ 4:24 pm

    ooops, there’s an error my previous post, this is how it should be :

    const unsigned int uint_bit_mask = unsigned int(-1)/2;
    uint first_bit_sign = sign( diff( r1 >> uint_bit_shift, r2 >> uint_bit_shift ) );
    uint last_bits_sign = sign( diff( r1 & uint_bit_mask, r2 & uint_bit_mask ) );

  14. SolomonS said,

    December 10, 2008 @ 10:21 pm

    supzi, this is not my trick, but it is branchless, since the cmp instruction isn’t a branch.

    unsigned int _max(unsigned int x, unsigned int y)
    {
    unsigned int r = x - ((x - y) & -(x

  15. SolomonS said,

    December 10, 2008 @ 10:23 pm

    Sorry, it got cut off. Let’s try again.

    unsigned int _max(unsigned int x, unsigned int y)
    {
    unsigned int r = x - ((x-y) & -(x

  16. tony said,

    December 14, 2008 @ 10:49 pm

    GoW3 teaser,

    Neogaf has gone bizarre over the footage. Underwhelming is what they are saying, thanks to Jaffe’s comment about art coming to life.
    But I hope you guys will deliver it, no doubt.

  17. Branch free Clamp() « Miles Macklin said,

    January 9, 2009 @ 4:05 pm

    […] http://realtimecollisiondetection.net/blog/?p=90 […]

  18. EddieEdwards said,

    May 13, 2009 @ 8:21 am

    I like the XOR trick. Very appropriate for SSE2.

RSS feed for comments on this post · TrackBack URI

Leave a Comment

You must be logged in to post a comment.