Branchless selections

November 10, 2008 christer 18 Comments

Some CPU instructions sets offer a way of selecting bits from either of two registers based on the bits of a third register. For example, the SPUs of the CELL in the PS3 have the selb instruction:

selb rt,ra,rb,rc

The result of selb is placed in rt and each bit of rt is either the corresponding bit from ra (if the bit in rc is 0) or from rb (if the bit in rc is 1).

Expressed in a familiar C-style notation, for all bits b in parallel, we are executing the statement:

if (rc_b)
    rt_b = rb_b;
else
    rt_b = ra_b;

It is hopefully pretty evident that if no selb-like instruction is present, we can still express this selection to perform all if-statements in parallel, without using branches, through the expression:

rt = (rb & rc) | (ra & ~rc);

Not as obvious (at least not at first glance) is the old-school bit trick for implementing the same operation:

rt = ((ra ^ rb) & rc) ^ ra;

Chris Crawford points out that asking what that last line of code does, used to be an interview question at Atari back in 1979 (except, of course, phrased in 6502 assembly language). Would you have been hired?

18 thoughts on “Branchless selections”

Rim says:

November 11, 2008 at 12:44 am

I’ll thankfully hide behind the fact that I wasn’t even born back then, but I wouldn’t have been hired. Bitwise operations don’t seem to get the attention they deserve anymore though. At least when I was taught Java back in the day, I don’t recall hearing a single word on these operators.

Out of cursiosity, why did that last trick get thought up? Is it really the single bit operation less that makes the difference? If so, I don’t know if I regret not being born earlier :)

Log in to Reply
hcpizzi says:

November 11, 2008 at 6:03 am

Actually, if ra and rb are constants (as is the case in Chris Crawford’s use) you can also precompute ra ^ rb, saving yet another instruction.

Log in to Reply
Kalms says:

November 11, 2008 at 1:13 pm

Also, the second form allows you to compute the result without needing any temporary registers, like:

rt = ra;
rt ^= rb;
rt &= rc;
rt ^= ra;

This was probably more important “back then” (when CPUs had approx. 5 registers and most ALU ops required src1/dest to be the accumulator).

Log in to Reply
SolomonS says:

November 11, 2008 at 2:06 pm

Here’s a fun related problem:

Given 2 unsigned 32 bit integers a and b, how would you compute a branchless version of max(a,b), without using a conditional move instruction?

Log in to Reply
supzi says:

November 11, 2008 at 3:54 pm

I think I have a solution to your problem Solomon, here is the pseudocode :

diff = (a – b);
diff_sign = diff >> (size_of_int-1);
max = b * diff_sign + a * (1-diff_sign);

the last line can be replaced by a branchless selection

Log in to Reply
SolomonS says:

November 11, 2008 at 6:15 pm

supzi, this works for a limited range, but not for the entire 32 bits. For example, let a = 0xFFFFFFFF, and b = 1.

Log in to Reply

“Out of curiosity, why did that last trick get thought up?”

Rim, in the good ol’ days you would see it in, amongst other things, sprite code on the Apple II. It took some google-fu to find an example of this use, but here’s some 6502 code from Bill Budge et al’s December 1984 BYTE magazine article “Preshift-Table graphics on Your Apple” (from retyped code listing posted to comp.sys.apple2.programmer):

        ...
; ACTUAL DRAW CODE STARTS HERE.  NOTE THAT THE
; RIGHTHAND AND LEFTHAND EDGE BYTES ARE TREATED
; DIFFERENTLY BECAUSE THEY MAY BE PARTIAL-BYTE
; WRITE OPERATIONS
;
        LDY SBWIDTH   ; SHIFTED DATA BYTE WIDTH
        LDX TEMP      ; GET BACK THE ROW #
;
; DO THE RIGHTHAND IMAGE BYTE
;
        LDA (BASE1),Y ; GET A SCREEN BYTE
        EOR BUFFER,Y  ; XOR WITH IMAGE DATA
        AND RMASK     ; MASK THE UNWRITTEN AREA
        EOR (BASE1),Y ; RESTORE SCRN AND IMAGE
        JMP DRAW2
;
; FAST LOOP TO DRAW IN-BETWEEN BYTES
;
DRAW1   LDA BUFFER,Y  ; GET AN IMAGE BYTE
DRAW2   STA (BASE1),Y ; WRITE BYTE TO SCREEN
        DEY           ; DECREMENT COUNT
        BNE DRAW1     ; LOOP BACK IF NOT DONE
;
; FINISH UP WITH THE LEFTHAND BYTE
;
DRAW3   LDA (BASE1),Y ; GET A SCREEN BYTE
        EOR BUFFER,Y  ; XOR WITH IMAGE BYTE
        AND LMASK     ; MASK UNWANTED PART
        EOR (BASE1),Y ; RESTORE SCRN AND IMAGE
        STA (BASE1),Y ; AND WRITE TO SCREEN 
        ...

“Is it really the single bit operation less that makes the difference? If so, I don’t know if I regret not being born earlier :)”

Oh, single cycles made all the difference back then!

The above isn’t necessarily the best way of doing sprites, btw. I would probably have opted for this approach myself:

        LDA (BASE1),Y ; GET A SCREEN BYTE
        AND RMASK     ; STENCIL OUT A HOLE
        OR BUFFER,Y   ; OR WITH IMAGE DATA
        ...

hcpizzi says:

November 12, 2008 at 7:46 am

Right, no temporaries, that’s probably the best advantage! Now that is really impressive, love this tricks and all the though behind them!

Log in to Reply
supzi says:

December 1, 2008 at 3:18 pm

Sorry for the very late reply Solomon,

I don’t see why it is limited to a specific range, the solution itself is just pseudocode, you can always adapt it to your needs

Here’s a c++ solution that is compatible with 32 bits unsigned int :

typedef long long int64;
typedef unsigned long long uint64;
typedef unsigned int uint;

int64 diff( uint a, uint b ) { return (int64( a ) – int64( b )); }
uint sign( int64 val ) { return uint( uint64( val ) >> ( (sizeof( uint64 ) * 8) -1 )); }
uint branchless_sel( uint r1, uint r2, uint val ) { return r1*(1-val) + r2*val; }
uint max_branchless( uint r1, uint r2 ) { return branchless_sel(r1, r2, sign(diff(r1, r2)) ); }

Log in to Reply
SolomonS says:

December 2, 2008 at 1:31 am

supzi, can you think of a solution that doesn’t require 64 bit ints? or 33 bit ints for that matter.

Log in to Reply
guardian says:

December 9, 2008 at 5:04 am

Hello,

In today’s applications, where would this trick find a use (apart from sprite code)?

Log in to Reply
supzi says:

December 10, 2008 at 4:17 pm

Solomon,

If I can’t use 64 bits ints or 33 bits, in other words there’s no way to know if a subtraction did an overflow then I would consider splitting the 32 bits, check the first bit, then check the remaining bits, something like this :

const int uint_bit_shift = sizeof( uint )*8-1;
const unsigned int uint_bit_mask = ~(1 > uint_bit_shift, r2 >> uint_bit_shift ) );
uint last_bits_sign = sign( diff( r1 & uint_bit_mask, r2 & uint_bit_mask ) );

return branchless_sel(r1, r2, first_bit_sign | last_bits_sign );

May be not the best solution ever but it is still branchless!

I presume this is not the solution you are waiting for, perhaps you have a cool trick that you want to share?

Log in to Reply
supzi says:

December 10, 2008 at 4:24 pm

ooops, there’s an error my previous post, this is how it should be :

const unsigned int uint_bit_mask = unsigned int(-1)/2;
uint first_bit_sign = sign( diff( r1 >> uint_bit_shift, r2 >> uint_bit_shift ) );
uint last_bits_sign = sign( diff( r1 & uint_bit_mask, r2 & uint_bit_mask ) );

Log in to Reply
SolomonS says:

December 10, 2008 at 10:21 pm

supzi, this is not my trick, but it is branchless, since the cmp instruction isn’t a branch.

unsigned int _max(unsigned int x, unsigned int y)
{
unsigned int r = x – ((x – y) & -(x

Log in to Reply
SolomonS says:

December 10, 2008 at 10:23 pm

Sorry, it got cut off. Let’s try again.

unsigned int _max(unsigned int x, unsigned int y)
{
unsigned int r = x – ((x-y) & -(x

Log in to Reply
tony says:

December 14, 2008 at 10:49 pm

GoW3 teaser,

Neogaf has gone bizarre over the footage. Underwhelming is what they are saying, thanks to Jaffe’s comment about art coming to life.
But I hope you guys will deliver it, no doubt.

Log in to Reply
Pingback: Branch free Clamp() « Miles Macklin
EddieEdwards says:

May 13, 2009 at 8:21 am

I like the XOR trick. Very appropriate for SSE2.

Log in to Reply

Posts

Branchless selections

18 thoughts on “Branchless selections”

Leave a Reply Cancel reply

Related posts:

18 thoughts on “Branchless selections”

Leave a Reply Cancel reply