## Branchless selections

Some CPU instructions sets offer a way of selecting bits from either of two registers based on the bits of a third register. For example, the SPUs of the CELL in the PS3 have the **selb** instruction:

**selb rt,ra,rb,rc**

The result of **selb** is placed in **rt** and each bit of **rt** is either the corresponding bit from **ra** (if the bit in **rc** is 0) or from **rb** (if the bit in **rc** is 1).

Expressed in a familiar C-style notation, for all bits **b** in parallel, we are executing the statement:

if (rc_{b}) rt_{b}= rb_{b}; else rt_{b}= ra_{b};

It is hopefully pretty evident that if no selb-like instruction is present, we can still express this selection to perform all if-statements in parallel, without using branches, through the expression:

rt = (rb & rc) | (ra & ~rc);

Not as obvious (at least not at first glance) is the old-school bit trick for implementing the same operation:

rt = ((ra ^ rb) & rc) ^ ra;

Chris Crawford points out that asking what that last line of code does, used to be an interview question at Atari back in 1979 (except, of course, phrased in 6502 assembly language). Would you have been hired?

## Rim said,

November 11, 2008 @ 12:44 am

I’ll thankfully hide behind the fact that I wasn’t even born back then, but I wouldn’t have been hired. Bitwise operations don’t seem to get the attention they deserve anymore though. At least when I was taught Java back in the day, I don’t recall hearing a single word on these operators.

Out of cursiosity, why did that last trick get thought up? Is it really the single bit operation less that makes the difference? If so, I don’t know if I regret not being born earlier :)

## hcpizzi said,

November 11, 2008 @ 6:03 am

Actually, if

raandrbare constants (as is the case in Chris Crawford’s use) you can also precomputera ^ rb, saving yet another instruction.## Kalms said,

November 11, 2008 @ 1:13 pm

Also, the second form allows you to compute the result without needing any temporary registers, like:

rt = ra;

rt ^= rb;

rt &= rc;

rt ^= ra;

This was probably more important “back then” (when CPUs had approx. 5 registers and most ALU ops required src1/dest to be the accumulator).

## SolomonS said,

November 11, 2008 @ 2:06 pm

Here’s a fun related problem:

Given 2 unsigned 32 bit integers a and b, how would you compute a branchless version of max(a,b), without using a conditional move instruction?

## supzi said,

November 11, 2008 @ 3:54 pm

I think I have a solution to your problem Solomon, here is the pseudocode :

diff = (a - b);

diff_sign = diff >> (size_of_int-1);

max = b * diff_sign + a * (1-diff_sign);

the last line can be replaced by a branchless selection

## SolomonS said,

November 11, 2008 @ 6:15 pm

supzi, this works for a limited range, but not for the entire 32 bits. For example, let a = 0xFFFFFFFF, and b = 1.

## christer said,

November 12, 2008 @ 1:02 am

“Out of curiosity, why did that last trick get thought up?”

Rim, in the good ol’ days you would see it in, amongst other things, sprite code on the Apple II. It took some google-fu to find an example of this use, but here’s some 6502 code from Bill Budge et al’s December 1984 BYTE magazine article “Preshift-Table graphics on Your Apple” (from retyped code listing posted to comp.sys.apple2.programmer):

“Is it really the single bit operation less that makes the difference? If so, I don’t know if I regret not being born earlier :)”

Oh, single cycles made

allthe difference back then!The above isn’t necessarily the best way of doing sprites, btw. I would probably have opted for this approach myself:

## hcpizzi said,

November 12, 2008 @ 7:46 am

Right, no temporaries, that’s probably the best advantage! Now that is really impressive, love this tricks and all the though behind them!

## supzi said,

December 1, 2008 @ 3:18 pm

Sorry for the very late reply Solomon,

I don’t see why it is limited to a specific range, the solution itself is just pseudocode, you can always adapt it to your needs

Here’s a c++ solution that is compatible with 32 bits unsigned int :

typedef long long int64;

typedef unsigned long long uint64;

typedef unsigned int uint;

int64 diff( uint a, uint b ) { return (int64( a ) - int64( b )); }

uint sign( int64 val ) { return uint( uint64( val ) >> ( (sizeof( uint64 ) * 8) -1 )); }

uint branchless_sel( uint r1, uint r2, uint val ) { return r1*(1-val) + r2*val; }

uint max_branchless( uint r1, uint r2 ) { return branchless_sel(r1, r2, sign(diff(r1, r2)) ); }

## SolomonS said,

December 2, 2008 @ 1:31 am

supzi, can you think of a solution that doesn’t require 64 bit ints? or 33 bit ints for that matter.

## guardian said,

December 9, 2008 @ 5:04 am

Hello,

In today’s applications, where would this trick find a use (apart from sprite code)?

## supzi said,

December 10, 2008 @ 4:17 pm

Solomon,

If I can’t use 64 bits ints or 33 bits, in other words there’s no way to know if a subtraction did an overflow then I would consider splitting the 32 bits, check the first bit, then check the remaining bits, something like this :

const int uint_bit_shift = sizeof( uint )*8-1;

const unsigned int uint_bit_mask = ~(1 > uint_bit_shift, r2 >> uint_bit_shift ) );

uint last_bits_sign = sign( diff( r1 & uint_bit_mask, r2 & uint_bit_mask ) );

return branchless_sel(r1, r2, first_bit_sign | last_bits_sign );

May be not the best solution ever but it is still branchless!

I presume this is not the solution you are waiting for, perhaps you have a cool trick that you want to share?

## supzi said,

December 10, 2008 @ 4:24 pm

ooops, there’s an error my previous post, this is how it should be :

const unsigned int uint_bit_mask = unsigned int(-1)/2;

uint first_bit_sign = sign( diff( r1 >> uint_bit_shift, r2 >> uint_bit_shift ) );

uint last_bits_sign = sign( diff( r1 & uint_bit_mask, r2 & uint_bit_mask ) );

## SolomonS said,

December 10, 2008 @ 10:21 pm

supzi, this is not my trick, but it is branchless, since the cmp instruction isn’t a branch.

unsigned int _max(unsigned int x, unsigned int y)

{

unsigned int r = x - ((x - y) & -(x

## SolomonS said,

December 10, 2008 @ 10:23 pm

Sorry, it got cut off. Let’s try again.

unsigned int _max(unsigned int x, unsigned int y)

{

unsigned int r = x - ((x-y) & -(x

## tony said,

December 14, 2008 @ 10:49 pm

GoW3 teaser,

Neogaf has gone bizarre over the footage. Underwhelming is what they are saying, thanks to Jaffe’s comment about art coming to life.

But I hope you guys will deliver it, no doubt.

## Branch free Clamp() « Miles Macklin said,

January 9, 2009 @ 4:05 pm

[…] http://realtimecollisiondetection.net/blog/?p=90 […]

## EddieEdwards said,

May 13, 2009 @ 8:21 am

I like the XOR trick. Very appropriate for SSE2.