Some CPU instructions sets offer a way of selecting bits from either of two registers based on the bits of a third register. For example, the SPUs of the CELL in the PS3 have the selb instruction:
selb rt,ra,rb,rc
The result of selb is placed in rt and each bit of rt is either the corresponding bit from ra (if the bit in rc is 0) or from rb (if the bit in rc is 1).
Expressed in a familiar C-style notation, for all bits b in parallel, we are executing the statement:
if (rcb) rtb = rbb; else rtb = rab;
It is hopefully pretty evident that if no selb-like instruction is present, we can still express this selection to perform all if-statements in parallel, without using branches, through the expression:
rt = (rb & rc) | (ra & ~rc);
Not as obvious (at least not at first glance) is the old-school bit trick for implementing the same operation:
rt = ((ra ^ rb) & rc) ^ ra;
Chris Crawford points out that asking what that last line of code does, used to be an interview question at Atari back in 1979 (except, of course, phrased in 6502 assembly language). Would you have been hired?
I’ll thankfully hide behind the fact that I wasn’t even born back then, but I wouldn’t have been hired. Bitwise operations don’t seem to get the attention they deserve anymore though. At least when I was taught Java back in the day, I don’t recall hearing a single word on these operators.
Out of cursiosity, why did that last trick get thought up? Is it really the single bit operation less that makes the difference? If so, I don’t know if I regret not being born earlier :)
Actually, if ra and rb are constants (as is the case in Chris Crawford’s use) you can also precompute ra ^ rb, saving yet another instruction.
Also, the second form allows you to compute the result without needing any temporary registers, like:
rt = ra;
rt ^= rb;
rt &= rc;
rt ^= ra;
This was probably more important “back then” (when CPUs had approx. 5 registers and most ALU ops required src1/dest to be the accumulator).
Here’s a fun related problem:
Given 2 unsigned 32 bit integers a and b, how would you compute a branchless version of max(a,b), without using a conditional move instruction?
I think I have a solution to your problem Solomon, here is the pseudocode :
diff = (a – b);
diff_sign = diff >> (size_of_int-1);
max = b * diff_sign + a * (1-diff_sign);
the last line can be replaced by a branchless selection
supzi, this works for a limited range, but not for the entire 32 bits. For example, let a = 0xFFFFFFFF, and b = 1.
“Out of curiosity, why did that last trick get thought up?”
Rim, in the good ol’ days you would see it in, amongst other things, sprite code on the Apple II. It took some google-fu to find an example of this use, but here’s some 6502 code from Bill Budge et al’s December 1984 BYTE magazine article “Preshift-Table graphics on Your Apple” (from retyped code listing posted to comp.sys.apple2.programmer):
“Is it really the single bit operation less that makes the difference? If so, I don’t know if I regret not being born earlier :)”
Oh, single cycles made all the difference back then!
The above isn’t necessarily the best way of doing sprites, btw. I would probably have opted for this approach myself:
Right, no temporaries, that’s probably the best advantage! Now that is really impressive, love this tricks and all the though behind them!
Sorry for the very late reply Solomon,
I don’t see why it is limited to a specific range, the solution itself is just pseudocode, you can always adapt it to your needs
Here’s a c++ solution that is compatible with 32 bits unsigned int :
typedef long long int64;
typedef unsigned long long uint64;
typedef unsigned int uint;
int64 diff( uint a, uint b ) { return (int64( a ) – int64( b )); }
uint sign( int64 val ) { return uint( uint64( val ) >> ( (sizeof( uint64 ) * 8) -1 )); }
uint branchless_sel( uint r1, uint r2, uint val ) { return r1*(1-val) + r2*val; }
uint max_branchless( uint r1, uint r2 ) { return branchless_sel(r1, r2, sign(diff(r1, r2)) ); }
supzi, can you think of a solution that doesn’t require 64 bit ints? or 33 bit ints for that matter.
Hello,
In today’s applications, where would this trick find a use (apart from sprite code)?
Solomon,
If I can’t use 64 bits ints or 33 bits, in other words there’s no way to know if a subtraction did an overflow then I would consider splitting the 32 bits, check the first bit, then check the remaining bits, something like this :
const int uint_bit_shift = sizeof( uint )*8-1;
const unsigned int uint_bit_mask = ~(1 > uint_bit_shift, r2 >> uint_bit_shift ) );
uint last_bits_sign = sign( diff( r1 & uint_bit_mask, r2 & uint_bit_mask ) );
return branchless_sel(r1, r2, first_bit_sign | last_bits_sign );
May be not the best solution ever but it is still branchless!
I presume this is not the solution you are waiting for, perhaps you have a cool trick that you want to share?
ooops, there’s an error my previous post, this is how it should be :
const unsigned int uint_bit_mask = unsigned int(-1)/2;
uint first_bit_sign = sign( diff( r1 >> uint_bit_shift, r2 >> uint_bit_shift ) );
uint last_bits_sign = sign( diff( r1 & uint_bit_mask, r2 & uint_bit_mask ) );
supzi, this is not my trick, but it is branchless, since the cmp instruction isn’t a branch.
unsigned int _max(unsigned int x, unsigned int y)
{
unsigned int r = x – ((x – y) & -(x
Sorry, it got cut off. Let’s try again.
unsigned int _max(unsigned int x, unsigned int y)
{
unsigned int r = x – ((x-y) & -(x
GoW3 teaser,
Neogaf has gone bizarre over the footage. Underwhelming is what they are saying, thanks to Jaffe’s comment about art coming to life.
But I hope you guys will deliver it, no doubt.
I like the XOR trick. Very appropriate for SSE2.