In a land far far away (meaning some 10-20 years ago) it used to be very important to do all kinds of extreme optimizations to get games and other performance-sensitive code to run at a decent speed. Often, one important optimization was to remove expensive arithmetic operations like multiplications and — heaven forfend — divisions. Anyone who was around then tends to have amassed a lot of “tricks” for this, which these days are really just meaningless trivia occupying valuable brain real-estate. I’m hoping that writing “my” tricks down might help me purge them from memory. Hey, it’s worth a try!
Multiplies through table lookup
On certain old 8-bit machines, such as the Apple II which had a 6502 with no multiply instruction, you would do multiplies either “longhand style” in code, or through table lookups. Table-based multiplication would either use a table of squares
a*b = [Table(a+b) - Table(a-b)]/4, where
Table(x) = x^2
or a table of triangular numbers:
a*b = Table(a+b) - Table(a) - Table(b), where
Table(x) = x * (x + 1) / 2
With the table of squares either the numbers had to be swapped so a was the larger number to avoid a negative index in the second table lookup, or the table had to be set up to handle negative indices.
Matrix-vector multiplication
The normal way of multiplying a 3×3 matrix and a 3-vector takes 9 multiplies, 6 additions:
x = m00*v0+m01*v1+m02*v2
y = m10*v0+m11*v1+m12*v2
y = m20*v0+m21*v1+m22*v2
We can perform this operation in 7 multiplies, 15 additions as
u = v0*v1
x = (v0 + m10)*(v1 + m00) - t0 - u + v2*m20
y = (v0 + m11)*(v1 + m01) - t1 - u + v2*m21
z = (v0 + m12)*(v1 + m02) - t2 - u + v2*m22
where t0 = m00*m10, t1 = m01*m11, t2 = m02*m12 are 3 multiplies that can be amortized over all vectors being transformed by the same matrix.
2D rotation
A standard 2D rotation by an angle θ involves 4 multiplies, 2 additions, and two trigonometric calls:
x' = cos(θ)*x - sin(θ)*y
y' = sin(θ)*x + cos(θ)*y
The same operation can be done in 3 multiplies, 3 additions, and two trigonometric calls (as well as a division by two, which was usually handled by folding it into the trig table):
t = y + x * tan(θ/2)
x' = x - t * sin(θ)
y' = x' * tan(θ/2) + t
Here the relevant identities are sin(θ) = (1 + cos(θ)) * tan(θ/2) and cos(θ) = 1 - sin(θ) * tan(θ/2). Furthermore, this trick is really just an application of the fact that a rotation can be seen as repeated shears (see Alan Paeth’s article “A fast algorithm for general raster rotation” on page 179 of Graphics Gems).
Multiplication of complex numbers
Given two complex numbers C0 = (a+bi) and C1 = (c+di) we can multiply these in 4 multiplies, 3 additions as follows:
(a+bi)*(c+di)
= ac + adi + bci + bdi^2
= (ac - bd) + (ad + bc)i
This operation can be done in 3 multiplies, 5 additions as:
t0 = a(c+d)
t1 = d(a+b)
t2 = b(c-d)
ac - bd = t0 - t1
ad + bc = t1 + t2
Multiplication of two 3×3 matrices
The standard way of multiplying two general 3×3 matrices is 27 multiplies, 18 additions. We can get this down to 24 multiplies using Winograd’s algorithm:
A[i] = sum(k=1,n/2) a[i,2k-1]*a[i,2k]
B[j] = sum(k=1,n/2) b[2k-1,j]*b[2k,j]
c[i,j] = sum(k=1,n/2){(a[i,2k-1]+b[2k,j]) * (a[i,2k]+b[2k-1,j])}-A[i]-B[j]
Winograd’s algorithm assumes even matrix dimension n, but we can apply it for n = 3 by expanding for n = 4 and setting a[*,4], a[4,*], b[*,4], and b[4,*] to zero and simplifying. Doing this results in this 3×3 matrix multiplication:
A[1] = a[1,1]*a[1,2]
A[2] = a[2,1]*a[2,2]
A[3] = a[3,1]*a[3,2]
B[1] = b[1,1]*b[2,1]
B[2] = b[1,2]*b[2,2]
B[3] = b[1,3]*b[2,3]
c[1,1] = (a[1,1]+b[2,1])*(a[1,2]+b[1,1]) + a[1,3]*b[3,1] - A[1] - B[1]
c[1,2] = (a[1,1]+b[2,2])*(a[1,2]+b[1,2]) + a[1,3]*b[3,2] - A[1] - B[2]
c[1,3] = (a[1,1]+b[2,3])*(a[1,2]+b[1,3]) + a[1,3]*b[3,3] - A[1] - B[3]
c[2,1] = (a[2,1]+b[2,1])*(a[2,2]+b[1,1]) + a[2,3]*b[3,1] - A[2] - B[1]
c[2,2] = (a[2,1]+b[2,2])*(a[2,2]+b[1,2]) + a[2,3]*b[3,2] - A[2] - B[2]
c[2,3] = (a[2,1]+b[2,3])*(a[2,2]+b[1,3]) + a[2,3]*b[3,3] - A[2] - B[3]
c[3,1] = (a[3,1]+b[2,1])*(a[3,2]+b[1,1]) + a[3,3]*b[3,1] - A[3] - B[1]
c[3,2] = (a[3,1]+b[2,2])*(a[3,2]+b[1,2]) + a[3,3]*b[3,2] - A[3] - B[2]
c[3,3] = (a[3,1]+b[2,3])*(a[3,2]+b[1,3]) + a[3,3]*b[3,3] - A[3] - B[3]
This is 6 mults + 9*(2 mult + 5 add/subs) = 24 mults + 45 add/subs (or 69 flops in total). Note the similarities with the matrix-vector multiplication in 7 multiplies.
It is possible to perform a 3×3 matrix multiplication in 23 multiplies (and many more additions) using algorithms by Laderman [link], and Johnson and McLoughlin [link1][link2].
Others
OK, so there are some others too, such as quaternion multiplication in 8 multiplications and cross products in 5 multiplications, but this is what I had time for right now. What other old-school multiplication tricks did I leave out? Comments invited!
Well, how about log/exp tables for multiplications?
a * b = e^(log(a) + log(b) = eTable[logTable[a] + logTable[b]]
… where eTable[n] performs e^n, and logTable[n] performs log(n).
It has range and accuracy issues, but can nevertheless be handy when, say, transforming static 3D meshes. There, you can perform all the logarithm lookups in advance, and a dot product ends up as:
a * b + c * d + e * f = eTable[logA + logB] + eTable[logC + logD] + eTable[logE + logF]
So when transforming an entire vertex, setup 9 base pointers into the e^n table (one for each element in the 3×3 matrix) and you get:
x’ = base00[logX] + base01[logY] + base02[logZ]
y’ = base10[logX] + base11[logY] + base12[logZ]
z’ = base20[logX] + base21[logY] + base22[logZ]
… which can be done in roughly 9 instructions in total on the 68000 (depending on effective numerical range of the 3×3 matrix elements).
The log/exp tables were indeed quite useful: not only it was faster, but also it consumed a fixed amount of cycles compared to MULS and MULU on a 68000. So, using those log/exp tables was pretty much the only way to do 3D in “fullscreen” on Atari ST, for example. (Basically it was not possible to use multiplications and divisions in this mode).
Another similar trick was to rotate the bounding box of a 3D mesh, and simply compute N vectors along its rotated X, Y and Z axes using interpolation. Those N vectors are stored in 3 small LUTs, and then the object’s coordinates are simply used to index those LUTs, and the vectors are added together to compute the “correct” rotated 3D point. Since the interpolation was almost free, it effectively transformed the “9 muls into 9 adds”, as I wrote in a small article a loooong time ago. It was very effective on the 68000, and we used it pretty much everywhere :)
– Pierre