Hacker’s Delight #8: Matrix Flip

Hacker’s Delight (in homage to the excellent book of the same name) is a pretty cool, sporadically-run coding challenge organized by Jim Hejl (previously of EA Tiburon, now Director of Rendering Technology out of EA in Vancouver). Each new HD challenge provides an interesting optimization problem to solve, where you have to consider both choice of algorithm, cache effects, and operation counts. Oh yeah, if it wasn’t clear: fastest implementation wins!

The challenge isn’t open to the public (or, at least, wasn’t last time I checked—what is the status these days Jim?), but even if you’re not allowed to compete it’s still a fun exercise to try to work out an efficient solution to the posted problems.

As the deadline is up for Hacker’s Delight #8 and I can’t spoil things for anyone, I thought I’d post about the solution I came up with. This time around the problem was to, given an M by N matrix of integers, perform a minimum number of whole row and column flips to obtain the matrix for which the sums of all rows and columns are non-negative, and also so that the sum of all row and column sums is non-negative and a minimum.

To help illustrate that mouthful, if we started with e.g. this matrix

   1    3   -4   -2
   3   -2   -1   -4
   2   -4   -3   -1
  -4    1    2   -3

then we’d want to flip the first row and the second and fourth column to turn the matrix into

  -1    3    4   -2
   3    2   -1    4
   2    4   -3    1
  -4   -1    2    3

where each row and column sum into a non-negative number (4, 8, 4, 0 for the rows; 0, 8, 2, 6 for the columns), and the sums of all these row and column sums is 32 (4+8+4+0+0+8+2+6 = 32) which is as small as it can be. Furthermore, 3 flips is the least number of flips in which to obtain a sum of 32.

OK, now that we understand the problem, the first thing to note is that we’re being asked to find a global maximum, which generally means we have to explore the whole search space, while pruning as much of the space as possible to make this search as fast as possible. The second thing to note is that the flipping of the sign of a row or column is commutative and its own inverse, so it follows you have to, at most, flip each row or column once. Furthermore, this means there is an upper bound of M + N flips and it doesn’t matter in which order the flips are performed.

Clearly then a brute-force solution is to form a K bit number (K = M + N) and test the flips for all values of K (when seen as a binary number where 1 means to flip the row or column) and report the best flip sequence. Of course, for the supplied test problem with M = 27 and N = 9 we get K = 38, and testing 238 combinations will take… ehrm… a while! So can we do better than the brute-force solution? Something that won’t take days to run? Why, yes Maude, turns out we can!

By the way, I’m not suggesting my solution below is the best possible solution. I wouldn’t be surprised if someone submitted a better solution (we’ll see when the results get posted) but I think it’s still interesting as an example of how one can address, in turn, choice of algorithm, operation count, and overall cache efficiency.

Improving the algorithm
So, after coding the brute-force solution and seeing it take forever to run (I left it running overnight and absolutely nothing had happened the next day) I had to switch gears. After some thinking my new algorithm was to loop only over all permutations of the 2N columns. For each column permutation, I then loop over all rows to find the ones that are negative and flip those. These are sort of the obvious flips you have to do.

Unfortunately, flipping negative rows is not enough! We must also test-flip all rows that are zero, because such a flip could turn a column sum of -C to +C. There could be several zero-rows and one or more permutations of flips of these could lead to a valid state, so we must try all permutations of zero rows. In practice there aren’t very many zero rows, but worst-case with all rows being zero, this brings the complexity back to that of the brute-force solution.

So why do we have to examine all column permutations in the first place? Why couldn’t we, say, just flip the columns that are negative? Here’s a simple example illustrating why we must also consider flipping columns that are positive. Consider this input matrix:

2   1   2
1   4   1
2   1   2

Both the row and column sums are 5, 6, and 5, for a total sum of 2*(5+6+5)=32. However, if we flip the middle column (temporarily to negative) and then flip the middle row, we end up with row and column sums of 3, 2, and 3, for a total sum of 2*(3+2+3)=16 which is the best we can get for this input. Thus, clearly, we have to consider flipping columns from positive to negative as well.

Improving the operation count
In the outer loop I am iterating over every column permutation and applying it as flips to the matrix. The naive approach would be to take the current binary number B from the loop range [0, 2N), flip the columns that represent the one-bits of B, perform the row operations, and then unflip the columns before moving on to the next permutation. As each flip is quite expensive, this naive approach would be very bad! (The next paragraph shows just how bad.)

A much improved approach is to count over the range [0, 2N) in Gray code. Using Gray coding we only need to flip one column per iteration of the outer loop! Using binary coding we would need to do N*2N-1-1 flips to achieve the same task, so Gray coding is a massive improvement. We obtain the Gray code G from B as G = B ^ (B >> 1). Of course, it’s not really the Gray code we want but instead the bit we need to flip to go from one Gray code to the next; this GΔ is given by GΔ = B ^ −B.

Also, before we do anything else we want to transpose the matrix, if necessary, so we have M > N, where M is the rows and N the columns. We do this because the 2M or 2N iterations over the outer loop dominate, and by transposing we ensure we do a minimal 2min(M,N) outer loop operations.

(Okay so you could argue that counting in Gray code really is sort of an algorithm change too, but I’m trying to make a point that operation counts, er, count too sometimes, so even if I’m stretching things a little it’s for a good cause.)

We can also get some additional improvements by writing our own versions of FlipSignRow() and FlipSignCol() that update the row and column sums (which we need to track pretty much at all times) at the same time as we perform the flip, instead of in a separate pass over the data.

Improving caching
Finally, note that we have a choice to loop over the rows or columns for the outer loop. Why I picked columns? For each outer loop (column) flip I will be doing one or more inner loop (row) flips. As we want the more-frequently occuring inner loop flips to be as cache efficient as possible, the inner loop flips should be row flips and therefore the outer loop flips should be column flips.

Conclusion
In terms of algorithm complexity we’ve gone from O(2M+N) for the brute-force approach, to something like O(2min(M,N)+Z*max(M,N)), where Z is some number proportional to the average number of zero rows. For the sample matrix provided with the problem statement, the brute-force solution wasn’t done even when given something like 8 hours to execute overnight. The improved algorithm on the other hand solves the given matrix in a tiny fraction of a second!

I guess we’ll see the HD#8 results posted in a bit, but in the meanwhile, does anyone have a better solution? It feels like there should exist more clever approaches, such as some ingenious branch-and-bound solution or something along those lines.

1 thought on “Hacker’s Delight #8: Matrix Flip”

Leave a Reply