A while ago a local developer (hi Rick) asked about what I thought was the best way of ordering draw calls for sending to the graphics chip. As the topic came up again in a discussion at work recently, I thought I’d share my thoughts about what I think is the best approach these days.
“Best” is always a hard question to answer, because the best way (for whatever definition of best) on PS3 isn’t best on 360, best on NVIDIA isn’t best on ATI/AMD, best today isn’t best tomorrow and wasn’t best yesterday, etc. Because of the inherent problem of nailing down what’s best, I prefer a solution that’s easy to implement, allows easy experimentation with different ordering options, and that’s future proof. I’m willing to give up some (little) performance for this flexibility. My preferred solution to meet these goals, is simply to sort the draw calls using a standard sorting algorithm!
Sort-based draw call bucketing
How this works is that we, as we “draw” the scene, fill up an array of (key, value) pairs, where the key is an unsigned sort key that represents state/context, and the value is a pointer/offset to the actual draw call data. After we have processed all scene objects, we sort this array using a standard sorting algorithm (qsort, std::sort, something like that). After sorting we can now loop through the array and submit the draw calls for actual rendering, which are now sorted in the desired order.
Here, I’ll use a 64-bit key, but it could be bigger or smaller, depending on your needs. The 64-bit key is constructed in such a way that we dedicate some bits each to hold information such as:
- Fullscreen layer. Are we drawing to the game layer, a fullscreen effect layer, or the HUD?
- Viewport. There could be multiple viewports for e.g. multiplayer splitscreen, for mirrors or portals in the scene, etc.
- Viewport layer. Each viewport could have their own skybox layer, world layer, effects layer, HUD layer.
- Translucency. Typically we want to sort opaque geometry and normal, additive, and subtractive translucent geometry into separate groups.
- Depth sorting. We want to sort translucent geometry back-to-front for proper draw ordering and perhaps opaque geometry front-to-back to aid z-culling.
- Material. We want to sort by material to minimize state settings (textures, shaders, constants). A material might have multiple passes.
Within the key we dedicate a number of bits each to these different categories, making sure that categories we want to sort on first start at the MSB end of the key and go down towards the LSB. For the categories I just listed, you probably want to have them in pretty much the order I gave, which could result in a key looking like this:
Depending on what we want to achieve, sometimes we want to sort on depth before material, as is done with this key layout (for example when sorting translucent objects back-to-front). Other times we might want to sort on material before depth (typically for opaque geometry, as we want to reduce state settings as much as possible). For this reason I used an arrow in the figure to indicate that depth and material often switch place within the key. Note that as long as you are careful how you swap these two bitfields (depth and material ID), you can easily have both orderings be present within the same list of unsorted draw calls.
Rather than forming the key on every frame for every draw call, we can as an optimization cache the key with the draw calls of objects. Obviously the depth field (if present) must still be updated every frame, and the cached key must be invalidated if we were to ever swap materials on an object, or similar.
And, yes, the number of bits allocated above are a bit arbitrary, and for illustration purposes only. For example, you probably don’t need 24 bits for depth. Fortunately, the system makes it very easy for you to test out just how many bits you can decrease the depth bitfield down to before you start noticing a performance drop!
It is easy enough to extend this approach to allow for “commands” to be sorted along with your draw calls. By command I mean operations such as clearing of the depth buffer, switching render target, masking what components are being written to, and other types of state setting that we might want to perform between layers of draw calls.
We extend the system by adding a single bit between the translucency type and the depth field, to indicate if the bits below the command bit should be interpreted as a command instead of as additional bitfields. The command could be a direct command (an integer) or it could be a pointer to an area that contains more information that what can be stored in the limited integer bitfield. On adding this command bit, the sort key example would now look like this:
Also, when issuing commands we might sometimes want to have multiple commands issued and sorted together. This could be addressed by putting a “sequence number” at the top of the command bitfield, as illustrated in the figure above.
Commands can also be used to run special systems that you for one reason or another don’t want to embed inside this draw call array, such as perhaps some sort of vegetation/shrub renderer.
The sort-based approach I described above is what we use on GoW3. A very similar system was used on Heavenly Sword as well (and likely lots of other titles too). For current platforms like the PS3, there really is no problem sorting 5,000 or even 10,000 draw calls each frame this way. (And, for the PS3, even if it was, sorting an array is an example of a task that the SPUs truly excel at.)
In the past, however, we had slightly different criteria and used a different solution. On the PS2 we were not willing to sacrifice speed for flexibility (and the needs for flexibility were lesser too) so we simply had predefined buckets for all the different combinations of layers, viewports, translucency, etc. and just inserted draw calls directly into the appropriate bin. On drawing, we would just process each bucket in whichever order we wanted to draw them, no sorting necessary! Obviously, this system is much faster with no sorting required, but it is also much less flexible, and doesn’t deal well with e.g. including depth sorting as part of the bucketing. (A very similar bucketing system was used on the PS1 as well.)