Order your graphics draw calls around!
A while ago a local developer (hi Rick) asked about what I thought was the best way of ordering draw calls for sending to the graphics chip. As the topic came up again in a discussion at work recently, I thought I’d share my thoughts about what I think is the best approach these days.
“Best” is always a hard question to answer, because the best way (for whatever definition of best) on PS3 isn’t best on 360, best on NVIDIA isn’t best on ATI/AMD, best today isn’t best tomorrow and wasn’t best yesterday, etc. Because of the inherent problem of nailing down what’s best, I prefer a solution that’s easy to implement, allows easy experimentation with different ordering options, and that’s future proof. I’m willing to give up some (little) performance for this flexibility. My preferred solution to meet these goals, is simply to sort the draw calls using a standard sorting algorithm!
Sort-based draw call bucketing
How this works is that we, as we “draw” the scene, fill up an array of (key, value) pairs, where the key is an unsigned sort key that represents state/context, and the value is a pointer/offset to the actual draw call data. After we have processed all scene objects, we sort this array using a standard sorting algorithm (qsort, std::sort, something like that). After sorting we can now loop through the array and submit the draw calls for actual rendering, which are now sorted in the desired order.
Here, I’ll use a 64-bit key, but it could be bigger or smaller, depending on your needs. The 64-bit key is constructed in such a way that we dedicate some bits each to hold information such as:
- Fullscreen layer. Are we drawing to the game layer, a fullscreen effect layer, or the HUD?
- Viewport. There could be multiple viewports for e.g. multiplayer splitscreen, for mirrors or portals in the scene, etc.
- Viewport layer. Each viewport could have their own skybox layer, world layer, effects layer, HUD layer.
- Translucency. Typically we want to sort opaque geometry and normal, additive, and subtractive translucent geometry into separate groups.
- Depth sorting. We want to sort translucent geometry back-to-front for proper draw ordering and perhaps opaque geometry front-to-back to aid z-culling.
- Material. We want to sort by material to minimize state settings (textures, shaders, constants). A material might have multiple passes.
Within the key we dedicate a number of bits each to these different categories, making sure that categories we want to sort on first start at the MSB end of the key and go down towards the LSB. For the categories I just listed, you probably want to have them in pretty much the order I gave, which could result in a key looking like this:

Depending on what we want to achieve, sometimes we want to sort on depth before material, as is done with this key layout (for example when sorting translucent objects back-to-front). Other times we might want to sort on material before depth (typically for opaque geometry, as we want to reduce state settings as much as possible). For this reason I used an arrow in the figure to indicate that depth and material often switch place within the key. Note that as long as you are careful how you swap these two bitfields (depth and material ID), you can easily have both orderings be present within the same list of unsorted draw calls.
Rather than forming the key on every frame for every draw call, we can as an optimization cache the key with the draw calls of objects. Obviously the depth field (if present) must still be updated every frame, and the cached key must be invalidated if we were to ever swap materials on an object, or similar.
And, yes, the number of bits allocated above are a bit arbitrary, and for illustration purposes only. For example, you probably don’t need 24 bits for depth. Fortunately, the system makes it very easy for you to test out just how many bits you can decrease the depth bitfield down to before you start noticing a performance drop!
Adding commands
It is easy enough to extend this approach to allow for “commands” to be sorted along with your draw calls. By command I mean operations such as clearing of the depth buffer, switching render target, masking what components are being written to, and other types of state setting that we might want to perform between layers of draw calls.
We extend the system by adding a single bit between the translucency type and the depth field, to indicate if the bits below the command bit should be interpreted as a command instead of as additional bitfields. The command could be a direct command (an integer) or it could be a pointer to an area that contains more information that what can be stored in the limited integer bitfield. On adding this command bit, the sort key example would now look like this:

Also, when issuing commands we might sometimes want to have multiple commands issued and sorted together. This could be addressed by putting a “sequence number” at the top of the command bitfield, as illustrated in the figure above.
Commands can also be used to run special systems that you for one reason or another don’t want to embed inside this draw call array, such as perhaps some sort of vegetation/shrub renderer.
Other alternatives
The sort-based approach I described above is what we use on GoW3. A very similar system was used on Heavenly Sword as well (and likely lots of other titles too). For current platforms like the PS3, there really is no problem sorting 5,000 or even 10,000 draw calls each frame this way. (And, for the PS3, even if it was, sorting an array is an example of a task that the SPUs truly excel at.)
In the past, however, we had slightly different criteria and used a different solution. On the PS2 we were not willing to sacrifice speed for flexibility (and the needs for flexibility were lesser too) so we simply had predefined buckets for all the different combinations of layers, viewports, translucency, etc. and just inserted draw calls directly into the appropriate bin. On drawing, we would just process each bucket in whichever order we wanted to draw them, no sorting necessary! Obviously, this system is much faster with no sorting required, but it is also much less flexible, and doesn’t deal well with e.g. including depth sorting as part of the bucketing. (A very similar bucketing system was used on the PS1 as well.)










Groovin said,
October 3, 2008 @ 7:06 am
This is very similar to the method used by OpenSceneGraph, except they use a hybrid of your PS2 and PS3 methods. They draw into multiple buckets numbered by render pass. Then if a bucket has translucency it’s sorted by depth, otherwise its sorted by state.
erich666 said,
October 30, 2008 @ 1:16 pm
This seems pretty reasonable, but it does mean you traverse the whole database and sort before rendering anything. So a PC & GPU combination would have the GPU starved early on, the PC idle later, unless I’m missing some trick. I can imagine identifying the “lowest” sort key as “if you see this one, just render it now”, e.g. opaque objects in the main scene. In this way you could keep the GPU busy early on. Any comments?
christer said,
October 30, 2008 @ 10:07 pm
Eric, it is (in principle) correct that we have to traverse + sort everything before we can submit things for actual drawing, but this does not mean the GPU is not doing anything. Typically, games use what I call “delayed rendering” as opposed to “immediate rendering” (see my post on input latency for a full exposition of what I mean with these).
In delayed rendering the “rendering” that follows this traverse + sort step still doesn’t actually render anything; it just builds a command buffer that we later submit to the GPU (often, but not necessarily, at vertical sync). While all of this is taking place on the CPU side, the GPU is busy rendering the command buffer that was prepared on the previous game frame, and was kicked to the GPU at the start of this frame. (If this explanation is a mouthful, refer to the first figure in the input latency article, which should help in explaining how delayed rendering works.)
Now, if we were to use immediate rendering — which is what I think you were thinking of when you wrote your comment — then it is true that the traverse + sort step does delay when you can first submit a drawcall to the GPU. Even so, there are (system-dependent) things we can break out and handle as a “special case” before everything else is dealt with as per above. One example could be the rendering and accumulation of all shadow maps into a screen-space shadow accumulation buffer.
I hope that made sense.
fenghus said,
November 3, 2008 @ 8:54 am
We use something similar for our tools rendering; calculating a key also enables us to detect which objects can be batched together and drawn using mesh instanciation.
Real-Time Rendering » Blog Archive » This and That said,
November 20, 2008 @ 8:52 pm
[…] Christer Ericson’s article on dealing with grouping and sorting objects for rendering is excellent. It mostly depends on input latency, but has concepts that can be applied in immediate mode. […]
kenpex said,
December 5, 2008 @ 12:29 am
As you know that’s also the way Lost Planet renders things: http://meshula.net/wordpress/?p=112
slaaitjuh said,
June 15, 2009 @ 8:10 am
Hi,
I was just wondering how you would imagining something like Hardware Instancing in this system, since you are not dealing with depth anymore at that point… (well you are, but instances are not per definition in the same depth area)
nice article btw, looking into it. Looking at implementing it for XNA!
slaaitjuh said,
June 15, 2009 @ 9:29 am
Oh, and second question, how do you deal with shader parameter changes? For example, if you have a shared view / projection matrix and you want to render shadow maps (different view / project) or some security camera…
how do you trackback the material parameters?
christer said,
June 15, 2009 @ 9:04 pm
slaaitjuh, hardware instancing (or any other case where you need to effectively invalidate depth sorting, when depth is part of the sort key) is handled by assigning those draw calls a “unique depth”. Whether this is actually a unique depth in the depth field or done by inserting a separate bit above the depth field to turn the depth field into a “unique identifier” field, or some such, that’s up to you.
For your second question, recall I said “array of (key, value) pairs”. The value is a pointer to your draw call or similar where you have the material/state information available. Shader parameter changes you keep track of by doing a “diff” of the material/state information as you go from one array entry to the next.
Dietepiet said,
July 18, 2009 @ 1:58 am
Hi,
You said that on a PS2 you did no sorting at all while on the PS3 you did full sorting. Now I was wondering, if instead of re-inserting each render call each frame I only inserting them when they become visible and removing them when they become invisible. Then, each frame, the render queue will be pretty much the same as last frame. So instead of a full quick sort, I may use a very simple sort and only perform a few iterations to keep the queue more or less sorted.
Of course this is harder as you have to keep track of all render calls over multiple frames. Do you think this gives a significant saving on a current PC?
christer said,
July 18, 2009 @ 11:42 am
Dietepiet, read again; we did sort on the PS2, only we used bucket sorting. It is within the buckets we didn’t sort on the PS2. Not sorting within buckets was not a big deal on the PS2.
I doubt very much that there are any benefits to do incremental sorting. First, the code would become much more complex, which makes it hard to maintain and increases the likelihood for bugs. Second, the sorting just isn’t a bottleneck so why spend time on a more complicated solution if you don’t have to.
Threaded Renderer « luminance said,
August 25, 2009 @ 12:54 am
[…] basic approach is primarily inspired by Christer Ericson’s post about the approach he uses for ordering draw calls. Some other useful sources of information were Tom Forsyth’s post about the cost of […]
bakura said,
September 27, 2009 @ 7:59 am
I really like the idea. From what I understand, it allows to completely deffers the rendering at the end of the frame. For instance, iterating over all the passes but no directly make draw calls but only create render and state commands, and store them into a render queue. Then, when the iteration of the render passes is over, we can sort the commands of the render queue and call each commands. I’ve thought of adding a render pass parameter in the key, so that inter-dependance between passes is also handled. And when all the commands are iterated, we can now prepare the next frame by generating new commands. Sounds good.
I’m gonna take an example, please correct me if I’m wrong. You want to generate one shadow map per each light. You could create some commands that update the current light matrix, then render commands for the whole geometry, then another command that would update the light matrix,and again render commands… (but a lot of render commands would be duplicated :(.
However I have many problems still to solve (how to correctly update uniform variables…) but does it sound like a good idea ? Or generating many render and state commands like this sound too complicated ?