A while ago a local developer (hi Rick) asked about what I thought was the best way of ordering draw calls for sending to the graphics chip. As the topic came up again in a discussion at work recently, I thought I’d share my thoughts about what I think is the best approach these days.
“Best” is always a hard question to answer, because the best way (for whatever definition of best) on PS3 isn’t best on 360, best on NVIDIA isn’t best on ATI/AMD, best today isn’t best tomorrow and wasn’t best yesterday, etc. Because of the inherent problem of nailing down what’s best, I prefer a solution that’s easy to implement, allows easy experimentation with different ordering options, and that’s future proof. I’m willing to give up some (little) performance for this flexibility. My preferred solution to meet these goals, is simply to sort the draw calls using a standard sorting algorithm!
Sort-based draw call bucketing
How this works is that we, as we “draw” the scene, fill up an array of (key, value) pairs, where the key is an unsigned sort key that represents state/context, and the value is a pointer/offset to the actual draw call data. After we have processed all scene objects, we sort this array using a standard sorting algorithm (qsort, std::sort, something like that). After sorting we can now loop through the array and submit the draw calls for actual rendering, which are now sorted in the desired order.
Here, I’ll use a 64-bit key, but it could be bigger or smaller, depending on your needs. The 64-bit key is constructed in such a way that we dedicate some bits each to hold information such as:
- Fullscreen layer. Are we drawing to the game layer, a fullscreen effect layer, or the HUD?
- Viewport. There could be multiple viewports for e.g. multiplayer splitscreen, for mirrors or portals in the scene, etc.
- Viewport layer. Each viewport could have their own skybox layer, world layer, effects layer, HUD layer.
- Translucency. Typically we want to sort opaque geometry and normal, additive, and subtractive translucent geometry into separate groups.
- Depth sorting. We want to sort translucent geometry back-to-front for proper draw ordering and perhaps opaque geometry front-to-back to aid z-culling.
- Material. We want to sort by material to minimize state settings (textures, shaders, constants). A material might have multiple passes.
Within the key we dedicate a number of bits each to these different categories, making sure that categories we want to sort on first start at the MSB end of the key and go down towards the LSB. For the categories I just listed, you probably want to have them in pretty much the order I gave, which could result in a key looking like this:
Depending on what we want to achieve, sometimes we want to sort on depth before material, as is done with this key layout (for example when sorting translucent objects back-to-front). Other times we might want to sort on material before depth (typically for opaque geometry, as we want to reduce state settings as much as possible). For this reason I used an arrow in the figure to indicate that depth and material often switch place within the key. Note that as long as you are careful how you swap these two bitfields (depth and material ID), you can easily have both orderings be present within the same list of unsorted draw calls.
Rather than forming the key on every frame for every draw call, we can as an optimization cache the key with the draw calls of objects. Obviously the depth field (if present) must still be updated every frame, and the cached key must be invalidated if we were to ever swap materials on an object, or similar.
And, yes, the number of bits allocated above are a bit arbitrary, and for illustration purposes only. For example, you probably don’t need 24 bits for depth. Fortunately, the system makes it very easy for you to test out just how many bits you can decrease the depth bitfield down to before you start noticing a performance drop!
Adding commands
It is easy enough to extend this approach to allow for “commands” to be sorted along with your draw calls. By command I mean operations such as clearing of the depth buffer, switching render target, masking what components are being written to, and other types of state setting that we might want to perform between layers of draw calls.
We extend the system by adding a single bit between the translucency type and the depth field, to indicate if the bits below the command bit should be interpreted as a command instead of as additional bitfields. The command could be a direct command (an integer) or it could be a pointer to an area that contains more information that what can be stored in the limited integer bitfield. On adding this command bit, the sort key example would now look like this:
Also, when issuing commands we might sometimes want to have multiple commands issued and sorted together. This could be addressed by putting a “sequence number” at the top of the command bitfield, as illustrated in the figure above.
Commands can also be used to run special systems that you for one reason or another don’t want to embed inside this draw call array, such as perhaps some sort of vegetation/shrub renderer.
Other alternatives
The sort-based approach I described above is what we use on GoW3. A very similar system was used on Heavenly Sword as well (and likely lots of other titles too). For current platforms like the PS3, there really is no problem sorting 5,000 or even 10,000 draw calls each frame this way. (And, for the PS3, even if it was, sorting an array is an example of a task that the SPUs truly excel at.)
In the past, however, we had slightly different criteria and used a different solution. On the PS2 we were not willing to sacrifice speed for flexibility (and the needs for flexibility were lesser too) so we simply had predefined buckets for all the different combinations of layers, viewports, translucency, etc. and just inserted draw calls directly into the appropriate bin. On drawing, we would just process each bucket in whichever order we wanted to draw them, no sorting necessary! Obviously, this system is much faster with no sorting required, but it is also much less flexible, and doesn’t deal well with e.g. including depth sorting as part of the bucketing. (A very similar bucketing system was used on the PS1 as well.)
This is very similar to the method used by OpenSceneGraph, except they use a hybrid of your PS2 and PS3 methods. They draw into multiple buckets numbered by render pass. Then if a bucket has translucency it’s sorted by depth, otherwise its sorted by state.
This seems pretty reasonable, but it does mean you traverse the whole database and sort before rendering anything. So a PC & GPU combination would have the GPU starved early on, the PC idle later, unless I’m missing some trick. I can imagine identifying the “lowest” sort key as “if you see this one, just render it now”, e.g. opaque objects in the main scene. In this way you could keep the GPU busy early on. Any comments?
Eric, it is (in principle) correct that we have to traverse + sort everything before we can submit things for actual drawing, but this does not mean the GPU is not doing anything. Typically, games use what I call “delayed rendering” as opposed to “immediate rendering” (see my post on input latency for a full exposition of what I mean with these).
In delayed rendering the “rendering” that follows this traverse + sort step still doesn’t actually render anything; it just builds a command buffer that we later submit to the GPU (often, but not necessarily, at vertical sync). While all of this is taking place on the CPU side, the GPU is busy rendering the command buffer that was prepared on the previous game frame, and was kicked to the GPU at the start of this frame. (If this explanation is a mouthful, refer to the first figure in the input latency article, which should help in explaining how delayed rendering works.)
Now, if we were to use immediate rendering — which is what I think you were thinking of when you wrote your comment — then it is true that the traverse + sort step does delay when you can first submit a drawcall to the GPU. Even so, there are (system-dependent) things we can break out and handle as a “special case” before everything else is dealt with as per above. One example could be the rendering and accumulation of all shadow maps into a screen-space shadow accumulation buffer.
I hope that made sense.
We use something similar for our tools rendering; calculating a key also enables us to detect which objects can be batched together and drawn using mesh instanciation.
As you know that’s also the way Lost Planet renders things: http://meshula.net/wordpress/?p=112
Hi,
I was just wondering how you would imagining something like Hardware Instancing in this system, since you are not dealing with depth anymore at that point… (well you are, but instances are not per definition in the same depth area)
nice article btw, looking into it. Looking at implementing it for XNA!
Oh, and second question, how do you deal with shader parameter changes? For example, if you have a shared view / projection matrix and you want to render shadow maps (different view / project) or some security camera…
how do you trackback the material parameters?
slaaitjuh, hardware instancing (or any other case where you need to effectively invalidate depth sorting, when depth is part of the sort key) is handled by assigning those draw calls a “unique depth”. Whether this is actually a unique depth in the depth field or done by inserting a separate bit above the depth field to turn the depth field into a “unique identifier” field, or some such, that’s up to you.
For your second question, recall I said “array of (key, value) pairs”. The value is a pointer to your draw call or similar where you have the material/state information available. Shader parameter changes you keep track of by doing a “diff” of the material/state information as you go from one array entry to the next.
Hi,
You said that on a PS2 you did no sorting at all while on the PS3 you did full sorting. Now I was wondering, if instead of re-inserting each render call each frame I only inserting them when they become visible and removing them when they become invisible. Then, each frame, the render queue will be pretty much the same as last frame. So instead of a full quick sort, I may use a very simple sort and only perform a few iterations to keep the queue more or less sorted.
Of course this is harder as you have to keep track of all render calls over multiple frames. Do you think this gives a significant saving on a current PC?
Dietepiet, read again; we did sort on the PS2, only we used bucket sorting. It is within the buckets we didn’t sort on the PS2. Not sorting within buckets was not a big deal on the PS2.
I doubt very much that there are any benefits to do incremental sorting. First, the code would become much more complex, which makes it hard to maintain and increases the likelihood for bugs. Second, the sorting just isn’t a bottleneck so why spend time on a more complicated solution if you don’t have to.
I really like the idea. From what I understand, it allows to completely deffers the rendering at the end of the frame. For instance, iterating over all the passes but no directly make draw calls but only create render and state commands, and store them into a render queue. Then, when the iteration of the render passes is over, we can sort the commands of the render queue and call each commands. I’ve thought of adding a render pass parameter in the key, so that inter-dependance between passes is also handled. And when all the commands are iterated, we can now prepare the next frame by generating new commands. Sounds good.
I’m gonna take an example, please correct me if I’m wrong. You want to generate one shadow map per each light. You could create some commands that update the current light matrix, then render commands for the whole geometry, then another command that would update the light matrix,and again render commands… (but a lot of render commands would be duplicated :(.
However I have many problems still to solve (how to correctly update uniform variables…) but does it sound like a good idea ? Or generating many render and state commands like this sound too complicated ?
If I understood: all opaque objects are rendered at first, then all(!) objects with normal translucency (sorted back-to-front) and then all objects with add translucency (back-to-front too), because translucency is at more significant bits then depth? Is blending still correct in final composition?
“Typically we want to sort opaque geometry and normal, additive, and subtractive translucent geometry into separate groups.” can be parsed as either
{ opaque } and { normal, additive, and subtractive translucent }
or as
{ opaque }, { normal translucent }, { additive translucent }, and { subtractive translucent }
to be the separate groups. These two alternatives will give different results, obviously.
Visually, the former is seen as correct (as long as you sort the translucent group correctly), but for games in particular we have traditionally compromised on correctness and done the latter, to cut down on state changes (to make rendering faster).
The example above though is just that though: an example. You can adjust the mapping such that you have depth be more significant than translucency if that works better for your application. You can even split it, so there’s some depth bits both above and below the translucency type. And that’s the beauty of sort-based draw call ordering!
Hi Christer, first of all, great book, great blog and great games, nice job! You have to write a book about game engines, I would be first to buy .
Now about this key,value scheme that you suggested here. I wanted to know why are you adding the command id itself to the key (I’m referring to the second figure), doesn’t it make more sense to have the command id on the value part, perhaps among with some parameters for the command? We don’t need to sort the commands based on their memory address (in case of a function ptr) right?
CodeLord, I described commands in the way I did for two primary reasons.
First, I wanted to indicate that several fields of the key are irrelevant for a command key and can be reused for relevant data. This is nice because if all data you need fits within the key, you don’t need to even fetch the value field. Of course, you can use the value field to store data too like you outlined. Whichever works best for you.
Second, I wanted to introduce the concept of one (or more) “sequence ID” field within the key for a command, which would allow several commands to be sorted together (which is actually a likely need). This second reason is why you’d want to put these bits in the key instead of the value field.
Thanks for this post! It helped me get a good idea of where to begin with my render queue (i’m currently not sorting at all, just rendering).
One thing i’m wondering though, at the lower level, do you optimize anything with your VBO’s? For example, if you have a few small models (chairs, tables, whatever), do you have an equal amount of VBO’s or do you combine them into a larger one somewhere and then just render parts? And if so, at what point do you do that? Or do you just have a whole bunch of smaller vbo’s, accepting the statechanges for those?
One idea i had was that whenever i load the world, it will determine the primitive count (triangles usually) of all the static objects and spread them over various VBO’s (i have yet to find the “magic” number of vbo size). When the vbo is “full” it will add the next model to a new vbo and so on. What is your view on this idea?
Another thing i was wondering; in most modern games you’d rarely see the same texture used twice, as about 90% of the rendered scene would be unique for materials (except maybe for the used shader). Would you then choose to only depth-sort and completely ignore materials if thats the case? In a typical scene i’ve seen theres barely any texture used twice, and if it is it’d be something used a lot, which would then be instanced anyway
Robin, things differ greatly between PCs (which you’re asking about) and consoles (which I work on). On PCs you have driver overhead, on consoles we don’t. On PS3 we pretty much don’t worry about small draw calls at all. So the VBO issues you ask about are pretty alien to me, so I cannot really provide reliable recommendations.
For the material and texture sorting question, the same applies. On PC you’ll have the driver doing things behind your back somewhat, whereas we explicitly upload textures, and vertex and fragment shaders. We also explicitly handle patching of parameters to the shaders etc. so we know exactly in what order things will be presented to the hardware and we can therefore sort things to account for that. There’s no uploading of textures or shaders “behind our back.”
Perhaps someone more familiar with the atrocities performed by PC drivers can comment directly on your questions!
Hi Christer, let me tell you I learnt a lot from your book, and your non-conformist blog is really cool too!
This might be a naive question as I’m not graphics programmer, I just dabble into it once in a while. Anyway, would you see a way how to handle scoped commands within your system? By “scoped” I mean things like glPushMatrix()/glPopMatrix() or glEnable(GL_BLEND)/glDisable(GL_BLEND) command pairs, anything that defines a scope of sorts, sets a state for the scope that anything rendered within the scope is affected by.
If I used the extension for commands you describe I’d still have to make sure that the draw calls generated within a scope (and no others) are still between the scope delimiting commands after sorting. That could be done by introducing the “scope” into the sorting key somehow, probably close to its MSB. However, some possible use cases I have in my head seem to indicate there could be clashes trying to determine the importance of the categories.
Hi Christer,
I have a question regarding the depth key.
I have all the depth values in floats, and I want to use those values in the key.
What is the best way to encode floats into ‘bits’ (or integer) so that I can use it as part of the key ?
Thanks!
As you mention a key/value system I assume you have to use map or unordered_map but is the former or the latter better? map is already sorted upon insertion whereas the unordered map would still need to be sorted. I’m not even sure it’s straightforward to do with algorithm’s sort(it.begin(),it.end()); or something like this. Another caveat is that internally it uses a hashed key…
Frederic, your question is addressed in the third paragraph. I’m suggesting filling in an array, with each new entry, and just using any standard sorting algorithm to sort the array when it is filled. The cost of the sort is negligible as it takes place just once a frame.
ricardoquesada, you can e.g. normalize the floats so they are in the same exponent range with the same sign, at which point only the fraction bits of the floats need to be considered (of which there are 23 bits plus one implicit bit which is always 1). Pick the top N bits of the fraction bits, and store those in your key.
We use something similar for our tools rendering; calculating a key additionally allows us to locate which items may be batched collectively and drawn the use of mesh instanciation.