Optimizing the rendering of a particle system

There are many things that can kill the frame rate in a modern game, and particles are up near the top of the list of causes. A key contributing factor is that particles are subject to a lot of overdraw that is not present in your opaque geometry.

The reason for the increase in overdraw is that for particles we tend to have lots of individual primitives (usually quads) that are overlapped, perhaps to mimic effects like fire or smoke. Normally each particle primitive is translucent (alpha-blended) so the z-buffer is not updated as pixels are written and we end up rendering to pixels multiple times. (In contrast, for opaque geometry we do write to the z-buffer, so between a possible z-prepass, sorting objects front-to-back, hierarchical z-culling on the GPU, and normal depth testing, the result is that we have very little overdraw.)

Overdraw, in turn, leads to increased use of both fillrate (how many pixels the hardware can render per second) and bandwidth (how much data we can transfer to/from the GPU per second), both of which can be scarce resources.

Of course, overdraw is not the only reason particles can slow your frame rate to a crawl. We can also get bitten by other problems, like setting too much state for each particle or particle system.

What to do?

OK, let’s say we all agree that particles can cause a lot of problems. What to do? Fortunately there are lots of things that can be done to optimize the rendering side of a particle system. For whatever reason, few of these are discussed in books or articles, so I thought I’d list out a few things we can do. Feel free to add more suggestions as comments.

  • Use opaque particles. For example, make smoke effects really thick so (some or all of) the particle billboards can be opaque, with cutout alpha (see the alpha-test sketch after this list). For some particles, like shrapnel, rocks, or similar, use lightweight geometry particles instead of sprites with alpha borders.
  • Use richer particles. Put more oomph in a single particle sprite so we need fewer of them. Use flipbook textures for creating billowing in e.g. fire and smoke, rather than stacking sprites.
  • Reduce dead space around cutout-alpha particles. Use texkill to avoid processing the transparent regions of the sprite. Alternatively, trim away the transparent areas around the particle by using an n-gon fan instead of just a quad sprite (but beware lowered quad (2×2 pixel block) utilization as the triangle count increases, and the risk of becoming vertex bound for distant particles, so LOD from the fan back down to a quad sprite in the distance).
  • Cap total amount of particles. Use hardware counters on the graphics card to count how many particle pixels have been rendered and stop emitting or drawing particles when passing a certain limit (which can be set dynamically); see the occlusion-query sketch after this list.
  • Use frequency divider to reduce data duplication. We can reduce bandwidth and memory requirements by sharing data across particle vertices using the frequency divider (instancing), instead of duplicating data across vertices. (Arseny Kapoulkine described this well in Particle rendering revisited; a setup sketch follows after this list.)
  • Reduce state changes. Share shaders between particles. We can make this happen by e.g. dropping features for distant particles (such as dropping the normal map as soon as possible).
  • Reduce the need for sorting. Rely on additively or subtractively blended particles where possible. E.g. additive particles can be drawn in any order, so we can sort them to e.g. reduce state changes instead of sorting on depth (as is likely needed for normal alpha-blended particles).
  • Draw particles after AA-resolve. Most games today use multisampled antialiasing (MSAA), drawing to a 2x or 4x MSAA buffer. These buffers must be resolved (with an AA-resolve pass) into a non-MSAA buffer before display. Due to the way MSAA works, the pixel shader runs the same number of times whether we draw particles before or after the AA-resolve, but by drawing particles after the AA-resolve we drastically reduce frame buffer reads and writes, ROP costs, etc.
  • Draw particles into a smaller-resolution buffer. We can also draw particles into a separate, smaller-resolution buffer (smaller than the frame buffer or the AA-resolved buffer). The exact details vary depending on e.g. whether you use RGBA or FP16 frame buffers, but the basic idea is the following. First we draw the opaque geometry into our standard frame buffer. We then shrink the corresponding z-buffer down to 1/4 or 1/16 size, then draw particles to a 1/4- or 1/16-size frame buffer using the smaller z-buffer. After we’re done, we scale the smaller frame buffer back up and composite it onto the original frame buffer. (Some interesting details I’ve left out are how exactly to perform the z-buffer shrinking and the composite steps.)
  • Use MSAA trick to run pixel shader less. On consoles (and on PC hardware, if drivers allowed it) you can tell the GPU to treat, say, an 800×600 frame buffer as if it were a 400×300 (ordered) 4xMSAA buffer. Due to the way MSAA works, this has the effect of running the pixel shader only once per 2×2 pixels of the original buffer, at the cost of blurring your particles correspondingly. (Though you still get the benefit of antialiasing at the edges of the particles.)
  • Generate particles “on chip.” On newer (PC) hardware we can use geometry shaders to generate particles on the GPU instead of sending the vertex data from the CPU. This saves memory and bandwidth.
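
For the opaque, cutout-alpha case in the first bullet, here is a minimal sketch of what that can look like on a D3D9-class API (an assumption on my part; the same idea applies on other APIs): enable alpha test so fully transparent texels are rejected, and keep z-writes on so the particles behave like opaque geometry.

    // Hedged sketch: D3D9-style render states for opaque cutout-alpha particles.
    // With z-writes on, these particles occlude whatever is drawn behind them.
    device->SetRenderState(D3DRS_ALPHABLENDENABLE, FALSE);        // no blending; treat as opaque
    device->SetRenderState(D3DRS_ALPHATESTENABLE, TRUE);          // reject transparent texels
    device->SetRenderState(D3DRS_ALPHAREF, 0x80);                 // cutout threshold (tune per effect)
    device->SetRenderState(D3DRS_ALPHAFUNC, D3DCMP_GREATEREQUAL);
    device->SetRenderState(D3DRS_ZWRITEENABLE, TRUE);             // update the z-buffer like opaque geometry
    // ...draw the thick smoke / shrapnel / rock particles here...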
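
For the particle cap, one concrete way to read back a “hardware counter” is an occlusion query, which counts pixels that pass the depth test. A rough D3D9-style sketch (DrawAllParticleSystems, ReduceParticleEmission, and kPixelBudget are made-up names for whatever your engine provides):

    // Hedged sketch: count particle pixels with an occlusion query and throttle emission.
    IDirect3DQuery9* query = NULL;
    device->CreateQuery(D3DQUERYTYPE_OCCLUSION, &query);

    query->Issue(D3DISSUE_BEGIN);
    DrawAllParticleSystems();            // hypothetical engine call
    query->Issue(D3DISSUE_END);

    // Read the result a frame or two later to avoid stalling the GPU.
    DWORD pixelsDrawn = 0;
    if (query->GetData(&pixelsDrawn, sizeof(pixelsDrawn), 0) == S_OK &&
        pixelsDrawn > kPixelBudget)      // the budget can be set dynamically
        ReduceParticleEmission();        // hypothetical engine call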
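
And for the frequency divider, the setup is roughly the following on D3D9 (again just a sketch, with hypothetical buffer and struct names): the quad geometry lives once in stream 0, the per-particle data lives once per particle in stream 1, and the divider replays the quad for every particle so nothing has to be duplicated four times.

    // Hedged sketch: hardware instancing via SetStreamSourceFreq.
    // quadVB/quadIB hold one unit quad (4 vertices, 6 indices); particleVB holds one
    // vertex per particle (center, size, rotation, color). The vertex shader expands
    // the quad corner using the per-particle data.
    device->SetStreamSource(0, quadVB, 0, sizeof(CornerVertex));
    device->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | numParticles);
    device->SetStreamSource(1, particleVB, 0, sizeof(ParticleVertex));
    device->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1u);
    device->SetIndices(quadIB);
    device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, 4, 0, 2);
    // Reset the divider so later draws are unaffected.
    device->SetStreamSourceFreq(0, 1);
    device->SetStreamSourceFreq(1, 1);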

We can also attempt some more esoteric stuff, like:

  • Compose particles front-to-back premultiplied-alpha style. Using premultiplied alpha (which is associative) we can blend particles front-to-back instead of the normal back-to-front ordering. The idea here is to use the front-to-back drawing to fill in the depth or stencil buffer once alpha has become (near) solid and ultimately stop drawing particles altogether (when they no longer contribute much to the visual scene, off in the distance). A blend-state sketch follows after this list.
  • Group particles together into one particle entity. Instead of drawing two overlapping particles individually, we can form a single (larger) particle that encompasses the two particles and performs the blending of the two particles directly in the shader. This tends to reduce the number of frame buffer reads we do (as we now only have to blend one particle) but it can also increase it (if the single particle covers much more area than the union of the two original particles).
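
As promised above, here is a blend-state sketch for the front-to-back composition (D3D9-style, assuming an off-screen buffer with a destination-alpha channel, cleared to RGBA = 0): each new particle is scaled by one minus the coverage already accumulated, which is the “under” operator.

    // Hedged sketch: front-to-back ("under") compositing of premultiplied-alpha particles.
    // new.rgb = dst.rgb + (1 - dst.a) * src.rgb;  new.a = dst.a + (1 - dst.a) * src.a
    device->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
    device->SetRenderState(D3DRS_SRCBLEND,  D3DBLEND_INVDESTALPHA);
    device->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_ONE);
    device->SetRenderState(D3DRS_SEPARATEALPHABLENDENABLE, TRUE);
    device->SetRenderState(D3DRS_SRCBLENDALPHA,  D3DBLEND_INVDESTALPHA);
    device->SetRenderState(D3DRS_DESTBLENDALPHA, D3DBLEND_ONE);
    // Draw particles sorted front-to-back. Once the accumulated alpha is ~1, further
    // particles contribute next to nothing, which is what lets us mask them out via
    // depth/stencil or stop drawing altogether.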

My original intent was to categorize all these items in terms of what they save (overdraw, fillrate, ROP, pixel-shader ALU, etc.) but creating the 2D matrix of feature vs. savings was a little more effort than I had time for. Hopefully it’s still clear what each bulleted item would achieve.

All in all, the above’s a long list, but I’m sure I left something out, so please comment. Also, what’s your favorite “trick?”

23 thoughts on “Optimizing the rendering of a particle system”

  1. Great post Christer.

    Something similar to particle grouping can be done with particles that are meant to represent a participating medium. A single primitive (a quad, a disc, etc.) might be used as a convex hull in conjunction with a more sophisticated pixel shader which computes the amount of light absorbed (and emitted/scattered) inside the hull volume.

    I guess many games use a similar approach for god rays, columns of smoke, etc.

    Marco

  2. Keep interpolator overhead low – for simple effects like smoke, you can easily become interpolator-bound. This can be prevalent in engines that use shader generators (e.g. Unreal). They can often send data over from the vertex shader that isn’t actually needed in simple particle effects, in order to support their wide range of artist-exposed features. It is pretty trivial to modify the shader generator to track which interpolators are needed and adjust the code generation accordingly. (I take no side in the shader generator good/bad debate.)

  3. > (Though you still get the benefit of antialiasing at the edges of the particles.)

    I know that Lost Planet does that, but what’s the benefit of antialiasing when most particles are semi-transparent and use texkill/alpha testing? It seems to me that the main advantage of this method is that it lets you use the same z-buffer while rendering to a lower-resolution render target.

    Using texture atlases also helps reduce the number of draw calls, which is still an important factor on PC. This can become very significant if you sort particles individually and have multiple particle types in the same effect. On modern hardware you can also use texture arrays, and I’ve seen people use cubemaps to simulate 6-element arrays.

  4. > Generate particles “on chip.”

    Render to vertex buffer (R2VB) on pre-DX10 GPUs is an option as well (use in combination with the frequency divider). On DX10 and beyond, one can use vertexID and vertex texture fetch to generate the entire particle system without making use of the slower geometry shader.

    > Use richer particles.

    In combination with Marco’s comment, if depth fetch is available, spherical billboards (also referred to as soft particles) can both reduce the number of particles/overdraw required for a given effect and solve the billboard/opaque geometry intersection problem. Making use of a downsized depth buffer (if available) could improve texture cache performance and reduce the extra cost of sampling depth.

    > Use opaque particles.

    Mega particles are another option. Draw opaque particles (a low-triangle-count object per particle) front-to-back into a secondary buffer (perhaps even a reduced-resolution buffer). Optionally write out particle material properties. Optionally do deferred lighting on this opaque particle buffer for lit particles. Then use one or more image-space passes to both add high-frequency detail and diffuse the hard opaque particle outlines (material properties could be used to adjust the effect). Finally merge this buffer back with the opaque geometry, using depth from both the mega-particle buffer and the opaque buffer to generate a smooth blend.

    > more esoteric stuff,

    Could do something similar to Stam’s Stable Fluids, but in 2.5D, to run a fluid simulation as an image-space process. Could seed the CFD system from data drawn by opaque particles into a lower-resolution buffer…

  5. I was curious about drawing particles front-to-back with pre-multiplied alpha. How would you go about doing this? Would it involve ping-ponging between two buffers? Pseudo-code would be helpful, if possible. Thanks. :)

  6. Ben, Marco, Vince, Ignacio, and Tim: thanks for your comments, all solid suggestions!

    n00body, there are probably a few variants, but the most straightforward one involves drawing your opaque geometry into frame buffer (FB) 1, using depth buffer (DB) 1. After that, you render particles (and other translucent stuff) into FB 2 in front-to-back order using premultiplied alpha, depth-testing against DB 1. Lastly you composite FB 2 onto FB 1 using the composited alpha of FB 2. Your final result is in FB 1.
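
    Roughly, that final composite can be a full-screen quad textured with FB 2, blended onto FB 1 like this (a D3D9-style sketch, not the exact code):

        // FB 2 holds premultiplied color, with accumulated coverage in alpha.
        device->SetRenderState(D3DRS_ALPHABLENDENABLE, TRUE);
        device->SetRenderState(D3DRS_SRCBLEND,  D3DBLEND_ONE);         // FB 2 is already premultiplied
        device->SetRenderState(D3DRS_DESTBLEND, D3DBLEND_INVSRCALPHA); // attenuate FB 1 by FB 2's coverage
        // ...draw a full-screen quad sampling FB 2...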

  7. Hi, good post with some good points.

    I’m also confused about the front-to-back rendering though. When rendering and blending the particles and other translucent stuff into FB 2, you still have to cope with each particle blending with the others, and premultiplied alpha, while associative, still isn’t commutative, so you can’t just render in the other order. Pre-multiplied alpha means that you are able to combine the particles (back-to-front) into one buffer and then composite that onto a new buffer as if you had rendered to it originally.

    Or have I missed something?

    Cheers.

  8. Mize, switching from back-to-front to front-to-back requires associativity; it does not require commutativity. Consider having particles A, B, and C, and a frame buffer FB 1. Using the over operator, we would perform back-to-front drawing like so:

    (A over (B over (C over FB 1))).

    Using the properties of associativity, this is equivalent to

    (((A over B) over C) over FB 1).

    In my description 3 comments up, what we draw into FB 2 is ((A over B) over C), after which we composite FB 2 onto FB 1.

    I’m not sure how to make this any clearer, but if you’re still confused you might want to read the links in the fifth paragraph of this earlier blog post of mine.

  9. Ah, think I’ve got you, was getting confused with blend modes.
    Still had it in my head as Src_rgb * 1 + Dest_rgb * (1-Src_a), which would need swapping to take Destination alpha rather than src after the swap in direction. Interesting, going to go and test that out now :)

    Thanks

  10. > Compose particles front-to-back premultiplied-alpha style.

    A few updates on using stencil to limit overdraw in this case, by incrementing stencil on draw and killing pixels over a stencil limit.

    When using stencil in this case, NVidia 8 series and later hardware cannot coarsely cull geometry by stencil when stencil is both used and written to in the same draw call. However, NVidia 8 series hardware has advanced fine-grain Early-Z hardware which does work in this case and stops fragments at the fine raster step (no shading happens). I’ve seen huge gains in this case (my problem was unique, however, in that I had huge amounts of overdraw as well).

    However, if texkill is used in the fragment shader then Early-Z gets disabled, so what might otherwise be a good optimization (killing fragments in the fragment shader which are below some alpha threshold) turns out to be a really bad idea when using stencil.

    Unfortunately NVidia 6/7 series hardware lacks the fine-grain Early-Z stencil. Stencil write + use in the same draw call disables the stencil overdraw optimization. However, stencil overdraw limiting might still be useful with Marco’s example (hull volume shader). The idea here is to draw, in front-to-back order, pairs of {stencil-only hull-volume draw call to increment stencil, hull-volume draw call with stencil cull}. Disadvantage: lots of draw calls (and state changes). Advantage: fast coarse stencil-cull (if alpha test and texkill are NOT used) to limit overdraw. Would need lots of overlapping hulls to justify the optimization, however…

  11. Got the front-to-back working, I guess the bit I missed and tried to explain in my last post was going from: (A over (B over (C over FB 1))) to (((A over B) over C) over FB 1) to ((C under (B under A)) over FB 1)

    Tim, interesting, so not necessarily a win then. Do you think there’d be any scope in rendering the first, say, 10% of particles front-to-back to your offscreen buffer, accumulating the alphas and writing depth (but not depth testing), then doing a fullscreen pass over your offscreen buffer clearing any depths where your alpha accumulation is less than a certain amount, so you’d end up with the depth channel initialised with just your opaque areas, and then rendering the rest of the particles with depth testing for early rejection?

    Also, has anyone tried any techniques to eliminate the blocky stepping artifacts when rendering to a low-res buffer, and depth testing against a downsampled scene depth buffer, and compositing back with the main scene?

  12. Mize, “so not necessarily a win then”

    Stencil overdraw reduction using fine Early-Z: sure, not on PS3 (GeForce 7 based hardware), but definitely a win on DX10 hardware, so it could be quite useful. As for your question, I’d probably have to try it in your case (it would likely depend a lot on the cost of your particles, how much transparency, etc.), then profile to know for sure. As for low-res artifacts, if I remember right, there is an article in GPU Gems 3 (which NVidia is in the process of placing online now) which describes a method to remove edge artifacts on low-res particle rendering. Effectively you do a stencil pass to check for edges, then re-render particles at full resolution in those areas with the stencil test on.

  13. “so not necessarily a win then”
    Sorry, was thinking out loud a bit there. Don’t think it’ll be a win in my particular case ;) Won’t hurt to try it though.

    Interesting you should mention the GPU Gems 3 article though, I actually implemented that very thing, but although the visual results were great, the cost of the overdraw on just the high-res edges proved too great :( I ended up blurring the low-res buffer in areas where the blockiness was going to occur, which worked nicely.

  14. very useful topic, thanks.

    I have tried to optimize my DX10 project, in which I used the GS to animate and draw quads from a point list.
    The example can be found on my website.

    Here is what I found: using the GS saves a huge amount of bandwidth that would otherwise be needed to transfer the data from the CPU to the GPU, but it can put a huge load on fragment processing (my performance dropped drastically when I increased my billboard size 2x).

    Having something like GS frustum culling could help improve this situation. Using the same idea, I reduce the size of the billboard (in the GS) if its Z is very far from the camera; this approach bumped my FPS back up to where it was before :)

    By the way, has anyone got any idea how to transfer data from an MSAA buffer to a normal buffer, and vice versa, under the DirectX 10 API?
    My grass rendering uses MSAA, but nothing else does. If possible, I’d like to reduce operations on the MSAA buffer as much as possible :(

    thanks in advance

  15. It can also be beneficial to render particles after the Z pass but before rendering the rest of the scene, and update the Z buffer to zero where particles are fully opaque. This way, when the particles cover a big chunk of the screen, the rest of the frame is a lot cheaper to render as it’s mostly discarded. Obviously this brings some extra overhead, like an additional resolve, but if the frame has a lot of particles it will balance the GPU load, assuming that the rest of the scene is reasonably complex.

  16. “Using premultiplied alpha (which is associative) we can blend particles front-to-back instead of the normal back-to-front ordering.”

    Well, technically speaking, the associativity of a certain blend operation has nothing to do with the representation of color (i.e. premultiplied vs. non-premultiplied): it is the Porter-Duff source-over blend mode that is associative here. And not a lot of people know this, but this trick can lower blending quality – it depends on internal precision (and of course the case), but in short there is a problem with monotonicity due to loss of information (in the 2D parameter space [alpha, channel value]) when going front-to-back.
    (Also, it is possible that the artefacts can only be seen in cases where single particles are large with respect to the screen, and the artefacts in 8 bits of precision look mostly like very subtle noise (in 565 output you can see saw-tooth kind of patterns in the banding lines, which of course can be helped with dithering).)

    But of course if you can avoid touching the framebuffer after reaching a certain opacity level, the performance gain can be huge – it can also be thought of as the primary rays of ray tracing; a ray travels only as far as it has to.

  17. One can also bin particles into screen-space tile jobs, and hand the jobs to a many-core processor (e.g. SPU, CUDA, OpenCL). Each core independently sorts its bin by depth, reads a tile of color+depth, and renders the particles in order to fast local memory. Only at the very end is the tile of color written back to video memory, exactly once.

    This guarantees 1x overdraw regardless of particle count, though of course the cost of binning and sorting increases with particle count.

    SPU has no texture fetch, so it may be less suitable for this technique, particularly if particles rotate etc.
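
    A very rough CPU-side sketch of the binning step (the types and the tile size are made up; the per-tile depth sort and blending would then run on the worker cores):

        // Bin particle indices into screen-space tiles by their projected bounds.
        #include <vector>

        struct Particle { float x, y, z, radius; };    // x, y, radius already in screen space

        static const int kTileSize = 32;                // tile size in pixels

        std::vector<std::vector<int> > BinParticles(const std::vector<Particle>& particles,
                                                    int screenW, int screenH)
        {
            const int tilesX = (screenW + kTileSize - 1) / kTileSize;
            const int tilesY = (screenH + kTileSize - 1) / kTileSize;
            std::vector<std::vector<int> > bins(tilesX * tilesY);

            for (int i = 0; i < (int)particles.size(); ++i) {
                const Particle& p = particles[i];
                int x0 = (int)((p.x - p.radius) / kTileSize); if (x0 < 0) x0 = 0;
                int x1 = (int)((p.x + p.radius) / kTileSize); if (x1 > tilesX - 1) x1 = tilesX - 1;
                int y0 = (int)((p.y - p.radius) / kTileSize); if (y0 < 0) y0 = 0;
                int y1 = (int)((p.y + p.radius) / kTileSize); if (y1 > tilesY - 1) y1 = tilesY - 1;
                for (int ty = y0; ty <= y1; ++ty)
                    for (int tx = x0; tx <= x1; ++tx)
                        bins[ty * tilesX + tx].push_back(i);   // particle touches this tile
            }
            // Each core then grabs a bin: sort it by z, read the tile of color+depth once,
            // blend every particle in fast local memory, and write the tile back exactly once.
            return bins;
        }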
