There are many things that can kill the frame rate in a modern game, and particles are near the top of the list. A key contributing factor is that particles are subject to a lot of overdraw that is not present in your opaque geometry.
The reason for the increase in overdraw is that for particles we tend to have lots of individual primitives (usually quads) that overlap, perhaps to mimic effects like fire or smoke. Normally each particle primitive is translucent (alpha-blended), so the z-buffer is not updated as pixels are written, and we end up shading the same pixels multiple times. (In contrast, for opaque geometry we do write to the z-buffer, so between a possible z-prepass, sorting objects front-to-back, hierarchical z-culling on the GPU, and normal depth testing, we end up with very little overdraw.)
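To make the contrast concrete, here is a toy single-pixel model (hypothetical code, not from any real pipeline) counting how many times the pixel shader runs for a stack of overlapping primitives covering one pixel:

```python
def shaded_samples(depths, opaque):
    """depths: per-primitive depth at this pixel, in draw order
    (smaller = closer). Returns how many times the pixel shader runs."""
    z = float("inf")  # z-buffer value for this pixel
    runs = 0
    for d in depths:
        if d < z:          # depth test passes
            runs += 1
            if opaque:     # opaque geometry writes z, occluding later prims
                z = d
        # translucent prims leave z unchanged, so later prims pass too
    return runs

stack = [0.1, 0.2, 0.3, 0.4, 0.5]        # five quads, drawn front to back
print(shaded_samples(stack, opaque=True))   # 1  (rest are z-culled)
print(shaded_samples(stack, opaque=False))  # 5  (full overdraw)
```

Note that even opaque geometry drawn back-to-front would shade all five times here, which is why front-to-back sorting and z-prepasses matter for the opaque pass.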
Overdraw, in turn, increases use of both fillrate (how many pixels the hardware can render per second) and bandwidth (how much data we can transfer to/from the GPU per second), both of which can be scarce resources.
Of course, overdraw is not the only reason particles can slow your frame rate to a crawl. We can also get bitten by other problems, like setting too much state for each particle or particle system.
What to do?
OK, let’s say we all agree that particles can cause a lot of problems. Fortunately there are lots of things that can be done to optimize the rendering side of a particle system. For whatever reason, few of these are discussed in books or articles, so I thought I’d list a few things we can do. Feel free to add more suggestions as comments.
- Use opaque particles. For example, make smoke effects really thick so (some or all of) the particle billboards can be opaque, with cutout alpha. For some particles, like shrapnel, rocks, or similar, use lightweight geometry particles instead of sprites with alpha borders.
- Use richer particles. Put more oomph in a single particle sprite so we need fewer of them. Use flipbook textures for creating billowing in e.g. fire and smoke, rather than stacking sprites.
- Reduce dead space around cutout-alpha particles. Use texkill to avoid processing the transparent regions of the sprite. Alternatively, trim away the transparent areas around the particle by using an n-gon fan instead of just a quad sprite. (But beware of lowered quad (2×2 pixel block) utilization as the triangle count increases, and of becoming vertex bound in the distance, so LOD from the fan down to a quad sprite for distant particles.)
- Cap the total number of particles. Use hardware counters on the graphics card to count how many particle pixels have been rendered, and stop emitting or drawing particles when passing a certain limit (which can be set dynamically).
- Use frequency divider to reduce data duplication. We can reduce bandwidth and memory requirements by sharing data across particle vertices using the frequency divider, instead of through data duplication across vertices. (Arseny Kapoulkine described this well in Particle rendering revisited.)
- Reduce state changes. Share shaders between particles. We can make this happen by e.g. dropping features for distant particles (such as dropping the normal map as soon as possible).
- Reduce the need for sorting. Rely on additively or subtractively blended particles where possible. E.g. additive particles can be drawn in any order, so we can sort them to e.g. reduce state changes instead of sorting on depth (as is likely needed for normal alpha-blended particles).
- Draw particles after AA-resolve. Most games today use multisampled antialiasing (MSAA), drawing to a 2x or 4x MSAA buffer. These buffers must be resolved (with an AA-resolve pass) into a non-MSAA buffer before display. Due to the way MSAA works, we still run the pixel shader the same number of times whether we draw particles before or after the AA-resolve, but drawing particles after the AA-resolve drastically reduces frame buffer reads and writes, ROP costs, etc.
- Draw particles into a smaller-resolution buffer. We can also draw particles into a separate, smaller-resolution buffer (smaller than the frame buffer or the AA-resolved buffer). The exact details vary depending on e.g. whether you use RGBA or FP16 frame buffers, but the basic idea is the following. First we draw the opaque geometry into our standard frame buffer. We then shrink the corresponding z-buffer down to 1/4 or 1/16 size, and draw the particles into a 1/4- or 1/16-size frame buffer using this smaller z-buffer. Once done, we scale the smaller frame buffer back up and composite it onto the original frame buffer. (Some interesting details I’ve left out are how exactly to perform the z-buffer shrinking and compositing steps.)
- Use MSAA trick to run pixel shader less. On consoles (and on PC hardware if drivers allowed it) you can tell the GPU to treat, say, an 800×600 frame buffer as if it were a 400×300 (ordered) 4xMSAA buffer. Due to the way MSAA works, this has the effect of running the pixel shader only once per 2×2 pixels of the original buffer, at the cost of blurring your particles equivalently much. (Though you still get the benefit of antialiasing at the edges of the particles.)
- Generate particles “on chip.” On newer (PC) hardware we can use geometry shaders to generate particles on the GPU instead of sending the vertex data from the CPU. This saves memory and bandwidth.
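As a rough illustration of the smaller-resolution-buffer idea above, here is a sketch in Python (hypothetical code: single color channel, nearest-neighbor upsample, and taking the farthest depth per 2×2 block as one possible way to shrink the z-buffer; as noted, the exact shrink and composite choices are details left open):

```python
def downsample_z_max(z, w, h):
    """Shrink a w x h z-buffer (row-major list, larger = farther) to
    (w//2) x (h//2), keeping the farthest depth of each 2x2 block so
    low-res depth testing errs toward drawing the particle."""
    out = []
    for y in range(0, h, 2):
        for x in range(0, w, 2):
            out.append(max(z[y * w + x],       z[y * w + x + 1],
                           z[(y + 1) * w + x], z[(y + 1) * w + x + 1]))
    return out

def composite_up(frame, low, w, h):
    """Nearest-neighbor upsample the (w//2) x (h//2) particle buffer of
    (premultiplied color, alpha) pairs over the full-res frame."""
    out = frame[:]
    for y in range(h):
        for x in range(w):
            c, a = low[(y // 2) * (w // 2) + (x // 2)]
            out[y * w + x] = c + (1.0 - a) * out[y * w + x]
    return out

z = [ 1,  2,  3,  4,
      5,  6,  7,  8,
      9, 10, 11, 12,
     13, 14, 15, 16]
print(downsample_z_max(z, 4, 4))  # [6, 8, 14, 16]
```

Storing the low-res particle buffer with premultiplied alpha makes the final composite a single blend per full-res pixel.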
We can also attempt some more esoteric stuff, like:
- Compose particles front-to-back, premultiplied-alpha style. Because premultiplied-alpha blending is associative, we can blend particles front-to-back instead of in the normal back-to-front order. The idea is to use the front-to-back drawing to fill in the depth or stencil buffer once accumulated alpha has become (near) solid, and ultimately stop drawing particles altogether (when they no longer contribute much to the visual scene, off in the distance).
- Group particles together into one particle entity. Instead of drawing two overlapping particles individually, we can form a single (larger) particle that encompasses the two particles and performs the blending of the two particles directly in the shader. This tends to reduce the amount of frame buffer reads we do (as we now only have to blend one particle) but it can also increase it (if the single particle covers much more area than the union of the two original particles).
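The front-to-back premultiplied-alpha idea can be sketched as follows (hypothetical code: single color channel, with a `stop_alpha` threshold standing in for the depth/stencil early-out):

```python
def composite_front_to_back(layers, stop_alpha=0.99):
    """layers: (premultiplied_color, alpha) pairs sorted near to far.
    Front-to-back "under" compositing; stops once accumulated alpha
    is near solid, since remaining layers barely contribute.
    Returns (color, alpha, layers_drawn)."""
    c_acc, a_acc, drawn = 0.0, 0.0, 0
    for c, a in layers:
        c_acc += (1.0 - a_acc) * c   # add what shows through so far
        a_acc += (1.0 - a_acc) * a
        drawn += 1
        if a_acc >= stop_alpha:      # (near) solid: skip the rest
            break
    return c_acc, a_acc, drawn

layers = [(0.4, 0.5)] * 8            # eight identical translucent layers
c, a, drawn = composite_front_to_back(layers)
print(drawn)                         # 7 -- the eighth layer is skipped
```

Because premultiplied-alpha blending is associative, disabling the early-out (e.g. `stop_alpha` above 1) reproduces exactly what back-to-front blending of all eight layers would give.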
My original intent was to categorize all these items in terms of what they save (overdraw, fillrate, ROP, pixel-shader ALU, etc.), but creating the 2D matrix of feature vs. savings was a little more effort than I had time for. Hopefully it’s still clear what each bulleted item achieves.
All in all, the above is a long list, but I’m sure I left something out, so please comment. Also, what’s your favorite “trick”?