Detecting state changes in bit-based render queue

Started by
4 comments, last by J. Rakocevic 4 years, 6 months ago

Hello. I'm a new member here, I've been lurking and reading for a while but now I ran into a conundrum over a specific detail in the rendering queue implementation based on bit key sorting. I've read plenty of posts by Hodgman and L. Spiro pertaining to the topic, and it was really helpful and eye opening (thank you!) but I couldn't find an answer to this specific question so pardon me if it's a duplicate.

So - I have my scene, with a vector of actors in it (has models, bounding spheres etc) - I frustum cull it, I send all renderable objects to the rendering queue (as of now unsorted) as I iterate through the scene. That part is clear.

Then I compute the bit keys (using uint64_t right now, I don't have that many options implemented as of yet and that seems enough for starters) and I call a sort based on that. That works as well, it sorts the way I want it to. Great.

Now I have this sorted array of renderable indices... and boom - the problem rears its ugly head. As I go down this sorted array (or well, vector) and call render on the items in it, how can I detect when to switch states? As in, there is no... cutoff, so to speak, that says at index i, this material is done, use the new one. An approach using a "map" of arrays should trivially handle this (I think, however I figure this is not cache friendly at all), but I don't know how to do it with a single array...

So, how can I tell: ok, at this point I'm done with this state, switch from shader/texture/buffer a to b, then continue rendering? I saw the part of Christer Ericson's blog where he mentions adding commands to the key, but I seriously don't even know where to begin with encoding something like "call this function to execute x state change with these parameters" in a few bits of the key... An alternative is to keep if-checking for every state change, on every draw... sure, this is still better than unordered calls, but isn't that still slow? I'm aware ifs can be fast, but that can be a lot of checks!

I searched a bit and read that DX11.1 (that's my poison right now) supposedly already discards unnecessary state switches (aka if I bind the same texture to the same slot again it will just do nothing) but that seems like a dangerous assumption to make, and I'm sure there's a cost associated to issuing that command even if it returns sooner.

TLDR: Please advise me on the optimal way to switch states as I progress through a single sorted render queue that doesn't have clear cut off points between state changes. Thanks.

A short update: I profiled setting the state once and drawing (2500 objects) versus re-setting the same state before every draw, on the assumption that setting an unchanged state is fairly cheap. Turns out my suspicions were correct: I got 30 fps with a single switch and 20 fps when overwriting the same states, with all objects visible and everything else in the engine the same. So it seems if-checks are necessary then?

I suppose there is no "best way" to solve this. If all you want to do is fire the render commands from the list of sorted renderables, and you have some sort of State Cache that skips a state change when previousstate == newstate, then it doesn't matter where the check lives: doing it inside the cache or inside the loop gives the same outcome. I wouldn't check twice if you don't have to. If the cache already deduplicates state changes, there is no need for yet another IF one level higher.
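To make the State Cache idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `DrawItem`, `StateCache`, `drawSorted`, the choice of id 0 as "nothing bound", and the counters that stand in for real graphics API calls. It is not the cache from any particular engine.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical renderable: only the state ids it needs.
// Id 0 is reserved to mean "nothing bound yet".
struct DrawItem {
    std::uint32_t shader;
    std::uint32_t texture;
};

// Remembers the last bound ids and skips redundant binds.
// The counters stand in for the real graphics API calls.
struct StateCache {
    std::uint32_t boundShader = 0;
    std::uint32_t boundTexture = 0;
    int shaderBinds = 0;
    int textureBinds = 0;

    void setShader(std::uint32_t id)
    {
        assert(id != 0);
        if (id == boundShader) return; // already bound: issue nothing
        boundShader = id;
        ++shaderBinds;                 // a real cache would bind the shader here
    }

    void setTexture(std::uint32_t id)
    {
        assert(id != 0);
        if (id == boundTexture) return; // already bound: issue nothing
        boundTexture = id;
        ++textureBinds;                 // a real cache would bind the texture here
    }
};

// The loop requests state for every item; the cache decides what is issued.
int drawSorted(const std::vector<DrawItem>& sorted, StateCache& cache)
{
    int draws = 0;
    for (const DrawItem& item : sorted) {
        cache.setShader(item.shader);
        cache.setTexture(item.texture);
        ++draws; // stand-in for the actual draw call
    }
    return draws;
}
```

The draw loop stays free of state comparisons, and because the list is sorted, each distinct shader or texture run costs one bind no matter how many items share it.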

There is a different situation where you process the list to create even lower-level commands, which is what I do. I do not issue draw calls immediately when running the loop; I have an additional "batching" pass which collects similar items to compact them into as few draws as possible, heavily exploiting indirect drawing and bindless resources. Let's say, for example, that I have this list of already SORTED renderables (#X is the mesh number, meaning a unique object with its own model matrix, not necessarily a unique mesh; Sx is the shader, Bx the buffers/VAO):

[#1 S1 B1 ] 
[#2 S1 B1 ] 
[#3 S2 B1 ]
[#4 S2 B1 ]
[#5 S2 B2 ]
[#6 S2 B2 ]
[#7 S2 B2 ]

Now I want to batch these for indirect drawing, which means I must split them by shader/buffer changes as these can't be done via indirect and must result in a state change and two separate calls. So I group these into 3 draw calls:

--- draw call #1 ---
[#1 S1 B1 ]
[#2 S1 B1 ]
--- draw call #2 ---
[#3 S2 B1 ]
[#4 S2 B1 ]
--- draw call #3 ---
[#5 S2 B2 ]
[#6 S2 B2 ]
[#7 S2 B2 ]

This is where I use the "change in state" detection, because I want to know the boundary of each indirect draw call, which ends as soon as the shader and/or buffers change from one renderable to the next. The way I do it is to mask those fields within the sort key and compare the masked previous key against the masked current one (XOR them: any nonzero bit means something changed). If you need to know which one changed (I don't), you can probably use two masks and check them separately. The important thing is that I don't use a chain of ifs here; it's just bit calcs to find the difference, and I construct a switch/case with specific cases (resulting from the mentioned bit ops and additional checks) and switch on these. I assume it's a bit better than slapping too many ifs around.
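A rough sketch of that boundary detection, under assumptions of my own: the key layout (shader id in bits 32..47, buffer id in bits 16..31, per-object data in the low 16 bits) and the names `makeKey`, `Batch` and `buildBatches` are all hypothetical. It also uses one plain branch per item instead of the switch/case described above, just to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout for this sketch: shader id in bits 32..47,
// buffer id in bits 16..31, per-object data (e.g. depth) in bits 0..15.
constexpr std::uint64_t kShaderMask = std::uint64_t{0xFFFF} << 32;
constexpr std::uint64_t kBufferMask = std::uint64_t{0xFFFF} << 16;
constexpr std::uint64_t kBatchMask  = kShaderMask | kBufferMask;

std::uint64_t makeKey(std::uint64_t shader, std::uint64_t buffer, std::uint64_t perObject)
{
    assert(shader <= 0xFFFF && buffer <= 0xFFFF && perObject <= 0xFFFF);
    return (shader << 32) | (buffer << 16) | perObject;
}

struct Batch {
    std::size_t first; // index of the first renderable in the sorted list
    std::size_t count; // consecutive renderables sharing shader and buffers
};

// Splits a sorted key list into runs that can each become one indirect draw.
std::vector<Batch> buildBatches(const std::vector<std::uint64_t>& sortedKeys)
{
    std::vector<Batch> batches;
    for (std::size_t i = 0; i < sortedKeys.size(); ++i) {
        // XOR leaves a 1 wherever the two keys differ; the mask keeps only
        // the bits that force a state change, so per-object bits are ignored.
        const bool boundary =
            i == 0 || ((sortedKeys[i - 1] ^ sortedKeys[i]) & kBatchMask) != 0;
        if (boundary) batches.push_back(Batch{i, 0});
        ++batches.back().count;
    }
    return batches;
}
```

Fed the seven renderables from the example (S1 B1 twice, S2 B1 twice, S2 B2 three times), this should produce three batches of 2, 2 and 3 items, matching the three draw calls.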

That's the final result that goes into executing actual GPU commands:

SetBuffers(B1)
SetShader(S1)
DrawIndirect(#1)
SetBuffers(B1)
SetShader(S2)
DrawIndirect(#2)
SetBuffers(B2)
SetShader(S2)
DrawIndirect(#3)

Here I don't really bother with the slight duplication, because the StateCache won't issue the same state change anyway, and there are never many of these commands if compaction works well. I'm down to several draw calls for scenes like Sponza etc., and the count only varies with how many different shaders/renderstates plus specific buffer layouts I have. Everything else is bindless and does not need state changes.


Where are we and when are we and who are we?
How many people in how many places at how many times?

Something didn't save and editing doesn't work, so I will post a follow-up. While the above is a list of more concrete GPU commands, it is still not what is sent to the GPU. It goes through the StateCache and is optimized further, so the real stream of draws and state changes at the GL level would be more like:

glBindVertexArray(1)
glUseProgram(1)
glDrawElementsIndirect(... #1 ...)
glUseProgram(2)
glDrawElementsIndirect(... #2 ...)
glBindVertexArray(2)
glDrawElementsIndirect(... #3 ...)

Notice how some of the calls from the previous stage are gone: they were redundant, so they were not issued, which avoids additional strain on the GPU for no effect at all.



I'll need a bit to understand all that (very tired atm) but thanks. I see the merit of the suggestion.

I see how adding the batching phase between the sort and the render could help, supposedly this is where I would take advantage of instancing as well right?
As for indirect drawing... Sadly I never heard of this until now... our graphics API instruction at uni was fairly lacking, already knew more than they taught tbh... is it a feature where you submit several DrawIndexed/DrawInstanced calls to the GPU (something which I think OpenGL supports) at once to reduce driver overhead? Not sure if DX11 has this...

As for the bit mask - that seems like great advice! I rarely used bit shifting but I know a "bit" about it... so for example - let's say I need to detect that the state changed. Below is my key for opaque meshes. It's pretty simple and probably far from good as I'm new to this but let's just assume that's it.

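uint64_t create64bitKey(uint64_t renderTarget, uint64_t shaderSetId, uint64_t textureId, uint64_t depth, uint64_t vertexFormat)
{
    // Layout from the most significant bit down:
    // renderTarget (8 bits), shaderSetId (16), textureId (16), depth (16), vertexFormat (8).
    // Unsigned 64-bit operands keep the shifts well defined; the masks stop a field from spilling into its neighbour.
    uint64_t result =
        (renderTarget & 0xFF) << (64 - 8)
        | (shaderSetId & 0xFFFF) << (64 - 24)
        | (textureId & 0xFFFF) << (64 - 40)
        | (depth & 0xFFFF) << (64 - 56)
        | (vertexFormat & 0xFF);
    return result;
}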

Obviously, since vertex and index buffers are not in the key as it is, I can't detect when those change. But I could detect everything else simply by bit comparison? The part of the key that mismatches can tell me which state it is, based on how I packed it, is that correct? For example, the 8th to 24th bits (counting from the most significant end) being different would mean I'm using a different shader for the next draw call.
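For what it's worth, the "which part mismatches" idea can be sketched with one XOR and one mask per field. This assumes the layout the key above seems to aim for (8 bits render target, 16 shader, 16 texture, 16 depth, 8 vertex format, from the most significant bit down); `packKey`, `changedStates` and the flag constants are hypothetical names made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout, most significant first: 8 / 16 / 16 / 16 / 8 bits.
constexpr std::uint64_t kRenderTargetMask = std::uint64_t{0xFF}   << 56;
constexpr std::uint64_t kShaderMask       = std::uint64_t{0xFFFF} << 40;
constexpr std::uint64_t kTextureMask      = std::uint64_t{0xFFFF} << 24;
constexpr std::uint64_t kVertexFormatMask = std::uint64_t{0xFF};
// Depth (bits 8..23) is deliberately not tested: it orders draws, it is not a state.

// Bit flags describing which states differ between two keys.
constexpr unsigned kRenderTargetChanged = 1u << 0;
constexpr unsigned kShaderChanged       = 1u << 1;
constexpr unsigned kTextureChanged      = 1u << 2;
constexpr unsigned kVertexFormatChanged = 1u << 3;

// Hypothetical helper that packs a key with the assumed layout.
std::uint64_t packKey(std::uint64_t renderTarget, std::uint64_t shader,
                      std::uint64_t texture, std::uint64_t depth,
                      std::uint64_t vertexFormat)
{
    assert(renderTarget <= 0xFF && shader <= 0xFFFF && texture <= 0xFFFF);
    assert(depth <= 0xFFFF && vertexFormat <= 0xFF);
    return (renderTarget << 56) | (shader << 40) | (texture << 24)
         | (depth << 8) | vertexFormat;
}

// Returns a combination of the k...Changed flags, or 0 if no state differs.
unsigned changedStates(std::uint64_t previousKey, std::uint64_t currentKey)
{
    const std::uint64_t diff = previousKey ^ currentKey; // 1 bits mark differences
    unsigned changed = 0;
    if (diff & kRenderTargetMask) changed |= kRenderTargetChanged;
    if (diff & kShaderMask)       changed |= kShaderChanged;
    if (diff & kTextureMask)      changed |= kTextureChanged;
    if (diff & kVertexFormatMask) changed |= kVertexFormatChanged;
    return changed;
}
```

Two keys that differ only in depth would report no change at all, which is the desired behaviour here: the objects draw in a different order but need no rebinding.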
