
Is branching logic in shaders really still a problem?

Started by Josh Klint
10 comments, last by MJP 1 year, 12 months ago

On modern PC hardware I have found branching logic to have no downsides. I do stuff like this all the time:

if (textureID != -1) // -1 marks “no texture bound”, so the fetch is skipped entirely
{
	color *= texture(sampler[textureID], texcoords);
}

I don't believe the whole “GPU executes both branches and takes the result of one” thing is an issue anymore.

What about on the latest mobile graphics chips? I may want to support Oculus Quest in the future, so I am primarily concerned with Adreno 530, 540, 650 chipsets. What's the current situation there?

10x Faster Performance for VR: www.ultraengine.com


Josh Klint said:
if (textureID != -1) { color *= texture(sampler[textureID], texcoords); }

I don't believe the whole “GPU executes both branches and takes the result of one” thing is an issue anymore.

I never really believed this either. Starting with Kepler / GCN, I never felt any problem from using branches.
Your example is even a win: by taking the branch, the bandwidth of an otherwise useless fetch is avoided. In this context I doubt ‘both sides are executed’ was ever true. I prefer to think about it as ‘some threads do nothing’: they may step through the instructions, but their memory operations are not executed. (As far as I can tell from profiling the difference.)

But I lack serious experience with mobile GPUs. They do not even publish teraflop numbers, so their expected performance is mostly a mystery to me. I would mainly be interested in compute performance. I'll add this to the question, in case somebody can tell…

I would be wary of simplifying things here too much; there's a lot of nuance to branches. Really you need to break things down into 3 kinds of branches to understand things better:

  1. Branch on a compile-time constant - this branch really doesn't exist in the end. Once the compiler's optimizer runs it will remove that branch entirely and dead-strip away anything that won't ever execute. People tend to rely on this a lot for ubershader setups that generate many shader permutations.
  2. Branch on a uniform value (from a constant buffer/uniform buffer, typically) - These turn into actual branches executed by the GPU, but they're the “best” kind of branches: all threads in a wave/warp will take the same path because it's a uniform, giving you the “skip the instructions in the branch not taken” behavior that you're used to on CPUs. At a lower level you don't need the branch condition to be uniform across the entire draw/dispatch, it really only needs to be uniform for the entire warp/wave. But this is generally tricky to achieve consistently outside of special cases and/or careful use of wave intrinsics (AKA subgroup operations). One thing to watch out for with this is that these branches can still end up making your shader program larger, since the instructions for both sides need to be included, and the compiler will still need to allocate enough registers to handle both sides of the branch. Both of these can potentially cause your performance to suffer even for a branch never taken, through I$ misses and/or reduced occupancy due to register pressure.
  3. Branch on some arbitrary runtime-computed value - this is where things can get nasty thanks to the SIMD execution model. If that value is not the same within a warp/wave, the hardware will execute both sides of the branch. You get all of the issues from #2, along with potentially poor performance if your branch condition is not coherent within your warps/waves. You also get the added headache that you can't use explicit or implicit derivative calculations (such as the “auto-select-the-mip-level” behavior of Texture2D.Sample from HLSL) inside of these branches, since the threads within a pixel shader quad may have taken different paths in the branch.

So in summary: yes, you're correct that you shouldn't just assume that branches are always bad, and they can absolutely be a real win if deployed correctly. But you still need to be careful about what you're branching on, or you can get into trouble.
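To make the three categories concrete, here is a minimal GLSL fragment-shader sketch; the uniform block, bindings, and the `materialFlags` input are hypothetical names invented for illustration:

#version 450

layout(location = 0) in vec2 texcoords;
layout(location = 1) flat in int materialFlags; // varies per primitive -> can diverge per wave
layout(location = 0) out vec4 color;

layout(binding = 0) uniform sampler2D albedoTex;
layout(binding = 1) uniform Params { int debugMode; }; // constant for the whole draw

const bool ENABLE_FOG = true; // kind 1: compile-time constant

void main()
{
    color = texture(albedoTex, texcoords);

    if (ENABLE_FOG) // kind 1: the optimizer removes this branch entirely
        color.rgb = mix(color.rgb, vec3(0.5), 0.1);

    if (debugMode != 0) // kind 2: uniform branch, every thread in a wave takes the same path
        color.rgb = vec3(texcoords, 0.0);

    if (materialFlags != 0) // kind 3: potentially divergent, a wave may execute both sides
        color.rgb *= 0.5;
}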

@mjp Everything you listed in case 3 also happens to be something that could not be decided with an #ifdef. So why is this a huge problem for everyone in the industry except @joej and me?


I think you can bring any piece of hardware down in performance if you use it wrong. This is of course not a binary thing; it's a scale. You gradually go down as you do more “bad” things with it.

The question is perhaps more about how much you care about using the hardware to its fullest potential. Put more extremely: how obsessed are you with speed? How comfortable are you with accepting that you may be causing non-optimal hardware use due to your non-optimal programming?

I would say, given the amount of discussion about performance here, many are extremely scared of it, sometimes even before getting anything running.

@alberth Performance is an extremely high priority for me; the whole value proposition of my software is based on it. Yet I seem to have absolutely no problems, even with performance-intensive stress tests.


Basically what MJP said. Dynamic per-pixel branching is the issue. Every card operates on 2x2 pixel blocks (unless this has changed recently). Same concept as wavefronts. If all 4 pixels don't execute the same code path, there are some issues that arise.

NBA2K, Madden, Maneater, Killing Floor, Sims http://www.pawlowskipinball.com/pinballeternal

dpadam450 said:
Same concept as wavefronts. If all 4 pixels don't execute the same code path, there are some issues that arise.

Performance-wise, the behavior is the same whether it's a pixel or compute shader, because pixel shaders also run in wavefronts / warps. If only one pixel out of 32/64 diverges, the other quads in the wave slow down too.

However, programmers need branches; otherwise there's not much you can do at all. Which is why I never understood the advice to ‘avoid branches’, even on very old GPUs.

Personally, I had one problem which helped me get rid of the compulsive desire to keep all threads 100% busy all the time: a simple N-body problem.
Imagine we have workloads of many N-body problems, each having a variable size between 30 and 200 bodies.
The fastest way to process them is to use dispatches of varying workgroup sizes: 32, 64, 128, and 256. Each workgroup processes problems ≤ its size and > half its size, so every problem goes to the smallest workgroup that still fits it.
Rephrasing this, we can say: ‘to solve the problem most efficiently, about 75% of your threads will be busy’ (problem sizes land somewhere between half and full workgroup size, so on average roughly three quarters of the threads have work). The rest are idle, but that does not change the fact that this is the ideal efficiency.

The same is true for branches in general, if you look at it from this angle.
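A minimal compute-shader sketch of one such bucket; the buffer layout and names here are invented for illustration. Each workgroup handles one problem, and the threads past the problem's body count simply idle:

#version 450
layout(local_size_x = 64) in; // the bucket for problems with 33..64 bodies

struct Body { vec4 posMass; }; // xyz = position, w = mass

layout(std430, binding = 0) buffer Bodies   { Body  bodies[];   };
layout(std430, binding = 1) buffer Problems { ivec2 problems[]; }; // (firstBody, bodyCount)
layout(std430, binding = 2) buffer Accels   { vec4  accels[];   };

void main()
{
    ivec2 p = problems[gl_WorkGroupID.x];
    int   i = int(gl_LocalInvocationID.x);

    // Looks like a divergent branch, but the 'idle' threads issue no
    // memory traffic; they just wait for the workgroup to finish.
    if (i < p.y)
    {
        vec3 pos   = bodies[p.x + i].posMass.xyz;
        vec3 accel = vec3(0.0);
        for (int j = 0; j < p.y; ++j)
        {
            vec3  d  = bodies[p.x + j].posMass.xyz - pos;
            float r2 = dot(d, d) + 1e-4; // softening avoids division by zero
            accel += bodies[p.x + j].posMass.w * d * inversesqrt(r2 * r2 * r2);
        }
        accels[p.x + i] = vec4(accel, 0.0);
    }
}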

I'm not sure about elsewhere, but nobody here said that “branching is always slow”. I said something very different: that there's a ton of nuance here and everything needs to be carefully qualified. If you're seeing advice somewhere that “branching is always slow”, it's either wrong or outdated. In reality, trying to assign some “cost” to your if statement is complicated and depends on a whole host of factors, which is why I went out of my way to immediately break things out into 3 broad categories to illustrate how wildly different things can happen at the hardware level depending on what you're doing (and I didn't even cover the case of drivers messing with your shaders behind the scenes, which can happen).

People use branches all of the time in shaders these days. In general I would not say you should automatically avoid them or anything like that. If they're working well for your case that's great, just be careful not to extrapolate too much from one test case (for example: it's not at all clear from your code snippet whether textureID is uniform or divergent). Instead I would recommend taking a principled approach towards performance where you gather as much information as you can, make reasonable estimates based on your knowledge and mental models, and verify your assumptions wherever possible. If you go too far in the other direction and simply assume that branches are always going to work out for you, you will inevitably end up hitting cases where your assumptions are incorrect. I would also be wary of absolute statements about performance, they almost always need to be qualified in some way.
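As a footnote on the divergent case: when an index like textureID can diverge within a wave, one well-known workaround is a “waterfall” loop that scalarizes the index with subgroup operations. A hedged GLSL sketch, assuming the GL_KHR_shader_subgroup_ballot extension is available (API details such as descriptor-indexing requirements for the sampler array are glossed over here):

#extension GL_KHR_shader_subgroup_ballot : require

// Compute derivatives up front, while control flow is still uniform,
// so the fetch inside the divergent loop can use textureGrad safely.
vec2 gradX = dFdx(texcoords);
vec2 gradY = dFdy(texcoords);

if (textureID != -1)
{
    for (;;)
    {
        // Broadcast the first active lane's ID: wave-uniform by construction.
        int uniformID = subgroupBroadcastFirst(textureID);
        if (uniformID == textureID)
        {
            color *= textureGrad(sampler[uniformID], texcoords, gradX, gradY);
            break; // this lane is done; lanes with other IDs loop again
        }
    }
}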

MJP said:

I would be wary of simplifying things here too much, there's a lot of nuance to branches. Really you need to break things down into 3 kinds of branches to understand things better: […]

@MJP: Very nice answer! I do have one question though (I hope you see this):

I have often wondered whether there is any difference between if-branches and switch-branches. From CS courses we learned that the second option may allow the compiler to create a jump table, which eases the logic needed for branch prediction and resolution. Following your intuition in bullet #2, the logical deduction would be that whichever value we decide to switch on has to be a compile-time constant for maximum efficiency, or a uniform value for semi-optimal runtime behavior. And finally: would the same register allocation penalty occur with switch as it would with if?

I wrote some code for a tutorial on bloom, where this setup handles downsampling of an image:

switch (mipLevel) // `mipLevel` is a uniform int
{
case 0:
    [...]
    break;
default:
    [...]
    break;
}

Granted, it could be replaced with an if-else branch, as sketched below. But I have used this pattern before with more clauses, and am curious about best practices. Thanks in advance!
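For reference, the equivalent if-else form (keeping the same elided bodies as the switch above); since `mipLevel` is uniform, both forms should land in MJP's category 2:

if (mipLevel == 0) // `mipLevel` is a uniform int, so this is a uniform branch
{
    [...]
}
else
{
    [...]
}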

