
Principles of Debugging Effectively

Published April 12, 2017

I just spent three days debugging a problem in my VR game. I don't fully understand the root cause, but I do know enough about the nature of the problem to create an effective workaround.

Over the course of my career as a programmer and developer, I've gradually gotten better and better at debugging and troubleshooting. I think this is a really hard skill to get good at, because it's more a methodology and a way of critical thinking to master than a particular debugger in an IDE. Without wasting a bunch of time, I'll get right to it.

The first step is to identify a problem. It's a problem if the software application does not behave as it was designed or intended to behave.

The second step is to be able to reliably reproduce the problem. This isn't always easy. In theory, someone could write code that triggers undesirable behavior on a 1-in-100 random roll, and that would make the behavior very difficult to reproduce. If you can, it is enormously helpful to be able to create a game "replay" file, so you can save and replay the game and observe the behavior.
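A minimal sketch of the replay idea, in Python rather than a game engine: if you record the random seed and the input sequence together, a rare 1-in-100 failure becomes perfectly repeatable. All names here (`run_session`, the state arithmetic) are hypothetical stand-ins for a real game loop.

```python
import random

def run_session(seed, inputs):
    """Simulate one game session; seeding the RNG makes it repeatable."""
    rng = random.Random(seed)
    state = 0
    for cmd in inputs:
        # A 1-in-100 roll stands in for rare nondeterministic behavior.
        if rng.randint(1, 100) == 1:
            state -= 10  # the "undesirable behavior"
        state += cmd
    return state

# The seed plus the inputs act as a "replay file": the same pair
# always reproduces the same run, including any rare failure.
replay = {"seed": 12345, "inputs": [1, 2, 3, 4]}
first = run_session(replay["seed"], replay["inputs"])
second = run_session(replay["seed"], replay["inputs"])
assert first == second  # replays are deterministic
```

The design point is that nondeterminism is not eliminated, just captured: any run can be replayed exactly by storing two small pieces of data.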

Let's say you get this far: You've identified a problem and you can reproduce it reliably. Now what?

This is when we start to practice the scientific method. This is where I've seen many, many people screw up, wasting time chasing ghosts and trying to fix problems they don't have.

The first step in this process is to write down your hypothesis. Seriously. Write it down. I'm not kidding. Open up notepad and write down your very best guess at what you think is causing the problem. Remember, this is a guess. It is NOT a diagnosis!

In science, when a scientist has a hypothesis about how something works or why something happens, they don't go around trying to prove that the hypothesis is correct -- instead, they try to falsify the hypothesis by finding contradictory evidence. The distinction here is ultra important! All you need is one counter-example to invalidate the hypothesis. If a hypothesis is invalidated, it was wrong, and we can move on to formulating a new one. Having a hypothesis invalidated is the opposite of bad: it's very good, because now you're one step closer to the truth.

I take the same approach to debugging. I write down my hypothesis. Then I try to find ways to prove it is wrong. It is more often the case that I have about 10+ different hypotheses which could explain the cause of the problem. Either zero or one of them is correct. To find the correct hypothesis, I begin running tests and collecting data to invalidate each possible hypothesis by counter-example and contradiction. I want to prove via negation that a hypothesis is impossible. Usually, this involves a lot of isolation of variables and potential causes. During every test, I also write down what I tried and what the results were. This helps me be more rigorous and formalized with my methodology. I also forget things, especially after hours and days of testing, and writing things down helps me not waste precious time repeating experiments.

The process of narrowing down multiple hypotheses to a single hypothesis is itself a skill which gets developed over time. Valuable knowledge isn't just about getting answers; it is also about asking the right questions. Let me illustrate with a scenario:

Imagine that one day, you meet an angel. It is a divine creature and meeting it is extremely rare. The angel tells you, "I will grant you the true answer to any question you want. What question would you like answered?" Some people might be tempted to ask about the meaning of life, or when they'll die, or next week's winning lottery numbers, but the very best question to ask is, "What is the best question to ask you?"
For, if you know the right question to ask, finding the answer is relatively easy.

So, when you're trying to negate/break your hypothesis, focus carefully on asking the right questions. Don't be afraid to take five to ten minutes to think about it. It is better to move slowly in the correct direction than to rush quickly in the wrong direction.

Finally, when you have a hypothesis which has withstood the onslaught of testing, you have something which is demonstrably strong. The hypothesis may still be wrong and you just don't know it yet, but it's an operating hypothesis now -- you can reasonably assume it is correct until proven otherwise. I have had operating hypotheses proven wrong on multiple occasions, and that is always a humbling moment for pause and reflection. Once you have an operating hypothesis, you can begin diagnosing and fixing the problem.

I won't get into how to fix problems. That's beyond the scope of this article, and kind of irrelevant because it's so subjective.

Once you believe you have fixed your problem, it is now time to TEST. Did the fix actually work? Is the problem behavior still persisting? If the problem is still there, then either your operating hypothesis is flawed or the fix is flawed.

If you think you fixed the problem, great! You have a new hypothesis to disprove! On multiple occasions, insufficient testing has led me to mistakenly believe I had fixed a problem when I really had not.

Why is this rigorous scientific methodology so important to follow? Because it works, it saves time, and it finds truth.

I once worked with a few novice programmers and sysadmin contractors in the US military. They were an embarrassment to the profession. Invariably, as with all software and IT systems, stuff would break or stop working. What did they do? They immediately rushed to the first hypothesis which came to mind, assumed it was true, made it their diagnosis, and began fixing the diagnosed problem. Their fixes would take days of effort and lots of coordination between groups in various parts of the organization. The problem is, they were often wrong. Very wrong. They'd apply a fix and it wouldn't fix anything and the problem would persist. Then they'd invent a new cause -- rinse, lather, repeat -- until eventually they got lucky and stumbled into the right answer, or everyone gave up and scrapped the project. Lots of fingers would get pointed, lots of baseless conspiracy theories were invented, etc. You can imagine the nonsense they put everyone through.

So, back to my debugging experience today:

I used the process I outlined above. I wrote down my initial hypothesis and devised a test to prove it wrong. I successfully proved it wrong. I invented a new hypothesis. Proved that one wrong as well. I wrote down about 18 different hypotheses before I eventually narrowed down the specifics of my problem.

The problem is that when I enter my game level with an Oculus Rift VR headset, the screen is entirely black. This only happens on "packaged builds" of my game -- the builds that are ready to ship to customers. This game-breaking bug slipped through my informal QA because I had made some dangerous assumptions. Oddly, the bug is particular to the Oculus Rift: if I repeated my reproduction steps with an HTC Vive, I had no issues.

Was it an engine bug or a project bug? I recreated a similar scenario in a new project using all of the same settings and failed to reproduce the error. So, it's a project-specific error. Was my level corrupted somehow? I made a duplicate of the level and started deleting half of my assets. If it magically worked, then I knew that one of the assets I deleted was the culprit. Sort of. It turns out that if my level has sub-levels, and those sub-levels are set to automatically load as the game loads, the Oculus Rift can't handle it in the latest version of the engine -- and only in my project. I have no idea why, but putting a two-second delay in the game followed by a manual load fixed the problem.

So, to recap:
1. Formulate a hypothesis.
2. Try to invalidate / negate your hypothesis through data collection. If it's invalidated, go to step 1; repeat until a hypothesis survives.
3. You have an operating hypothesis, now apply a fix.
4. Test your fix. "My fix worked" is your new hypothesis, return to step 1.
5. Go slow and be right.
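The recap can be sketched as a loop. This is a toy Python illustration, not anything from my project: the hypothesis names and the shape of the test results are hypothetical, and the point is simply that a hypothesis becomes "operating" only by surviving every attempt to invalidate it.

```python
def debug_loop(hypotheses, invalidated_by):
    """Keep a hypothesis only if no test disproves it.

    hypotheses maps a hypothesis name to its list of test results;
    invalidated_by(result) is True when that result disproves it.
    """
    log = []  # write everything down, as the article advises
    operating = None
    for name, results in hypotheses.items():
        disproved = any(invalidated_by(r) for r in results)
        log.append((name, "invalidated" if disproved else "survived"))
        if not disproved and operating is None:
            operating = name  # first survivor is the operating hypothesis
    return operating, log

op, log = debug_loop(
    {"engine bug": [True], "corrupt level": [True], "sublevel autoload": [False]},
    invalidated_by=lambda r: r,
)
print(op)  # prints: sublevel autoload
```

The log is doing real work here: after hours or days of testing, it's what stops you from repeating experiments you've already run.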

Okay, if you've read this far, here's the kicker: This process and methodology doesn't just apply to debugging software and IT systems. It applies to everything in life.

Problem: "People don't know my game exists."
Hypothesis: "If I create a bunch of Facebook posts / ads, people will know about my game."
Test: Create posts
Data Collection: Look at your web traffic and analytics. Did you see a change? Is your hypothesis invalidated?
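That data-collection step can be made concrete with a tiny check. Everything here is an assumption for illustration -- the 10% lift threshold and the visit counts are invented, not from any real campaign -- but it shows what "is your hypothesis invalidated?" looks like as a decision rule rather than a feeling.

```python
def hypothesis_invalidated(baseline_visits, post_campaign_visits, min_lift=0.10):
    """Return True if the traffic data fails to show the predicted lift.

    min_lift is an assumed threshold: the hypothesis predicted the posts
    would raise awareness, so we demand at least a 10% increase in visits.
    """
    lift = (post_campaign_visits - baseline_visits) / baseline_visits
    return lift < min_lift

# Hypothetical numbers: 1,000 visits/week before the posts, then after.
print(hypothesis_invalidated(1000, 1030))  # 3% lift -> True (invalidated)
print(hypothesis_invalidated(1000, 1500))  # 50% lift -> False (survives)
```

Deciding the threshold before looking at the data matters; picking it afterward is how people talk themselves into hypotheses the data already disproved.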

When you are armed with this process and methodology, you can find truth. When your actions are based upon truth, you will enjoy successful results based upon those actions grounded in reality.


Comments

jbadams
Indeed, we practice computer science, not computer voodoo!
April 12, 2017 02:01 PM
MagForceSeven

I agree with a lot of that in theory. In practice I've found it less useful, but I think that's because of working in a different environment -- which is to say, a collaborative studio environment as opposed to a one-man or small team. The main issue is that in order to come up with a hypothesis, you have to have some level of understanding of how things are supposed to work, so that you can hypothesize about potential failure points. I've routinely gotten bugs from QA or other devs that include hypotheses of what is wrong that make no sense when you know how the system works.

And I'm asked on way too many occasions to find & fix bugs in code I have only a passing familiarity with. Definitely not enough to form a hypothesis about what is broken.

For those types of bugs I've come up with the "Mindy Process", named after Mindy from the "Buttons and Mindy" segments of Animaniacs. If you're not familiar with that character, part of her shtick was asking someone a question and then responding to every answer with "Why?" (I swear I tried to find a video to explain it better with little luck).

1) Reproduce

2) Why are we doing that?

3) Oh, that's why. Is that supposed to happen?

4a) No? Make it not do that

4b) Yes? Something must be wrong earlier. Goto 2.

5) Reproduce

That process looks kind of like the loop of "Why? Because... Why? Because..." that Mindy gets into.

Now, this is mainly useful when you're able to work with a debugger and can break, step, inspect everything, and watch what the program does live. It also tends to be much less useful the lower the reproduction rate is, since you have to be able to stop the program when the case you care about is going wrong.

April 12, 2017 05:28 PM
slayemin

The beautiful thing about treating everything as a hypothesis is that you allow and expect it to be wrong, and it's only less wrong if you can't disprove it. If your QA team or other devs submit a bug with their hypothesis on what's wrong, they will most likely be wrong, but that's okay and totally allowed. The key is to treat it as "most likely wrong until proven otherwise".

The "Mindy Process" you use sounds like somewhat of a derivative of my larger process. When you ask "Why are we doing that?", you're asking a direction-finding question and looking for an answer, and the answer itself is the hypothesis. The answer could be wrong, which you allow for in #3, "Is that supposed to happen?", which is a form of testing the hypothesis for validity. The "Oh, that's why" part is the hardest to answer, because there could be 10+ reasons why something is happening, so you have to have a phase of narrowing down possibilities and causes. Sometimes the answer is really simple and straightforward (such as a divide-by-zero error), but other times the answer is going to be really complex and require a perfect storm of conditions to manifest undesired behavior.

My process works especially well when there is a lot of ambiguity surrounding a problem. If the problem is relatively simple and straightforward, the process still works; it's just proportionately a lot faster.

April 13, 2017 01:13 AM
laiyierjiangsu

I have learned something from it, thanks!

April 13, 2017 11:55 AM
MagForceSeven

I guess my point is that when using the debugger, I'm not trying to make hypotheses and test. I'm trying to work from empirical events. It can't be wrong, because I just watched it do the thing that causes the problem (or causes the sequence of events that cause the problem). If a variable is wrong, don't hypothesize about why it's wrong: set breakpoints or other catches so that the moment it becomes wrong, you drop into the debugger and see the empirical cause. This makes the answer to the question "Why?" not a hypothesis but fact.

I agree that they are related processes, but for some bugs it's faster (sometimes a lot faster, depending on the repro steps). For example, say that a variable has the wrong value and it's set from places A, B & C. You could hypothesize that it's B, change B, and reproduce to verify your hypothesis, which may or may not be right. Or you could trap A, B & C, see that C, as a fact, is setting it incorrectly, and fix C. Now, sometimes proper abstractions help with this. When a variable is private with a setter, you implicitly trap all change cases. But not all code is that convenient. Even if there are 10+ reasons, computers are great at complicated checks, and when you have the source you can make all sorts of temporary local changes to catch problems that you never have to actually check in. Trap them all!

Fundamentally this is what assertions are for, proactively identifying conditions that will cause problems further along the line if allowed to persist and stopping execution or logging it so that they can be identified and fixed.
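A small sketch of the "trap every change site through a setter" idea, in Python with an invented `PlayerState` class and an assumed 0-100 valid range -- none of this is from the commenter's actual codebase, it just illustrates the technique:

```python
class PlayerState:
    """Route all writes through one setter so every change site is trapped."""

    def __init__(self, health=100):
        self._health = health

    @property
    def health(self):
        return self._health

    @health.setter
    def health(self, value):
        # The assertion fires at the exact moment the value goes wrong,
        # so you catch the empirical cause instead of hypothesizing about it.
        assert 0 <= value <= 100, f"health set to invalid value {value!r}"
        self._health = value

p = PlayerState()
p.health = 50          # a valid write passes through the trap
try:
    p.health = -9999   # an invalid write from any call site is caught here
except AssertionError as e:
    print(e)           # prints: health set to invalid value -9999
```

Because every caller goes through the same setter, one assertion covers all of call sites A, B & C at once, which is exactly the "trap them all" point above.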

In addition, hypotheses have no place in a bug report. A bug report is an empirical report of something not functioning in the expected manner. The tester may have needed a hypothesis to build steps that reasonably reproduce the effect, but it's the repro steps that belong in the bug, not their working hypothesis. The distinction is most easily seen retroactively, when the problem has been found and the listed repro steps include things that have no actual impact on reproduction. I've sent many bugs back to test (or on to other programmers) with modified repro steps, saying that one or more of the steps are unneeded to actually reproduce the problem.

I think there is a place and time for both methods of testing (and as it happens, we're primarily in bug-fix mode at the office right now, so this is right at the forefront of my mind). The scientific method is great and useful for black-box testing (QA) and for low-reproduction-rate problems. This "Mindy Method" is primarily for white-box testing (i.e. when you have the source) and high-repro-rate issues. It really only helps if you can cause the problem to happen repeatedly to power the "Why? Because..." loop.

I'm not saying my screwdriver is better than your hammer. I'm just saying that I've had a lot more success handling screws with a screwdriver than with a hammer.

April 13, 2017 01:21 PM
MagForceSeven

One more thing. Like I mentioned in my original post, I tend to have to fix bugs in code that I don't know well, which usually leaves me with insufficient knowledge to form any sort of hypothesis in the first place, so I have to fall back on the facts of empirical results. So if there are 12 failure locations, I have to instrument them all, because I'll have little to no systemic knowledge that allows for any sort of culling.

April 13, 2017 02:51 PM