Three Guesses

Have you ever noticed that even when doing brute-force debugging, you still have an innate sense for which areas of the code are more important than others? Some sections of code have always worked well in the past or are merely "overhead" code that is irrelevant to the task at hand, so you skim over them. Other sections are so straightforward that you blow right by them, too. But some sections trigger a mental "hmm" reaction, so you study those areas more carefully. Seven times out of ten, you're correct. Sure, you occasionally have to backtrack to a function you originally stepped over, but most of the time, the bug really does exist in one of the areas you expected.

Think about what that means. By definition, the source of the bug is unknown. It could be anywhere. But you somehow used intuition to make a remarkably accurate judgment about which sections of the code to ignore and which sections were likely causing the bug. How did you know? Are you psychic?

Probably not. Probably your intuition came from having spent 8 hours a day for the past 6 months staring at your project's code. You know which sections are tricky and which areas are straightforward. You're familiar enough with the project to know which code has been recently modified and which code hasn't been touched in weeks. You understand what each function does, so you can mentally cross-reference that with the bug and know which functions can be dismissed as unlikely to cause the buggy behavior. This doesn't work if you've only recently inherited an unfamiliar code base from someone else, but if you've been working with this project for any length of time, you probably know the code better than you might realize.

You're still not convinced? Let's ask the question a different way. Have you ever spent hours unsuccessfully trying to track down a bug, only to have the answer come to you later when you weren't even looking at the computer at all? Most developers have. Maybe you were driving home, letting your mind wander when inspiration struck. Maybe you were dripping wet in the shower when you figured out the bug. Maybe you were explaining the problem to a coworker and said, "But the bug can't be X because of thus-and-so… unless… hmmm… wait, that's it!" But almost everyone has had these moments where they surprised themselves by knowing more about the code than they ever would have guessed. You have the power—believe it.

If you already had an estimation of where the bug was and knew to skim over the other code in order to focus on the important areas, then why did you bother to look at the other code at all? Rather than stepping across the entire code from start to finish, why didn't you just set a breakpoint on those two or three important areas and look only at them? Better yet, forget stepping in the debugger and create some unit tests to exercise only the important areas so you can narrow down the exact nature of the bug.

There are a few possible answers to these questions. You probably "know" the 50 states of America in the sense that you recognize them when you hear their names, but you probably couldn't recite all 50 from memory. Maybe the important sections of your code are the same way—you recognize them when you see them in the debugger, but it would have been hard to enumerate them otherwise. That's a reasonable theory, but it doesn't explain the whole story. Despite the "50-states-of-America" effect, if you've been working with a code base for a long while, you'll find you often (definitely not always, but still very often) can predict the approximate location of a bug without even looking at the code.

Still Don't Believe Me? I'll Prove It to You

Try this test. The next time you tackle a nontrivial bug that "could be anything," spend a few minutes thinking about the likely source of the bug before getting out the debugger. Think about whether you have seen anything like this before and what the cause was last time. Think about whether the bug sounds like a special case you forgot to handle when writing the code. Next, think about "the big picture" of your program and write down the three functions you guess are most likely to cause the problem. Just for fun, go even further and write down a guess about the specific problem each function is encountering.

Telling a Story

The steps for reproducing some bugs are so complex that they seem a mere coincidence. The program works fine if the screen background is a picture of a mountain, but crashes if the background picture is a river. The bug happens only when the user stands on one foot, or on Thursday when it's raining. It's often easy to imagine patterns where there aren't any. Be skeptical when you see strange patterns, and try to find a counterexample.

But if further testing confirms the wacky pattern exists, try to tell a story about the bug that accounts for the weirdness. Think in terms of alibis and opportunities: "I think the installation code did X, which caused Y, which enabled Z, and that's how the bug happened. But that fails to explain this other weird thing…." If any detail doesn't fit cleanly into your story, you're probably on the wrong track. When you find the correct answer, you'll usually know it because the correct story will cleanly tie up all the weird loose ends.

Not long ago, my team reported a bug that would sometimes crash our program, but only when 1) the computer was a slower model, 2) the computer had Microsoft Office installed, and 3) the program was run immediately after the installation finished without waiting more than a few seconds. We considered several theories—but none of them made a good story. We told ourselves, "This theory doesn't explain why you have to run it in the first few seconds, but that's probably just a false pattern, anyway." Of course, all our theories were wrong.

When we hit on the answer, we immediately recognized its truth, because it explained everything. As soon as our installer finished installing our product, it displayed a message announcing this; but even though the installer looked like it had terminated, it was actually still running in the background, installing a small demo for another of our products. The crash was caused when the user ran our product while the installer was still updating files. This explained weird condition 1 (on fast machines the install was finished before the user had time to start the product), and condition 2 (the demo depended on Microsoft Office and we didn't install it if Office wasn't present), and condition 3 (waiting a few seconds before running let the install finish).

For each theory about a strange bug, tell a story reconstructing how that theory could cause the bug. If there are any plot holes or unanswered questions, then put that theory aside and come up with a new one.

Consider the following bug:

BUG

The program sometimes (but not always) saves the user's documents to the root directory when it should instead save them to the user's home directory.

Now, here are three guesses about what might be causing that bug:

1. Maybe the GetNameOfCurrentUser function fails to identify the user for some reason, so we don't know which home directory to use, and therefore fall back to the root directory. Possibly that funky code that gets the user's name from his or her web browser connection is the culprit?
2. Maybe the WriteDocumentToDisk function tries to write to the correct home directory, but fails for some reason, so we fall back to the root directory. Possibly we're not checking whether we have write permissions over that home directory?
3. Maybe the SaveLegacyDocumentFormat function is screwing up somehow. That function is integral to this feature, and a coworker recently added a lot of tricky new code, so possibly the bug is somewhere in there?

Don't randomly guess. Don't make wild leaps of faith without evidence, such as "Well, maybe the user manually removed the disk from the disk drive in the middle of a file save and just forgot to mention that in the repro steps." If a theory sounds like a stretch, it probably isn't the case. And in many cases, you won't be able to think of three reasonable guesses. That's OK. This technique doesn't work 100 percent of the time. If you only get one good guess, then so be it.

Note

The bug described in the preceding text is one I actually encountered when writing an online publishing application. Based on your own experiences, can you guess what the cause of that bug was? It was the second guess: a permissions problem. That's a very common cause of trouble, and I'm sure you've seen something similar. If you haven't seen one yet, you will once you get started with .NET. As we'll see in Chapter 7, security is handled very differently in .NET than it was with traditional Windows applications, and developers need to carefully plan for security permission bugs.
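To make the second guess concrete, here is a minimal Python sketch of what checking the home directory before saving might look like. It is an illustration only, not the code from the application described above: resolve_save_directory and SaveLocationError are hypothetical names, and the sketch assumes that a loud failure is preferable to silently falling back to the root directory.

```python
import os
import tempfile
from pathlib import Path


class SaveLocationError(Exception):
    """Raised when the intended save directory cannot be used (hypothetical)."""


def resolve_save_directory(home_dir: Path) -> Path:
    """Return home_dir if it looks usable; raise instead of silently falling back."""
    if not home_dir.is_dir():
        raise SaveLocationError(f"home directory does not exist: {home_dir}")
    # Creating a file in a directory needs write and search (execute) permission.
    if not os.access(home_dir, os.W_OK | os.X_OK):
        raise SaveLocationError(f"no write permission for: {home_dir}")
    return home_dir


# Usage: a fresh temporary directory stands in for a writable home directory.
with tempfile.TemporaryDirectory() as tmp:
    print(resolve_save_directory(Path(tmp)))
    try:
        resolve_save_directory(Path(tmp) / "missing_home")
    except SaveLocationError as err:
        print("refused:", err)
```

A check like os.access is only advisory, because permissions can change between the check and the write. The actual write still needs its own error handling.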

Set Up a Test Case

After you've made your guesses, set up some test cases and step through just those sections of the code with the debugger. Was one of your guesses right? Assuming you're dealing with code that you know reasonably well, you might surprise yourself with how often your initial guesses turn out to be correct. Even if your guess isn't the cause of the bug you're hunting, it might actually point you at a different bug that's just waiting to happen. I once theorized a bug was caused by a failure to check whether the program had write permissions over a particular directory, but it turned out the bug was actually something else. Even so, I really had forgotten to check for write permissions, too. I wrote myself a note to fix that later, and then went back to hunting down the original problem.
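One way to keep yourself honest is to write the guesses down as data and attach a targeted check to each one. The Python sketch below is only an illustration under stated assumptions: Guess, first_confirmed, and MAX_GUESSES are hypothetical names, and the lambda checks are placeholders standing in for real test cases against your own code.

```python
from typing import Callable, List, NamedTuple, Optional

MAX_GUESSES = 3  # three guesses, then move on to another technique


class Guess(NamedTuple):
    suspect: str               # the function you suspect
    theory: str                # what you think goes wrong there
    check: Callable[[], bool]  # targeted test; True means the theory held up


def first_confirmed(guesses: List[Guess]) -> Optional[Guess]:
    """Run at most MAX_GUESSES targeted checks and stop at the first hit."""
    for guess in guesses[:MAX_GUESSES]:
        if guess.check():
            return guess
    return None  # no luck: switch to a different debugging approach


# Placeholder checks; in practice each would exercise the suspect code.
guesses = [
    Guess("GetNameOfCurrentUser", "fails to identify the user", lambda: False),
    Guess("WriteDocumentToDisk", "no write permission on the home directory", lambda: True),
    Guess("SaveLegacyDocumentFormat", "recent changes pick the wrong path", lambda: False),
]

hit = first_confirmed(guesses)
print(hit.suspect if hit else "no guess confirmed; try another technique")
```

Capping the list at three is deliberate: a fourth guess is ignored, so the sketch cannot turn into an open-ended guessing session.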

Guessing Wisely

I don't recommend you play multiple rounds of this guessing game. Make your three guesses and check them out. If you find the problem, great; if not, then stop guessing and move on to a different approach. A second set of guesses won't be nearly as good as the first. The point is to use your intuition as a tool, but don't spend too much time doing so, because intuition is powerful but not very reliable. Basically, this technique trades 5 minutes for a 20 percent chance of finding the bug with near-zero debugging effort. Given the low cost and the potential returns, you can afford to check a few guesses; but given the odds, if none of your first three guesses pays off, then cut your losses and fall back to the other debugging approaches.

And what if after several minutes you can't think of any good guesses about the cause of the bug? Then don't use this trick for that particular bug. This isn't a technique for investigating long shots. If you can only think of one likely theory and two that seem unlikely, then you should skip the other two and spend time only on that first one. Those other two aren't worth the effort. If that one theory doesn't pan out, accept that guessing won't solve this particular bug and move on to the next technique.

Ever Seen Something Like This Bug Before?

Sometimes the symptoms of the bug sound so familiar that you know exactly what the problem is without even reading the code at all. Development managers love these bugs because they get to sound smart and suggest a fix for the problem without even looking at the code. But bugs like that are rare. In general, you have to already know the code to make good guesses about its bugs. Otherwise, don't even bother playing.

I once worked at Microsoft helping to write Microsoft Outlook 2000, and even though I had nothing to do with the particular bug described next, I wonder if we could guess the likely cause based on the following description.

BUG

Microsoft Outlook correctly allows me to paste a link to a document from Lotus Notes into an e-mail I'm writing. Next I type some text (doesn't matter what), then I can drag-and-drop that link to a different point in the e-mail. But if I drag-and-drop that link again, Outlook crashes 100 percent of the time. This bug does not occur with any other type of document link—only Lotus Notes links.

Based on your own experience, do you know what might be causing this bug? Neither do I. It doesn't sound like anything obvious. Maybe the person who wrote the drag-and-drop code would recognize something from the description—after all, he's lived and breathed this code for the past 6 months. But maybe not. This bug sounds complicated, and he'll probably have to look at the core dump file or the logs. In fact, the description makes it sound like several components may be involved and it may well be that no one person knows all of the code well enough to make a reasonable guess about the cause of this bug. Don't even waste your time trying to make guesses here.

On the other hand, what about this one?

BUG

My legacy C++ code prints out values that appear corrupted when printed on screen. It should be printing floating-point numbers, but they all appear as zero. However, even though the numbers are printing as zero, other evidence indicates that the numbers aren't really zero because when the program adds them, I get valid results. Finally, the program eventually crashes when it tries to print out the numbers a second time. The logs of the program indicate the crash occurs right as I'm printing out the values.

Every C/C++ programmer in the world knows the cause of this one! Look at the manual for printf, figure out which of the funky format strings (%d, %f, %ls, etc.) you're incorrectly using, and then be grateful that the modern languages of .NET offer type-safe ways of printing output so that we no longer have to deal with bugs like this.
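Why would the numbers print as zero in particular? A mismatched printf format is undefined behavior in C, so nothing is guaranteed, but one plausible mechanism can be imitated safely. The Python sketch below assumes a little-endian platform on which a double is passed as eight bytes and a %d conversion reads only the first four of them as an int. The function name is hypothetical, and the sketch does not call printf itself.

```python
import struct


def as_percent_d_would_see(value: float) -> int:
    """Reinterpret the first 4 bytes of a little-endian double as a 32-bit int."""
    raw = struct.pack("<d", value)          # 8 bytes, IEEE 754 double
    return struct.unpack("<i", raw[:4])[0]  # low 4 bytes read as a signed int


values = [1.0, 2.5, 100.0]
print([as_percent_d_would_see(v) for v in values])  # [0, 0, 0]
print(sum(values))                                  # 103.5
```

Doubles with short binary fractions keep all their significant bits in the upper bytes, so the low four bytes are zero even though arithmetic on the real values still works. A value such as 0.1, whose fraction fills the low bytes, would show up as a nonzero garbage integer instead.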