
1) Never Ignore a Bug That Occurs Early in a Test to Investigate a Bug That Occurs Later

Imagine you oversee an assembly line for producing widgets. Suppose the process consists of ten steps: The first step cuts the metal, the second step smooths the edges, etc. One day you're told the widgets coming off the assembly line are defective, so you explore the system and notice that the widgets are getting smashed out of shape during step 6. You order the engineers to run a full diagnostic on step 6 to find the bug, but then you notice step 5 also has an issue. In step 5 the widgets are being dropped on the conveyor belt upside down, so step 6 isn't getting the input it expects. You think to yourself, "Well, that problem with step 5 is undoubtedly a bug we'll have to fix eventually, but right now we're already focused on the diagnosis of step 6. We'll finish fixing that bug first, and then we'll come back to the bug in step 5."

Not a very sensible way of doing things, is it? You know that the widgets are getting smashed during step 6, but that doesn't necessarily mean the root cause of the problem is in step 6. Maybe the root cause is earlier. You know the previous step has at least one bug, so it's conceivable that the step 6 bug is nothing more than a side effect of the bad output from step 5. Who knows, maybe fixing step 5 will automatically fix the problem of step 6 smashing the widgets. Maybe there really are two entirely unrelated problems, and if so you'll have to fix each of them one at a time. Yet it makes sense to take care of the earlier bug first (since that's definitely a problem) before spending time on the later bug (since that might turn out to be a mere side effect of the earlier bug).

But programmers ignore that obvious wisdom all too often. How many times have you seen someone attacking a bug, and along the way that programmer sees an assert or an unexpected result early on, yet he or she ignores it in order to focus on a different bug that happens later on? "Oh, that problem is something completely different from the bug I'm looking for," the programmer will say. "I'll deal with that later."

The reason for this behavior is perfectly understandable. We all know how important it is to stay focused on a single task when debugging—context switches cost precious time. So it makes sense to focus on one thing and not get distracted by side issues. When you're debugging some code and notice an opportunity for a performance optimization or some code refactoring, don't do it. Write a note to remind yourself to come back later, and then return to hunting down the bug. Don't get distracted.

However, if you're debugging some code and notice another (seemingly unrelated) bug that occurs before your original bug, well, that's a whole different story. In that situation, you must at least consider the possibility that this new, earlier bug is really the cause of the later bug you were originally hunting down. You must investigate the new bug, and if you determine that it truly is unrelated, then you may write yourself a reminder note about it and return to the original issue. But you may not ignore that earlier bug until you've fully convinced yourself it isn't the cause of the later bug.

It is extremely common for the later bug to be nothing more than a side effect of the first bug, and languages without memory safety, such as C++, are particularly prone to this sort of thing. If the program writes past the end of an array or reads uninitialized memory early in the execution, then everything from that point on is suspect. Eventually it crashes at a later point, but the real cause of the bug was the bad thing that occurred early on. Fortunately, .NET largely eliminates such memory problems, but there may still be occasional issues with legacy C++ code or the unsafe keyword of C#. Even without memory problems, it's still common to see one function return an incorrect value, which ends up causing problems later on in a completely different function.
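To make that last point concrete, here is a minimal sketch in Python (chosen only for brevity; the same pattern applies in C# or C++). Every name in it is hypothetical and invented for illustration. A parser quietly returns a bad value, and the visible failure shows up later in an unrelated function, unless an early assertion is allowed to speak up first.

```python
def parse_quantity(text):
    # Hypothetical parser with a bug: an empty field silently becomes -1
    # instead of being reported as an error.
    text = text.strip()
    return int(text) if text else -1


def check_quantity(quantity):
    # The "early" warning: complains at the point where the bad value appears.
    assert quantity >= 0, f"quantity must be non-negative, got {quantity}"
    return quantity


def allocate_slots(quantity):
    # A different function, far from the parser. A negative count quietly
    # produces an empty list rather than an error.
    return [None] * quantity


def first_slot(text, validate):
    quantity = parse_quantity(text)
    if validate:
        check_quantity(quantity)
    # The "later" symptom: indexing the empty list fails here, which makes
    # allocate_slots look guilty even though parse_quantity is the culprit.
    return allocate_slots(quantity)[0]
```

With validate set to False, an empty input fails with an IndexError inside first_slot, which points at the allocation code. With validate set to True, the same input trips the assertion immediately after parsing, which points at the real culprit. Dismissing that assertion to chase the IndexError is exactly the mistake this rule warns against.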

Naturally, it doesn't always turn out this way. Sometimes the earlier and later bugs really are unrelated. What should we do then? Well, first of all, we can only decide the two bugs are unrelated after we've fully investigated them both. But say we've already done this. In that case we can use our judgment about which issue to work on first. Maybe we're on the verge of solving one bug and don't want to switch focus to the other. No problem. Maybe one bug is a mere annoyance and the other is a huge showstopper that needs to be fixed for an upcoming demo. Go for the more important bug then. Maybe we have special expertise in the area of one bug but a teammate would be better suited for the other. OK, that's fine, too.

If the bugs really are unconnected, then we can attack them in whatever order we want. But until we've ruled out even the slightest chance that they might be connected, we have to attack them in the order they appear during the run of the program.

Why You Should Always Investigate Bugs in the Order They Appear

A teammate asked me to help track down a problem involving a custom-built hashtable. Everything he put in the table somehow disappeared. So I watched as he ran the test on his computer. Soon after the values were inserted into the table, a message jumped to the screen—"ASSERTION FAILED: Current time does not match the server"—and my teammate hit the Ignore button to dismiss the assertion.

"Wait, why are we ignoring that?" I asked.

"Oh, that's something different," he replied. "Our code doesn't yet handle the case where my computer is in a different time zone than the main server. It's a bug, but we'll fix it after we figure out this other problem with the hashtable."

Sounds reasonable. The two problems look unrelated. But to be on the safe side, I asked if we could investigate that assertion anyway. Five minutes later we had the answer to both problems: Because we weren't handling the time difference between the local machine and the server, the entries in the hashtable were dated 3 hours away from the "current" time of the server, so a validation routine incorrectly assumed the data was old and purged it from the table.

Boom! Two bugs killed with one stone. All we had to do was make a one-line change to fix the time zone bug, and then the hashtable bug disappeared. My teammate had spent hours tracking down that hashtable problem, and on every single run of his test, a big flashing assert jumped up screaming, "Look here! Look here!" and yet my friend continued to ignore that warning over and over again because he assumed that bug was unrelated to the bug he wanted.

Once you see any bug in the run of a program corrupt your data, then all bets are off from that point on. Any later bug could potentially be nothing more than a side effect of the first bug.
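The hashtable story can be reconstructed as a rough sketch. This is not the original code, which the story doesn't show. It is a Python approximation under several assumptions: the client's clock is taken to be 3 hours behind the server (the story only says "3 hours away"), the freshness window is assumed to be one hour, and all function and constant names are hypothetical.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(hours=1)         # assumed freshness window
CLIENT_OFFSET = timedelta(hours=-3)  # assumed: client clock is 3 hours behind


def insert(table, key, value, timestamp):
    # Each entry remembers when it was inserted.
    table[key] = (value, timestamp)


def purge_stale(table, server_now):
    # Validation routine: drop entries that look older than MAX_AGE
    # when compared against the server's clock.
    stale = [k for k, (_, ts) in table.items() if server_now - ts > MAX_AGE]
    for key in stale:
        del table[key]


def run(convert_to_server_time):
    server_now = datetime(2024, 1, 1, 12, 0, 0)
    local_now = server_now + CLIENT_OFFSET  # what the client's clock shows
    # The fix is this one expression: convert local time to server time
    # before stamping the entry.
    stamp = local_now - CLIENT_OFFSET if convert_to_server_time else local_now
    table = {}
    insert(table, "widget", 42, stamp)
    purge_stale(table, server_now)
    return table
```

Calling run(False) returns an empty table: the entry is stamped 3 hours in the server's past, so the purge treats it as stale the moment it is inserted. Calling run(True) keeps the entry. The "disappearing data" symptom lives in the table, but the cause is the unconverted timestamp, which is the very thing the ignored assertion was complaining about.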

