When a customer reports a bug that you can duplicate on your own computer, you can solve the problem with the debugger. But what if you can't duplicate the bug? One approach is to visually inspect your code, guess what the problem might be, and send the customer a new version in the hope that it solves the issue. (Repeat for as many iterations as needed, which will usually be a lot.) The faster approach is to have the program create logs that tell you what went wrong so you know where to focus your efforts. The logs might not contain enough information for a full diagnosis, in which case you may need to send the customer a new version with even more logging. But the original logs should at least give you a good starting point.
In physics, there's something called the Heisenberg Uncertainty Principle, which says that the mere act of observing an atomic particle changes it. You can measure a particle's position accurately, or its velocity, but not both. In essence, it's physically impossible to run certain experiments without influencing the results in undesired ways. That applies to debugging, too. Debuggers often subtly alter the execution in ways that make the problem appear to go away. When that happens, look for alternate debugging techniques. If the debugger hides the problem, consider using logs instead (and vice versa).
What sort of information should your program write in its logs? The short answer is, anything you would want to know if someone told you there was a bug in your function. It may vary slightly depending on your application, but good choices usually include things like
User input, including selected options and configurations
Execution flow
Errors, warnings, and handled exceptions
Key computed values
Information about the program itself, such as the version number
Most of the time, the bug will appear only with certain input, and the customer won't be able to tell you what that input was. For example, a customer once told me the data he entered in my program was not being correctly saved to the database. One glance at the customer's log told me that he was inputting data riddled with the ' (single quote) character, which is a special character in SQL. To insert a value such as foo'bar in SQL, you need to escape the embedded quote, and my code neglected to do that. Once I realized what the customer was doing, the bug was easy to fix. But if I hadn't gotten the customer's exact input from the log, the bug might have taken much longer to solve.
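A minimal sketch of that fix follows; the table and column names are made up for illustration, and in practice a parameterized command is safer than manual escaping:

```csharp
using System;

class QuoteEscapeDemo
{
    static void Main()
    {
        string rawValue = "foo'bar";

        // Doubling the single quote is the standard SQL escape:
        string escaped = rawValue.Replace("'", "''");

        // Hypothetical table and column names, purely for illustration.
        string sql = "INSERT INTO Customers (Name) VALUES ('" + escaped + "')";
        Console.WriteLine(sql);

        // Better still, use a parameterized command (e.g., SqlParameter),
        // which sidesteps escaping bugs entirely.
    }
}
```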
User input includes things other than the user's data, too. What about the selected options and configurations? It's useful to write those to the log so you can see them when analyzing the bug. I once wrote a migration program that had a jillion options which all interacted with each other heavily. It worked fine in the default cases, but customers reported all kinds of bugs that happened only if option X was selected while option Y was blank, or when options A, B, and C were all turned on at the same time. For the first version, our tech support spent hours listening to every customer describe every option on every single bug. But on the next version, my team made sure the program listed all the user choices in the log at the beginning of each use. That saved us tons of time diagnosing future bugs.
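A sketch of that idea, assuming the program's settings live in some key/value collection (the Hashtable here is just a stand-in):

```csharp
using System.Collections;
using System.Diagnostics;

class OptionLogger
{
    // Dump every user-selected option at the start of each run, so the
    // log itself answers "which options did the customer have on?"
    static void LogStartupOptions(Hashtable options)
    {
        Trace.WriteLine("=== User-selected options ===");
        foreach (DictionaryEntry entry in options)
            Trace.WriteLine(entry.Key + " = " + entry.Value);
    }
}
```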
When your program throws an unhandled exception (such as accessing the properties of a null object reference), the user will see a message that looks something like the one shown in Figure 5-1.
That message shows the stack trace of your program when it blew up, and it will be wonderfully helpful when tracking down the bug… or at least it would be, if the user gave you that information. But if she didn't think to copy that stack trace, then you don't have any hints about where the program crashed. That is, unless you sprinkle your log with lines that indicate where the current execution is: "Now entering function ChangeDisplayType", "Now in the middle of ChangeDisplayType", "Now leaving ChangeDisplayType". Obviously, too many of these entries will make your log unmanageable, but strategically place a few in your most important functions. Then, if your program crashes in the middle of execution, the log shows you where the code was right before it crashed.
Knowing the execution flow can solve bugs other than crashes, too. Suppose you have a long function with several if statements, a few loops, perhaps a switch statement, maybe even a goto! Your users tell you the code isn't crashing, but is returning an incorrect value, and you suspect the problem is in this function. What would you do if you could look at the customer's machine in a debugger? Set a breakpoint and verify the code is going into the if statements you expect and isn't breaking out of the loops earlier than you want, right? Well, logging isn't as flexible for this purpose as the debugger, but it's better than nothing. On your important functions, add a few strategic instances of "Taking the else case of _if (foo)_" so you know what your code was doing when the problem occurred. Once again, over-doing this will make your logs too long, but done carefully, it can solve many a bug.
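The branch-logging idea can be sketched like this (the function and flag names are hypothetical):

```csharp
using System.Diagnostics;

class FlowLogging
{
    static void ProcessData(bool foo)
    {
        if (foo)
        {
            Trace.WriteLine("Taking the if case of if (foo)");
            // ... main-line work ...
        }
        else
        {
            Trace.WriteLine("Taking the else case of if (foo)");
            // ... fallback work ...
        }
    }
}
```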
Sometimes, you want to see a stack trace of how your code got to a certain point. Maybe you're in the middle of a function and realize you were called with nonsensical arguments, so you want to log the stack trace of whoever called the function. .NET makes this easy: just instantiate a System.Diagnostics.StackTrace object and invoke its ToString() method.
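For instance (the function name and validation rule here are hypothetical):

```csharp
using System.Diagnostics;

class StackTraceLogging
{
    static void SomeFunction(int argument)
    {
        if (argument < 0)
        {
            // Capture the current call stack so the log shows
            // who called us with the nonsensical value.
            StackTrace trace = new StackTrace();
            Trace.WriteLine("ERR: SomeFunction called with " + argument
                + " by:\r\n" + trace.ToString());
        }
    }
}
```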
Good software tries to recover from unexpected errors as gracefully as possible. Maybe your program successfully recovered from that handled exception and maybe it didn't; but either way, if the customer reports a bug, wouldn't you want to know an unexpected error had recently occurred? That may be crucial to solving the problem. In Chapter 4, we discussed how assertions are a supplement to error handling, not a replacement for it. The assertion is only there to make the programmer aware of the problem. Logging is the same way. Even if you handle the unexpected warnings and errors, note them in the log anyway to make sure you have a record of everything that ever went wrong if you need to debug.
In fact, it's a good idea to be overaggressive when logging errors and warnings. Log anything that looks even slightly unusual or suspect. Now, showing the user a message box falsely claiming there was an error would be bad because it would unnecessarily scare the user. But if you log something as an error that turns out to be OK, then so what? The user will never read the log, so no harm is done by this extra logging. And those suspicious-looking "maybe-errors" might be the keys you need to solve a bug.
One project I worked on was a management tool that allowed the user to select a mailing list in order to change its properties. Now it's theoretically possible that your mail server might not have any mailing lists because no one bothered to create any. It seldom happens in practice, but it's a legal case. However, that situation is unusual enough that it deserves mention in the program log when it occurs, because maybe it indicates a legal (but very rare) situation, or maybe it represents a true bug.
It was fortunate that I did just that, because a customer later complained about a bug with a completely unrelated part of the program. The logs didn't say anything useful about the bug he reported, but they did indicate no mailing lists were found. I asked the customer about that and he said he did indeed have mailing lists, so then I knew our code was failing to detect the objects on the mail server. Presumably, the other bug he reported had a similar cause. Once I knew what to look for, solving both bugs was easy… but I wouldn't have known what to look for if I hadn't been careful to log all the not-necessarily-but-possibly errors.
Similar to logging execution flow in lieu of using a debugger, it's also nice to log key values at carefully chosen spots. Perhaps you're concatenating several strings to create the name of a file—a very common task in programming. Suppose your code tries to delete that file and the operation fails. Which set of log messages is more useful?
Option 1:
File could not be deleted.

Option 2:
File name is "C:\Program Files\MyCompany\foo.txt".
File could not be deleted.
Write that file name to the log before you use it. Think of all the possible bugs it could identify. One common bug in building file names is to forget to add a "\" character between the directory names. Viewing the file name in the log will instantly identify that problem. Or maybe you have the correct slashes, but you constructed the path wrong so that it points to the wrong directory. Again, the file name in the log will point this out. Or, maybe you built the path correctly and the problem is that somebody deleted that file. The log will give you the information you need to check that out, too. But if you hadn't written the name of the file to the log, you'd have no idea what the problem was.
Incidentally, the new System.IO.Path class has some great methods that make the task of building file names far easier than manually constructing them.
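For example, Path.Combine handles the separator for you:

```csharp
using System.IO;

class PathBuilding
{
    static string BuildDataFilePath()
    {
        // Path.Combine inserts the "\" separator between the two parts,
        // eliminating one whole class of file-name bugs.
        string dir = @"C:\Program Files\MyCompany";
        return Path.Combine(dir, "foo.txt");  // C:\Program Files\MyCompany\foo.txt
    }
}
```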
Resist the temptation to combine those two log statements into a single line: "Could not delete file such-and-such". Imagine if the operation threw an exception before it could write that log line identifying the file. Then you'd never know the file name, and you'd be left wondering if maybe the exception was caused because the file name was maybe badly formatted. So log the name of the file before you run the operation, and then log the result afterwards. Besides, logging that value as early as possible gives you one more way to track the execution flow, as described earlier.
There's a second advantage to logging key values: If you know the bug is occurring with particular data, you can search the logs for that term. Suppose you wrote some sales software and a customer tells you the program failed for customer Peter Johnson. A real-world log file might be thousands of lines long. So rather than reading all the log entries about customers that worked, you can run an automated search of the log for this customer's name, which takes you straight to the place you care about.
It's useful to log information about the program itself. Once you have multiple versions of your product shipping, then customers will report bugs, but may not know whether they have version 6.5 or 6.0 of your program. You could ask them to look it up for you, but it's even better to just automatically write the version at the beginning of your log. If you've installed any service packs or hotfixes on the customer's machine, it's good to indicate that in the logs, too. This way, you don't waste time unsuccessfully trying to reproduce a bug until you realize, "Oh, that bug no longer occurs with Service Pack 2! No wonder I couldn't reproduce it."
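One way to write the version at the top of the log, using reflection on the running assembly (the program name in the message is a placeholder):

```csharp
using System.Diagnostics;
using System.Reflection;

class VersionLogging
{
    static void LogProgramInfo()
    {
        // Write the assembly version at the top of every log so you never
        // have to ask the customer which build she is running.
        System.Version version = Assembly.GetExecutingAssembly().GetName().Version;
        Trace.WriteLine("MyProgram version " + version);
    }
}
```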
You might want to start each log with the current time and date, too. This is mostly just to detect if the user accidentally sends you the wrong log. But that happens surprisingly often. Many times, a customer has sent me a log and I've asked him to run some other test and send me the log then—except the next day, the date on the log indicates it's actually the same log as before.
The .NET class library also makes it easy to get the current time and date. It takes only one line: System.DateTime.Now.ToString().
In fact, it's good practice to start every line of your log with the current time, and maybe the thread ID as well. That level of detail isn't needed for most bugs, but it does come in handy once in a while. For instance, if you see that the time difference between two consecutive log lines is a full 20 seconds when the difference is usually only a second or two, that gives you a hint about where to look for a bug. Maybe it means nothing other than a slow server… or maybe it indicates an unreliable network connection that the rest of your code isn't able to deal with. The point with logs is that as long as the information doesn't hurt, throw it in, because who knows when it may help.
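One way to build that prefix (note that ManagedThreadId requires .NET 2.0 or later; the format string is just one reasonable choice):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class TimedLogging
{
    // Prefix every log line with a timestamp and the thread ID.
    static void WriteLogMessage(string message)
    {
        string prefix = DateTime.Now.ToString("yyyy-MM-dd HH:mm:ss.fff")
            + " [" + Thread.CurrentThread.ManagedThreadId + "] ";
        Trace.WriteLine(prefix + message);
    }
}
```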
Just as important as what you write to the logs is how you write it. Code is much more readable when you adopt patterns like consistent tabbing or naming conventions, and logs are the same way. A good log should allow you to identify at a glance major events such as when the code entered a nested function or began an important operation. This isn't hard to do; you and your team just have to adopt a convention that works for you.
The biggest difficulty with logging is that you have to decide what to log before the program runs—but at that stage, you don't necessarily know what data will be needed to solve your bugs. If you log too much, then it will be harder to find what you want; but if you log too little, then you won't get that crucial piece of information you need. If you must err, then err by logging too much, because then debugging is at least possible (though it may take longer than it would if your logs were less wordy). But try to structure your logs in a way to make searching easy. Remember that real-world logs will be thousands or tens of thousands of lines long. Always look for ways to make finding what you need in that volume easier.
One of the easiest ways to make logs more readable is to use tabbing to show things like nested function calls. Take a look at the following log output:
In function such-and-such.
Value of foo equals 6.
In function thus-and-so
Successfully validated data store.
Beginning for loop
Breaking out of for loop early
Leaving thus-and-so.
Leaving such-and-such.
Compare that to the exact same log tabbed differently:
In function such-and-such.
    Value of foo equals 6.
    In function thus-and-so
        Successfully validated data store.
        Beginning for loop
        Breaking out of for loop early
    Leaving thus-and-so.
Leaving such-and-such.
Of course, an eight-line log will be easy to read no matter how you format it, but as your logs get larger, the ability to effectively skim without having to read every line becomes essential. Judicious tabbing can provide a huge assist there by quickly identifying where one function ends and another begins. That's important, because it essentially allows you to "step over" rather than "step into" the log lines of a function. You might think it would be difficult to ensure the proper number of tabs is inserted on each log line, but fortunately, .NET provides a very effective helper class for automatically maintaining the tab count. We'll discuss that later in this chapter.
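.NET's Trace class, for instance, can maintain the indent level for you with its Indent and Unindent methods; a minimal sketch:

```csharp
using System.Diagnostics;

class IndentedLogging
{
    static void SuchAndSuch()
    {
        Trace.WriteLine("In function such-and-such.");
        Trace.Indent();     // every following line gets one more tab stop
        Trace.WriteLine("Value of foo equals 6.");
        Trace.Unindent();   // back out when the nested work is done
        Trace.WriteLine("Leaving such-and-such.");
    }
}
```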
Tabbing might be used to show relationships other than nested function calls, too. Some people find it helpful to indent the lines logged by a nested if statement or for loop. Many functions have a "main line" branch that does the bulk of the work, but there may be "side trips" in the function for which logging is important, but not as important as the main branch. If you think you may be referring to logs for a certain function often, but you suspect the logs from one part of that function will not be used nearly as often, then you might consider indenting that section.
Choose whatever convention makes sense to you; but try to be consistent about it. You want to be able to write all your logs with a structure that is immediately obvious to you.
When writing code, if you know you'll be referring to a particular section again, it helps to identify that section with a certain keyword in the comments. How many times have you seen a function begin with the comment "TODO: implement such-and-such" or "HACK: Fix this"? Writing comments like that is an excellent habit to develop. First, it shows anyone reading the code that this function has a known flaw and may need to be fixed someday. But even more important, such comments allow you to quickly search for that keyword. Before shipping, you can search for "TODO" to see if there's anything left undone.
So common is the task of searching for "TODO" and "HACK", in fact, that Visual Studio .NET will even automatically add any comment with those words to the Task List for quick reference. That lets you see the list of all unfinished items in one place, and double-clicking any item will take you straight to the appropriate line of code. It's a very handy feature.
You might not see your TODO statements in your Task List. This is because VS .NET has a default option to filter these statements out. Right-click the Task List, select Show Tasks, then select either All or Comment. That will cause comments with TODO, HACK, or UNDONE to show up. You can also define your own custom tokens in the Options dialog.
But regardless of whether you search for those keywords manually or let VS .NET do it for you, the magic depends on being consistent with your keywords. If you write "TODO" in one place, "REVISIT THIS" in another, and "UNFINISHED" in a third, then you lose the ability to quickly search the entire code base for such items. In that situation, you're reduced to skimming every single line of code in the entire project.
Logging works the same way. Say I give you a 10,000-line log file and I tell you about some bug in the program. Are you going to read the entire log file looking for the point when something went wrong? Or are you going to first search for the keyword "ERROR" or "WARNING", hoping that the bug was serious enough for the program to realize something was unexpected and log a message? It may turn out you're not that lucky. Maybe all the log messages are "normal" messages that the program couldn't identify as errors. But this plan is still worth a shot.
Note that searching the log for errors only works if you consistently start each logged error message with the same keyword, for instance "ERR" or "ERROR". I've seen developers write logs with no keyword in the error messages at all—searching for the root cause of a failure in those situations was impossible, and those developers had to read the entire log to find their problem. So don't log a message like "Unexpected termination of input file C:\input.txt—file is in improper format." Instead, say "ERR: Unexpected termination of…."
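One easy way to guarantee that consistency is to funnel every error message through a single helper:

```csharp
using System.Diagnostics;

class ErrorLogging
{
    // Because every error passes through here, the "ERR:" keyword is
    // guaranteed to be consistent, and therefore searchable.
    static void LogError(string message)
    {
        Trace.WriteLine("ERR: " + message);
    }
}
```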
Of course, you'll often have to resort to reading (or at least skimming) the entire log, anyway. Maybe there were no obvious "ERR" messages, so you don't have any way to automatically jump to the problem point. But even then, consistent use of keywords makes scanning the log easier. If you begin each major function with words like, "Entering function Foo", then you'll be able to decipher the execution flow from your log. Try to be consistent—don't sometimes say "Entering Foo" and sometimes "Now entering Foo". When logging the value of a variable, consistently say something like "VariableName = variableValue"; don't try to mix it up. Consistency equals readability.
Even though custom financial applications are supposedly the most common type of software in the world, I've never written one. I can imagine how it might look, though. Let's imagine a bank hired us to write a program that accepts some data about a customer, analyzes thousands of factors, and then returns a decision about that customer's credit worthiness. Say this program will be asked to process thousands of customers each day. A debugging log from that program might look something like this:
Begin analysis of John Doe.
Checking data #1
Checking data #2
... // Suppose there are dozens of log lines about this customer
Begin analysis of Jane Wang.
Checking data #1.
... // dozens of lines
Begin analysis of Derek Allen.
... // dozens of lines
Begin analysis of Barbara Smith.
Because we use a consistent naming convention, it's easy to search for where the next customer analysis begins. But there are so many of them that searching isn't so helpful, anyway. Even once we find a suspicious-looking line in the log, we may have to read the hundred or so lines before and after that point to understand the bug. Something that may help is if we add an eye-catcher, or obvious visual clue, about where one customer analysis ends and the next begins.
***********************************Begin analysis of Jane Wang.
Checking data #1.
... // dozens of lines
***********************************Begin analysis of Derek Allen.
... // dozens of lines
***********************************Begin analysis of Barbara Smith.
The eye-catcher doesn't give us anything new to search on, but it will let us visually skim the logs and see at a glance where a new significant operation starts. Hopefully, you'll be able to solve most bugs by searching for keywords. But if you ever have to solve a bug by manually reading the log, the eye-catcher will simplify your job. With only a glance, you can tell if the log is near the beginning, middle, or end of an operation. If the code is in an operation you don't care about, just quickly hit the Page Down key to skip ahead in the log to your next operation—in many cases, you won't even need to read the logs from that other operation.
Large programs have many parts, and on many bugs, you might not care about all of those parts. Perhaps you have reason to believe that a certain bug is contained entirely in component X, and in that particular case, all the logging from component Y is just unnecessary clutter that you'd rather filter out. One solution is for each component to write to a separate file—that way if you don't care about Y's logging, then you just ignore Y's log. But that's annoying, because for some bugs in the future, you will care about both X and Y, and then it's more convenient to see both logs in the same file. Another solution is to recompile Y without the logging since you don't need it for this bug—but that's a problem, too, because tomorrow you may need to add the logging back in order to track down a different bug.
So we need a way to sometimes log Y's messages and sometimes not. That's easy—it's just a simple if statement:
if (doLoggingForY)
    WriteLogMessage(...);
The only problem is that we want to switch Y's logging on or off without having to change the value of doLoggingForY and recompile each time. Rather than hard-coding that value to be true or false, we need to dynamically compute it at runtime based on something like a registry setting or a configuration file. Then we can change the logging without recompiling. This is called a switch. .NET's officially recommended method for turning switches on and off is to use an XML configuration file, and we'll look at that in just a moment.
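A sketch of such a switch using .NET's System.Diagnostics.BooleanSwitch (the switch and component names are made up):

```csharp
using System.Diagnostics;

class ComponentY
{
    // The switch name must match an entry in the application's
    // .config file, shown below.
    static BooleanSwitch logSwitch =
        new BooleanSwitch("ComponentY", "Controls logging for component Y");

    public void DoWork()
    {
        if (logSwitch.Enabled)
            Trace.WriteLine("ComponentY: starting work");
    }
}
```

And the matching configuration file fragment, which can be edited at the customer site without recompiling:

```xml
<configuration>
  <system.diagnostics>
    <switches>
      <!-- 0 turns the switch off; any nonzero value turns it on -->
      <add name="ComponentY" value="1" />
    </switches>
  </system.diagnostics>
</configuration>
```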
It helps to have a plan for organizing your switches. I've seen teams where each programmer has his own personal switch, so each person would see only the logging that he had written. That works well if each programmer is the sole maintainer of a relatively independent component, but it tends to break down when multiple people are co-maintainers, or when two components interact heavily. If two people both write useful logs in the same component, you'll have to turn on two switches instead of one to get all the necessary log messages. Instead, it's better to organize the switches by functional area rather than by programmer. Identify the major objects and subsystems; create a switch category for each one. It's OK to be slightly arbitrary about which category some message goes in. Some messages might not fall cleanly into any one category. That's all right—just do the best you can and don't stress too much.
Switches aren't just for filtering out the logging messages you don't care about. They're also used for turning off all logging entirely. Excessive logging can inflict a substantial performance hit on a program, so you usually disable some or all logging by default. Then you can turn on the logging if you need to diagnose a bug, but otherwise the customer won't have to worry about a performance hit.
Some developers like to use switches to create levels of logging. Level 1 logging gets only the most important information, level 2 gets everything from level 1 plus a little bit more, and so on. This way, they can choose how detailed the log file will be. In fact, the .NET switches support this behavior natively, as we'll see in a moment. But personally, I've never had much luck with that approach. I often know the general area a bug is in, but I almost never know in advance what level of detail I'll need to solve the bug. So I almost always set the log level to maximum. On the other hand, .NET's switches make this strategy so easy that you may want to experiment.
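The leveled behavior comes from System.Diagnostics.TraceSwitch, which exposes Off/Error/Warning/Info/Verbose levels; a brief sketch (switch name is hypothetical):

```csharp
using System.Diagnostics;

class LeveledLogging
{
    // The level is set in the .config file, so it can be changed
    // in the field without recompiling.
    static TraceSwitch generalSwitch =
        new TraceSwitch("General", "Overall logging level");

    static void Example()
    {
        // Only written when the level is Error or higher:
        Trace.WriteLineIf(generalSwitch.TraceError, "ERR: something failed");

        // Only written when the level is Verbose:
        Trace.WriteLineIf(generalSwitch.TraceVerbose, "detail: loop iteration 42");
    }
}
```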
The final advice about switches is to not get carried away. Don't spend too much time trying to create the perfect system in hopes that you'll never have to enable more than one switch for any bug. Realistically, you'll usually have to enable several switches before you have enough data to accurately judge where the bug is. And remember, it's better to gather too much information than too little. So if you want to play it safe by always enabling all the switches, then there's nothing wrong with that.