First Normal Form

A relation is in first normal form if the domains on which its attributes are defined are scalar. This is at once both the simplest and most difficult concept in data modeling. The principle is straightforward: Each attribute of a tuple must contain a single value. But what constitutes a single value? In the relation shown in Figure 2-11, the Items attribute obviously contains multiple values and is therefore not in first normal form. But the issue is not always so clear cut.

Figure 2-11. The Items Attribute in This Relation Is Not Scalar

We saw some of the problems involved in determining whether an attribute is scalar when we looked at the modeling of names and addresses in Chapter 1. Dates are another tricky domain. They consist of three distinct components: the day, the month, and the year. Ought they be stored as three attributes or as a composite? As always, the answer can be determined only by looking to the semantics of the problem space you're modeling.

If your system most often uses all three components of a date together, it is scalar. But if your system must frequently manipulate the individual components of the date, you might be better off storing them as separate attributes. You might not care about the day, for example, but only the month and year. Or you might care only about the month and day, but not the year. This isn't often the case, but it happens.

In the specific instance of dates, because date arithmetic is tedious to perform, it will often make your life easier if you use an attribute defined on the DateTime data type, which combines all three components of the date and the time. The DateTime data types allow you to offload the majority of the work involved in, for example, determining the date 37 days from today, to the development environment.

However, DateTime attributes can get you into trouble if you're doing comparisons on a single component. This is particularly true if you're ignoring the time component of a field. For example, if your application sets the value of a DateCreated field to the result of the VBA function Now, which returns both date and time, and then later attempts to compare it to the value returned by Date(), which returns the date only, you might get unexpected results. Even if you never display the time to users, it is being stored, and obviously "1/1/1999 12:30:19 AM" isn't the same as "1/1/1999".

Another place people frequently have problems with non-scalar values is with codes and flags. Many companies assign case numbers or reference numbers that are calculated values, usually something along the lines of REF0010398, which might indicate that this is the first case opened in March 1998. While it's unlikely that you'll be able to alter company policy, it's not a good idea to attempt to manipulate the individual components of the reference number in your data model.

It's far easier in the long run to store the values separately: {Reference#, Case#, Month, Year}. This way, determining the next case number or the number of cases opened in a given year becomes a simple query against an attribute and doesn't require additional manipulation. This has important performance implications, particularly in client/server environments, where extracting a value from the middle of an attribute might require that each individual record be examined locally (and transferred across the network) rather than by the database server.

Another type of non-scalar attribute that causes problems for people is the bit flag. In conventional programming environments, it's common practice to store sets of Boolean values as individual bits in a word, and then to use bitwise operations to check and test them. Windows API programming relies heavily on this technique, for example. In conventional programming environments, this is a perfectly sensible thing to do. In relational data models, it is not. Not only does the practice violate first normal form, but it's extraordinarily tedious and, as a general rule, inefficient.

Unfortunately, this is the kind of constraint that frequently gets imposed on you for historical reasons, but if you've any choice in the matter, don't encode more than one piece of information in a single attribute. If you're using legacy information, you can always unpack the data and store both versions in the recordset.

There's another kind of non-scalar value to be wary of when checking a relation for first normal form: the repeating group. Figure 2-12 shows an Invoice relation. Someone, at some point, decided that customers are not allowed to buy more than three items. I wonder if they checked that decision with the sales manager first? Seriously, this is almost certainly an artificial constraint imposed by the system, not the business. Artificial system constraints are evil, and in this case, just plain wrong as well.

Figure 2-12. This Data Model Restricts the Number of Items a Customer Can Purchase

Another example of a repeating group is shown in Figure 2-13. This isn't as obvious an error, and many successful systems have been implemented using a model similar to this. But this is really just a variation of the structure shown in Figure 2-12 and has the same problems. Imagine the query to determine which products exceeded target by more than 10 percent any time in the first quarter.