Changing Dimensions

So far in our discussion of dimension tables we've treated them as though they were largely static. We did discuss the use of surrogate keys to insulate the data from changes, but we haven't yet explored how to handle the changes when they occur. And they will occur, of course.

You have three basic choices when the values of attributes in a dimension row change:

You can overwrite the old values with new ones (losing history),
You can add a new row (maintaining all historical information, but complicating the design), or
You can maintain both the original and the current values (losing interim values)

Overwriting the old values is the simplest technique for handling dimension changes. This is the technique most often used in transactional databases that typically only care about the current state of entities. But dimensional databases are interested in historical trends, and of course simply overwriting old values causes problems in that context.

Say, for example, that I am added to a Customer dimension in March 2000 as a female living in Amsterdam, The Netherlands (as I was at that time), and $1,175.32 worth of sales transactions are added to the fact table, assigned to my CustomerID. In April of 2002, my address changes to Santa Fe, New Mexico (as it did), and an additional $1,053.97 worth of sales are assigned to me.

Now, the order entry system doesn't care that I used to live in Amsterdam. It just wants to know where to send my bill. But if the values in the dimension database were overwritten, and an analyst in 2004 asked for the sum of the last five years' orders placed by residents of New Mexico, the system would erroneously include the $1,175.32 I spent while living in Amsterdam. Oops.

So the dimension database does care about historical values, and you can't blithely discard or overwrite them. The only time this technique is used is when the original values were in errorif I had never lived in Amsterdam in the first place. The problem, of course, is in knowing whether a change uploaded from the source system is an error correction or an actual change, since most transactional systems don't track this information.

I'm afraid that tackling this problem is, once again, a matter of policy. It might be possible to alter the source system, although this is, frankly, unlikely. Occasionally management is prepared to live with educated guessesany change made within a certain time period, seven days for example, is considered a correction. After that, the system treats it as an actual change. Most often, the status of the change will have to be determined manually as part of the scrubbing process.

Assuming that you've determined that the change is, in fact, a change and not a correction, then you must choose one of the other two ways of handling it. Your choice will depend (as it so often does) on what the data means and how it is used. The choices, by the way, are not mutually exclusive. You may decide to create a new record when some fields change, but use an original/current pair for others in the same dimension.

Creating a new row when a change occurs is the most powerful solution, but it requires that the dimension identifier be expanded somewhat to include both a general identifier and a sequence number. So, for example, to maintain the fact that the Rebecca M. Riordan who currently lives in New Mexico is the same Rebecca M. Riordan who used to live in Amsterdam, the dimension would assign a general CustomerID of, for example, 35972, and a sequence number of 1 to the Amsterdam address and 2 to the New Mexico address. At any given time, my full dimensional identifier would be 35972-x, where x is the current sequence number.

In order to make it easier to pull out the most recent value, some analysts use an arbitrary number, typically 999, to indicate the current sequence. In this case, the system would have initially assigned 999 to the Amsterdam address, and then when I moved, assigned 1 to Amsterdam and 999 to New Mexico. Provided you can ensure that this policy is not overridden, this approach does, indeed, make "current value" analysis far simpler.

In addition to tracking changes, you sometimes need to know when a change occurs. There are two methods for doing this. You can store the effective date in the dimension table, which is useful for tracking the history of a particular dimensional entity or performing queries against the dimension table itself. Or you can store both the general identifier of the entity and the current sequence number in the fact table, which correctly and easily partitions the facts chronologically.

Most authors recommend the latter approach, but in reality they're not mutually exclusive, and I think it safest to combine them, storing an effective date in the dimension and the sequence number in the fact table. Remember, space is cheap...

The final approach to handling changing values is used when you don't need to track historical data so much as you need to track only the current and original (or previous) values simultaneously. This structure would, for example, make it easy to answer questions such as "How many of our New Mexico customers moved here from Amsterdam?" Now, this is a decidedly pathological example in a Sales environment, but it might make sense for a system that is interested in demographic trendsdisease control systems, for example, ask questions like this.

In a sales environment, the technique is used for so-called "soft" changes, when the effective date is somewhat fuzzy and for some period of time it's legitimate and reasonable to pretend that the change hasn't occurred. The classic example of this is product formulations. Manufacturing may begin production of YaYa Gel using formula B on the first of July, but there will probably still be a supply of YaYa Gel made using formula A in the warehouse, in transit, and in stores. In most organizations, there will be an unpredictable period of time during which the YaYa Gel sold could be made by either formula.

The next time you're in the grocery store, check the number of products whose ingredient list contains the warning that it "may contain peanuts or peanut by-products." The formulation of these products changes slightly with each batch, vegetable oil or peanut oil being chosen based on commodity pricing. These are the types of situations in which you might use the Current/Original or Current/Previous attribute pair to track changes.

These situations are rare, however. In reality, you will more often be faced with "do I care or not" questions than "in what way do I care" decisions, and when you are faced with one of the latter, it's been my experience that the choice between the new row and current/previous pair techniques is, in practice, obvious.