Team LiB
Previous Section Next Section

The Structure of a Dimension Table

A typical dimension table is much wider than a fact table or, indeed, a typical table in a normalized database. Whereas the majority of fact tables contain no more than three or four facts, and normalized databases with more than 20 or 30 attributes are automatically viewed with suspicion, dimension tables with 50 attributes are common, and 100 or more attributes are not unknown. In fact, this is the one situation in which it's reasonable to push the 255-field limit imposed by many relational engines.

Perhaps the best example of this is the time dimension that forms a part of almost every fact table, such as the date of a transaction or of a snapshot row. In the source (normalized) database, the date field will usually be a single SQL date. In a dimensional database, this single field will be replaced by a link to a dimension table with over a dozen attributes. Table 8-1 shows a typical example from a retail application.

Table 8-1. A Typical Time Dimension

Attribute

Example

TimeID

8302984

FullDate

14 June 2004

DayOfWeek

"Wednesday"

DayInMonth

23

DayInYear

173

CalendarYear

2004

Quarter

"Second"

FiscalYear

0304

FiscalPeriod

"2Q"

HolidayFlag

True

HolidayName

"Flag Day"

WeekdayFlag

True

WeekendFlag

False

BeginningOfMonth

False

MidMonth

True

LastDayInMonth

False

Season

"Spring"

EventFlag

False

EventName

""


There are several things to notice here. First, if you take a moment to review the attributes, you'll see that they attempt to provide direct support for every type of question an analyst might pose. Do we sell more potato chips on Mondays or Wednesdays? Do football games affect the sales of coffee? What are the products whose sales peak at the beginning and end of the month (standard pay dates)?

Second, the actual SQL date has been replaced by the surrogate key TimeID. This is probably less critical in time dimensions than in product lists, for example, but it is always good practice to replace whatever keys you extract from the source data with meaningless surrogate keys.

The primary reason for assigning surrogate keys is to insulate the dimensional database from changes to the key values in the source system. It really doesn't matter how vehemently the administrator of the source system insists that keys are not allowed to change. They will. Given enough time, the source system will change the method of establishing keys, perhaps because the transaction system is being replaced or because the organization has merged with another, or just because the administrator was bored. (Don't laugh. I've seen it happen.)

Or, worse, the source system will re-use old keys. This is a perfectly legitimate thing to do in a transactional system that typically isn't interested in the history of the organization. But re-using the keys would be devastating for a dimensional database that does track historical data. Does policy PL-3957 refer to the current homeowner's policy held by Peter Levinson, or to the Public Liability policy held by Acme Corp in 1957? The question would never arise in a transactional system, but it might in a dimensional database.

A secondary benefit of using a surrogate key defined by the dimensional database is that you can use a small, neat autonumber rather than whatever data type you receive from the source system. This isn't often an issue, but it can provide you with some extra ammunition if you're engaged in an argument with someone who doesn't understand how dimensional databases work. (This happens a lot.)

By the way, notice that the original SQL date is stored in the dimension table in the FullDate attribute. Obviously you should always keep the original value. You might also want to store the name of the transactional system that originally assigned the value as a primary key. This is useful if the systems change or if the keys are re-used. (We'll discuss the issue of changing dimensional values in detail later in this chapter.)

In examining Table 8-1, another thing to notice is that many of the attributes are (or at least could be) calculated based on one or more other attributes. Month, DayInWeek and DayInMonth, for example, are easily calculated from the full date. This would, of course, be a violation of normal form were this a normalized database. But it's not, so the rule doesn't apply.

Remember that the attributes of a dimension table are used to restrict queries. In practical terms, that often means that the available values will be displayed to users in a ComboBox. Now, if we take the Month attribute as an example, populating that ComboBox is the difference between

SELECT DISTINCT Month FROM Time;

with the "de-normalized" dimensional schema and

SELECT DISTINCT MonthName
FROM CalendarNames
INNER JOIN NormalizedDimension
ON CalendarNames.MonthNumber = Month(SQLDate);

in a normalized database schema. Even without doing formal performance testing, I'm prepared to bet that the first version of this statement is going to be a whole lot faster than the second. This performance benefit is even more evident when you're working with multiple dimensions, as you would be in a typical analysis. Remember: Storage space is cheap, time is expensive.

A similar principle explains the existence of the mutually exclusive WeekdayFlag and WeekendFlag attribute pair. Yes, selecting on

WeekdayFlag <> False

will return the same set of rows as

WeekendFlag = True

but the latter is easier for the user to understand and work with.

The next thing to notice about the attributes in Table 8-1 is that there are no abbreviations. I recently placed an Internet order and received an email confirming that I'd ordered 1 FB11 T WHi. Since the confirmation came almost immediately and I'd only ordered a single item, I was pretty sure that FB11 T WHi stood for a white, twin-size, feather bed. But ask me in a month and I won't have a clue.

The point here is not that this company sends slightly questionable confirming emails (although they do), or even that my day bed is more comfortable these days (although it is). The point is that FB11 T WHi is clearly the contents of the "description" field in the company's product master file (the product number is even worse: FB11-378095X). It probably makes perfect sense to the people who spend their working days ordering, boxing, and shipping feather beds, but as a product description in a dimensional database, it's hopeless.

To start with, you can't expect anyone who is more than a step or two away from the warehouse to remember that "WHi" stands for "White", particularly since it probably stands for "whip-stitched" somewhere else in the table. Whether it's "Wednesday" rather than "Weds", as in our table, or "White" rather than "WHi", spell the entire word out rather than asking the user to guess.

In addition to this problem, notice that there are several bits of information combined in "FB11 T WHi". It tells us that the product is a feather bed, that it's style 11 (presumablyit could be the 11th featherbed entered into the system; there's no way to know from inspection), that it's twin-sized, and that it's white. Now, there's nothing particularly wrong with embedding all this information into a single description field, provided it's also available in separate fields that can be used as selection criteria. Regardless of what the schema of the product master might be, the dimension needs to look something like Table 8-2.

Table 8-2. A Product Dimension Derived from a Sample Product Description

Attribute

Example

ProductID

10293810

ProductMasterID

"FB11-378095X"

Category

"Bedding"

Product

"Featherbed"

StyleNumber

11

Size

"Twin"

Color

"White"

Description

"Twin White Style 11 Featherbed"


Unfortunately, getting to this beautiful, neat, easily understood dimension from whatever the product master might look like, a process called scrubbing, is unlikely to be a simple process. It will almost certainly require human intervention, and can bring on as many political battles as the allocation of costs that we looked at in the last chapter.

The biggest problem you're likely to face is determining who "owns" the product masteror whatever dimension you're examiningin the first place. In the case of a product list, there are probably several groups who claim ownership: Manufacturing, Sales, Marketing, even Shipping and you'll either have to get them all on board with the scrubbing process or resort to a higher authority to resolve the politics.

Alternatively, you might be dealing with a Cinderella dimension that is doing everyone's work but is claimed by no one. It's not at all unusual for questions about a product master in particular to be met with a vague smile and a discussion of several systems that use product lists. This latter situation is actually a good thing for the dimensional database project. It means that the dimensional database can take ownership of the product master (or other dimension), scrub it once, and then keep other systems from mucking about with it. At least, that's the theory.

    Team LiB
    Previous Section Next Section