The Relational Model

The relational model is based on a collection of mathematical principles drawn primarily from set theory and predicate logic. These principles were first applied to the field of data modeling in the late 1960s by Dr. E. F. Codd, then a researcher at IBM, and first published in 1970.^[1] The rules of the relational model define the way data can be represented (data structure), the way data can be protected (data integrity), and the operations that can be performed on data (data manipulation).

^[1] Codd, E. F. "A Relational Model of Data for Large Shared Data Banks," Communications of the ACM, Vol. 13, No. 6 (June 1970).

The relational model is not the only method available for storing and manipulating data. Alternatives include the hierarchical, network, and Object/Data models. Each of these models has its advocates, and each has its advantages for certain tasks. But because of its efficiency and flexibility, the relational model is by far the most popular database technique and is the one discussed in this book. Both the Microsoft Jet database engine and Microsoft SQL Server implement the relational model.

We'll also be discussing the dimensional model, a special kind of relational schema used for tracking historical data. We'll look at the dimensional model and its relationship to the relational model in detail in Part III.

In general terms, relational database systems have the following characteristics:

All data is conceptually represented as an orderly arrangement of data into rows and columns, called a relation.
All values are scalar. That is, at any given row/column position in the relation there is one and only one value.
All operations are performed on an entire relation and result in an entire relation, a concept known as closure.

If you've worked with Microsoft Access databases at all, you'll recognize a "relation" as a "recordset" or, in SQL Server terms, as a "result set." Dr. Codd, when formulating the relational model, chose the term "relation" because it was comparatively free of connotations, unlike, for example, the word "table." It's a common misconception that the relational model is so called because relationships are established between tables. In fact, the name is derived from the relations on which it's based.

Notice that the model requires only that data be conceptually represented as a relation; it doesn't specify how the data should be physically stored. This separation of the conceptual and physical representations, although it seems obvious now, was a major innovation 30 years ago when database programming generally meant writing machine code to physically manipulate the data storage devices.

In fact, relations need not have a physical representation at all. A given relation might map to an actual physical table someplace on a disk, but it can just as well be based on columns drawn from half a dozen different tables, with a few calculated columnswhich aren't physically stored anywherethrown in for good measure. A relation is a relation provided that it's arranged in row and column format and its values are scalar. Its existence is completely independent of any physical representation.

The requirement that all values in a relation be scalar can be somewhat treacherous. The concept of "one value" is necessarily subjective, based as it is on the semantics of the data model. To give a common example, a "Name" might be a single value in one model, but another environment might require that the value be split into "Title", "Given Name", and "Surname", and another might require the addition of "Middle Name" or "Title of Courtesy". None of these is more or less correct in absolute terms; it depends on the use to which the data will be put.

The principle of closurethat both base tables and the results of operations are represented conceptually as relationsenables the results of one operation to be used as the input to another operation. Thus, with both the Jet database engine and SQL Server we can use the results of one query as the basis for another. This provides database designers with functionality similar to a subroutine in procedural development: the ability to encapsulate complex or commonly performed operations and reuse them whenever and wherever necessary.

For example, you might have created a query called FullNameQuery that concatenates the various attributes representing an individual's name into a single calculated field called FullName. You can create a second query using FullNameQuery as a source that uses the calculated FullName field just like any field that's actually present in the base table. There is no need to recalculate the name.

This is, of course, a simple example, and it's probably not immediately obvious that calling FullNameQuery provides any significant advantages over simply recalculating the full name. But as we'll see, the SQL language allows for the generation of extremely complex queries, and as the complexity increases, so do the benefits of using existing queries as input for new ones.