Before describing the database schema, you must make some decisions about the architecture your system requires. Unfortunately, the literature uses the term "architecture" to describe two distinct (although interrelated) models. To clarify things, I'll call one of these models the code architecture and the other the data architecture. But be aware that these are my names; you're unlikely to find them elsewhere.
What I call the code architecture is variously called the "application model," the "layered paradigm," and the "services model" in the literature. The code architecture describes the way code is logically structured. Code structure is largely an implementation issue, and as such it lies outside the scope of this book. However, the code architecture can affect whether data integrity constraints are implemented in the database schema, and so we will discuss it here, although somewhat superficially.
In the bad old days, system architectures were monolithic: huge blobs of code with minimal structure. Anyone who's had the misfortune of trying to modify (or even understand) a monolithic system of any complexity will never look at a plate of spaghetti in the same way again. To impose some order on this mess, programmers began to structure their code into discrete components in various ways: subroutines, modules, or objects, depending on language capabilities. The problem with this approach is that instead of creating spaghetti, you can easily create tortellini: independent chunks of code that interact in some way, but it's anybody's guess how.
To manage this modern pasta, many developers are organizing components into services, sometimes called "layers," which perform tasks at a discrete logical level. There are innumerable ways of organizing the layers. We'll look at two of the most common types: the three-tiered and four-layer models. (Please don't ask me why when you have three of them, they're "tiers," and when you have four, they're "layers." Just to confuse us, I suspect.)
The three-tiered model organizes components into User Services, Business Services, and Data Services. Code components that present information to users and respond to user actions are assigned to the User Services tier. The entire user interface is encapsulated within this tier. The Business Services tier is responsible for enforcing business rules and validating user input. Business Services components interact with both the User Services and Data Services tiers. Code components in the Data Services tier, which interact only with the Business Services tier, are responsible for maintaining data.
The three-tiered model is clean, but I've found its use in actual development problematic. There always seem to be certain types of functionality that don't clearly belong to any specific tier. For example, say a given piece of data needs to be formatted before it's displayed to users. A social security number might be stored as a string of nine digits but displayed as 999-99-9999. Does the formatting belong to the User Services or the Data Services tier? You could make a case for putting it in either tier. Similarly, is transaction management part of Business Services or Data Services? When you begin to design complex systems using hierarchical data and data shaping, these types of decisions can get hairy.
Provided that you're consistent, I suppose it doesn't really matter where you put these types of functions, but this is precisely where the model fails. If you must refer to a set of external conventions (such as "formatting belongs to the User Services tier, and building hierarchical data sets is a part of the Business Services tier"), then the overhead of the model begins to outweigh its benefits.
Dividing the code architecture into four layers instead of three eliminates many of the problems associated with the three-tiered model. The four-layered model, often called the "layered paradigm," organizes code components into a User Interface layer, a Data Interface layer, a Transaction Interface layer, and an External Access Interface layer, as shown in Figure 13-1.
Figure 13-1. The Four-Layer Model
The User Interface layer corresponds to the User Services tier of the three-tiered model. The User Interface layer is responsible for user interaction, including presenting information to users by means of window objects; responding to changes in the state of window objects, such as a form resize; and initiating user requests.
The Data Interface layer is responsible for maintaining data in memory (as opposed to maintaining it permanently, which is handled by the External Access Interface layer and the database engine, as we'll see). It explicitly includes most of the functionality that can be contentious in the three-tiered model, such as being able to format data and create virtual recordsets. (A virtual recordset exists only in memory; it's not permanently stored anywhere.)
In most cases, a component in the Data Interface layer is tightly coupled with a specific component in the User Interface layer. Theoretically, however, a Data Interface component can support multiple User Interface components, as Figure 13-1 shows. A system might include, for example, a Customer Maintenance form showing information for a single customer and a Customer Summary form showing information for multiple customers. Since both of these forms represent a Customer entity, they might share the code for formatting and validating the CustomerNumber, which is a function of the Data Interface layer.
In physical terms, a User Interface component usually corresponds to a form in Microsoft Visual Basic or Microsoft Access, and the associated Data Interface component is typically implemented in the form's class module. Certain procedures might be shared by multiple forms, however, in which case the procedures would be implemented in a shared module.
The Data Interface layer is responsible for validating data but is not responsible for business processes. The code that ensures that the CustomerNumber value specified in an order is known to the system, for example, is part of the Data Interface layer. The code that enforces a specific sequence of events, such as preventing an order from being shipped until the customer's credit is approved, belongs to the Transaction Interface layer.
The Transaction Interface layer coordinates the use of data by the application. Components at this level are responsible for building and initiating queries, receiving information from the External Access Interface layer, enforcing business processes, and handling errors and violations reported by the External Access Interface layer.
The components in the Transaction Interface layer are more likely to be reused than Data Interface components. Transaction Interface layer components are good candidates for implementation as objects in both Visual Basic and C#. A Customer object, for example, might expose an Update method that is called by multiple components in the Data Interface layer. Note that because the Data Interface layer should (at least ideally) be independent from components at the User Interface layer, the values to be updated must be listed explicitly. In other words, the call would take the following form:
MyCustomer.Update CustomerNumber, CustomerName ...
The Update method would then create the UPDATE query statement, hand it to the External Access Interface layer for execution, and handle any error conditions that might result, either by resolving them directly or by handing them back up the chain for display by the User Interface layer.
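That call chain can be sketched in a few lines. This is purely illustrative (in Python for concreteness; the class names, the stubbed `execute` method, and the Customers table are all hypothetical, not from any real library): a Transaction Interface component builds a parameterized UPDATE and hands it to a stand-in for the External Access Interface layer.

```python
# Illustrative sketch of the Transaction Interface / External Access split.
# All names here (ExternalAccess, Customer, execute) are hypothetical.

class ExternalAccess:
    """Stand-in for the External Access Interface layer."""
    def __init__(self):
        self.executed = []          # record what was handed to the "engine"

    def execute(self, sql, params):
        self.executed.append((sql, params))
        return True                 # pretend the engine reported success

class Customer:
    """Transaction Interface component for the Customer entity."""
    def __init__(self, external_access):
        self.db = external_access

    def update(self, customer_number, customer_name):
        # The values to update are listed explicitly, which keeps this
        # component independent of any particular form or interface.
        sql = "UPDATE Customers SET CustomerName = ? WHERE CustomerID = ?"
        ok = self.db.execute(sql, (customer_name, customer_number))
        if not ok:
            raise RuntimeError("update failed")  # or hand it back up the chain
        return ok

db = ExternalAccess()
customer = Customer(db)
customer.update("JONSCON", "Jones Construction")
print(db.executed[0][1])   # ('Jones Construction', 'JONSCON')
```

The point of the stub is the seam: because `Customer.update` only ever talks to the `execute` method, swapping the External Access component for one that speaks to a different engine leaves the Transaction Interface code untouched.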
The External Access Interface layer is responsible for communication between the application and external data sources. In database systems, the code components at this level handle the communication with the database engine. They execute queries and hand the results (including any error messages) back up the component chain.
Ideally, you should design the procedures at this level to isolate the transaction from the specifics of the database engine you choose. Theoretically, it should be possible to upsize an application originally designed to use the Microsoft Jet database engine to SQL Server simply by replacing the External Access Interface layer. In reality, this can be a little difficult to pull off.
Remember that the Transaction Interface layer is responsible for building queries; the External Access Interface layer only executes them. Given the syntactic differences between various implementations of the SQL language, implementation is rarely so straightforward. A recordset created by a single TRANSFORM query against a Microsoft Jet database requires multiple statements to be created using SQL Server.
If you can predict all the queries executed by the application in advance, you can avoid syntax problems by including the queries in the database schema. Parameter queries can be enormously helpful here. You don't need to create the SQL statement to select the customer named Jones on the fly; you can pass "Jones" in as a parameter to a pre-existing query, thereby simplifying your code architecture and probably improving performance.
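Here's a minimal sketch of the parameter-query idea, using Python's standard-library sqlite3 module as a stand-in engine (Jet and SQL Server would store the query in the schema itself, but the principle is identical: the SQL text is fixed, and only the parameter value changes at run time):

```python
# Parameter-query sketch: the statement is defined once; only values vary.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustomerID TEXT, CustomerName TEXT)")
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [("JONSCON", "Jones Construction"), ("SMITHCO", "Smith & Co")],
)

# The statement never changes; "Jones Construction" arrives as a parameter.
FIND_BY_NAME = "SELECT CustomerID FROM Customers WHERE CustomerName = ?"
row = conn.execute(FIND_BY_NAME, ("Jones Construction",)).fetchone()
print(row[0])   # JONSCON
```

Because the statement text is constant, the engine can also cache its execution plan, which is where the performance benefit mentioned above typically comes from.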
Unfortunately, it isn't always possible to predict all the necessary queries in advance, particularly if you're providing an ad hoc query capability to users. In this case it's almost impossible to insulate the Transaction Interface layer entirely. (At any rate, I've never found a complete solution to the problem; if you figure it out, I'll be grateful if you'd share it with me.) In the meantime, you might have to write some conditional code in the Data Interface layer components.
If your application is intended to support a single database engine, it's tempting to roll the Transaction Interface layer and the External Access Interface layer together. I don't recommend it. Although it does take a certain amount of time to design the External Access Interface layer, the process isn't difficult, and once written it will save you hundreds of lines of code elsewhere in the system. Furthermore, once you've written a component that communicates to, say, SQL Server 2000 using ADO.NET, you need never write it again. You can use it in any other systems you write, without modification. The only time you need a new External Access Interface layer component is when either the underlying database engine or the object model changes.
Code Architectures and the Database Schema
The code architecture you choose impacts the database schema in two areas: insulation of the External Access Interface layer (or the Data Services layer, if you're using the three-tiered model) and data validation. We've already talked about insulating the External Access Interface layer from a change in database engines by predicting necessary queries and including them in the database schema. This approach has the added advantage of improving performance, sometimes significantly. Data validation is a somewhat trickier issue. We'll look at the "what" and "how" of data validation in detail in Chapter 19. What we'll consider here is the "when" and "where."
Some designers advocate embedding all data validation functionality in the database engine itself. This approach isn't without merit: All of the data integrity constraints and business rules are implemented in a single place, where they can be easily updated. Unfortunately, the approach is not without problems, either.
In the first place, certain rules can't be implemented at the database engine level. For example, because the Jet database engine doesn't support triggers, it's impossible to enforce a rule that prevents a primary key value from being changed once the record is created. Even in SQL Server, which is generally more powerful in this area, you can't implement every rule directly in the database engine.
SQL Server Yukon allows triggers and stored procedures to be implemented in any .NET language, which will remove most of the effective limits on implementing constraints at the server level.
Secondly, waiting until the data has been submitted to the database engine before validating it can reduce the usability of the system. As a general rule, you should validate data as soon as it's entered. In some cases, this means validating the data as soon as a key is pressed, such as when you prevent the entry of alphabetical characters in a numeric field. In other cases, you should validate the data when a field is exited or when the last of a sequence of fields is entered, such as when you're enforcing the rule that the DesiredDeliveryDate value must be equal to or later than the OrderDate value.
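A field-exit rule like the delivery-date one above amounts to a small predicate the application can run the moment the second field is completed. A sketch (in Python; the function name and message text are illustrative):

```python
# Field-exit validation sketch for the rule: DesiredDeliveryDate must be
# on or after OrderDate. In a real form this would run when the user
# exits the second of the two fields, long before the record is submitted.
from datetime import date

def validate_delivery(order_date, desired_delivery):
    """Return an error message, or None if the rule is satisfied."""
    if desired_delivery < order_date:
        return "Delivery date must be on or after the order date."
    return None

print(validate_delivery(date(2024, 3, 1), date(2024, 3, 5)))   # None
print(validate_delivery(date(2024, 3, 1), date(2024, 2, 28)))  # the error message
```

The payoff is immediacy: the user gets the complaint while the dates are still on screen, not after the whole record has made a round-trip to the server.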
Even in a standalone application running on a single machine, submitting a data set to the database engine after each keystroke, or even after each field is exited, results in deplorable performance. If you're making a round-trip across a network or, heaven forbid, across a wide area network or the Internet to a database engine at a remote site, performance will be so poor your users will be better off using index cards (and they might).
The only solution, if all validation is handled by the server, is to submit the data to the database engine for validation after the entire record has been completed. But by this point, a user's attention will have turned to the next task. To report a problem that resulted from an entry made 10 minutes ago is disruptive and confusing.
To make the system as responsive and usable as possible, then, you must implement data validation in the application. If the database is used by only a single application and the validation requirements are relatively stable, you might decide to implement data validation only at the application level and entirely forego any validation by the database engine. This eliminates duplicate effort, but it's a fairly dangerous approach.
If another application is implemented against the same database in the future, nothing but good intentions prevent the new application from inadvertently undermining the integrity of the database, and we all know which road is paved with good intentions. Even if the database is never shared with another application, it's vulnerable to users using ad hoc tools such as Access or the SQL Server Enterprise Manager to manipulate the data. A strict security model that prohibits data changes except by the application itself can help prevent this, but at a cost of restricted access to the data, which may be inappropriate.
For these reasons, I believe that best practice is to implement data validation in both the application and the database schema. Access implements this automatically. When you define a validation rule for a field at the table level and then drag that field onto a bound form, the form inherits the validation rule.
In versions of Access prior to 2000, this feature is unfortunately also a good example of the problem with duplicate validation. If, after including the field on a bound form, you then change the validation rule at the table level, the changes are not reflected in the form. Microsoft Access 2000 does update the rule, but the problem remains in Visual Basic. If you reference the field on multiple forms (in the same or multiple applications), you must manually change the validation rules in each of the forms (or rather, in the Data Interface layer components supporting each of the forms).
To overcome this problem, you can query the database engine for validation rules at runtime. This technique has a certain amount of overhead, but if your validation rules change frequently, the overhead might be offset by the ease of updating the rules in a single place.
Retrieving the validation rules from the database engine can be done when the application first starts, when a form is loaded, or before each record is updated. I recommend doing it when the form is loaded. If you do it on application startup, you might download unnecessary information pertaining to forms a user never loads. If you do it before each record is updated, you can be absolutely sure that the rules you're working with are as current as possible, but it means re-querying the schema for each record and, realistically, how many systems are that volatile?
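The load-at-form-open pattern can be sketched as follows. Note that the ValidationRules table and its columns are entirely hypothetical (Access actually exposes rules as properties of the table definition, and SQL Server stores them in its system catalog), and sqlite3 merely stands in for the real engine; the point is the timing, not the storage format.

```python
# Sketch: pull validation rules from the database once, when a form loads.
# The ValidationRules table is hypothetical, used here for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ValidationRules
                (FormName TEXT, FieldName TEXT, Rule TEXT)""")
conn.execute("INSERT INTO ValidationRules VALUES "
             "('CustomerMaintenance', 'CustomerNumber', 'LEN = 7')")

def load_rules(form_name):
    """Called once at form-open time, not on every record update."""
    rows = conn.execute(
        "SELECT FieldName, Rule FROM ValidationRules WHERE FormName = ?",
        (form_name,)).fetchall()
    return dict(rows)

rules = load_rules("CustomerMaintenance")
print(rules["CustomerNumber"])   # LEN = 7
```

The dictionary then lives for the life of the form, so the per-keystroke and per-field validation costs nothing extra on the network.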
In the extremely unlikely event that a rule changes while a user has a form open and the user enters data that complied with the old rule but violates the new rule, the problem will be caught by the database engine anyway, so no major harm will be done. Besides, the very thought of tinkering with the database schema while the system is being used is enough to give me nightmares.
In addition to making decisions about how the code for your system should be structured, you must also decide on a data architecture. You will recall from Chapter 1 that a database system consists of a number of discrete components: the application itself, the database engine, and the database. (See Figure 1-1.) Based on the four-layer code model, we can now refine this structure somewhat, as shown in Figure 13-2.
Figure 13-2. A Database System Consists of Six Discrete Layers
In determining the data architecture for the application, you decide where each of the layers will live. Theoretically, each layer (or even each individual component) can exist on a different computer, communicating across a network of some kind. At the other end of the scale, all the components can exist on a single, standalone computer. In reality, a few more-or-less standard configurations have been shown to be effective in various situations, and we'll examine each of these in turn.
Each logical grouping of components in the data architecture is referred to as a tier. The simplest architecture, of course, is a single-tier system, in which all the components exist in a single logical tier, and the simplest version of a single-tier data architecture is a standalone system.
In a standalone system, all the components exist on a single machine and are available only to the user physically working at that machine. Although the machine might incidentally be connected to a network or the Internet, the database system is not available to other users. Because all processing occurs on the local machine and the data is stored locally, the only performance constraints are the capabilities of the machine: its processor speed and memory. Standalone systems tend to be very memory-hungry; that's one of the reasons other configurations are sought.
The majority of standalone systems use the Jet database engine, although Microsoft is encouraging developers to use the Microsoft Data Engine (MSDE), a standalone version of SQL Server distributed as part of the Microsoft Data Access Components (MDAC). It's certainly possible to implement a SQL Server system on a single machine, but for reasons that will become clear later, it's debatable whether such a system qualifies as single-tier.
A common variation in the single-tier architecture is the networked database. In this model, you physically locate the database (or at least parts of it) on a computer across a network, but all processing is performed locally.
Don't be tempted to put the application itself on the network drive. This is theoretically possible (but definitely not recommended) because of the load it puts on the network. Instead, you should put the application on the local computer and use linked tables to access the networked data.
A networked database (which is possible only using the Jet database engine, not SQL Server) allows multiple users to access the data simultaneously. The maximum number of simultaneous users of a Jet database is theoretically 255. In reality, the practical maximum depends on what they're doing and, ultimately, on how efficient the system design is. Clearly, 20 people entering data as fast as their sweet fingers can type are going to put more strain on a system than 50 people reviewing sales and pondering product strategy.
The need to reduce the network load is what most directly affects the database schema. Remember that all processing occurs on the local machine; in a sense, you treat the computer on which the database resides as nothing more than a remote hard disk. But response time across a network is typically much slower than access to a local hard drive. Furthermore, the network has a limited capacity, and all users on the system must compete for it. So you want to reduce the amount of information you're shuffling back and forth. Again, this is largely an implementation issue; if you're directly involved with implementation, I suggest you consult the sources listed in the Bibliography that accompanies this book for more information.
However, two aspects of the database schema can directly impact network performance: the location of database objects and the appropriate use of indices. I've already mentioned the importance of having the user interface objects stored locally. In addition, you might want to consider placing copies of data that doesn't change very often on users' machines.
For example, product lists tend to be fairly stable, and they're typically referenced frequently. Provided the Products table isn't too big (a few megabytes is fine; a gigabyte is pushing it), you might want to consider storing a copy of it on each user's machine. In most situations, this will reduce network traffic and improve performance. Of course, you'll need to provide some mechanism for updating the data when necessary, but this doesn't present too much of a problem. If the system is implemented using Microsoft Access, replication will do the job beautifully.
Lists of postal codes, states, countries, and an organization's regions or branches are also good candidates for local storage, since these lists tend to be small and stable. They also tend to be referenced frequently. (There's no point in storing a copy of a table that's used only by the system administrator once a year on every user's workstation.) Sales orders or lists of customers or students, on the other hand, are typically not appropriate for local storage. The data in these tables changes frequently, and sharing the most current version is the whole point of a networked database.
The second way the database schema can affect network performance is in the use of appropriate indices. You can think of an index as a sort of mini-table that is maintained in a specific order. It contains only the fields required to order the records and a "pointer" to the record in the real table, as shown in Figure 13-3.
Figure 13-3. An Index Is a Kind of Mini-Table, with Records Stored in a Specific Order
Note that I'm using the word "pointer" rather loosely here. This pointer isn't the same as a "memory pointer" or an "object pointer" as those terms are used in programming. In fact, the physical implementation of indices doesn't match the model presented here very closely at all, but the model will work. If you really want to know the ugly details, the Microsoft Jet Database Engine Programmer's Guide is a good place to start.
Instead of physically rearranging the records in the base table, a task that would be prohibitively time-consuming in most situations, Microsoft Jet sorts only the index file. This is (usually) substantially faster and also allows the base table to be quickly and easily accessed in multiple orders, since more than one index can be maintained for any given table.
SQL Server provides a special kind of index, called a clustered index, that does control the physical ordering of data. There can be only one clustered index per table.
The importance of indices from a performance point of view is that in many cases the database engine can perform an operation using only an index, without reading the base table itself. This often provides a noticeable performance improvement even for standalone systems (even reading data from a local drive takes time), and in networked situations the improvement can be critical.
To take just one example, say you have a Customers table containing 100,000 records, each 1500 bytes long. The application needs to find a specific record, say the one for Jones Construction, whose CustomerID is JONSCON. You execute a statement like the following:
SELECT * FROM Customers WHERE CustomerID = "JONSCON"
Without an index or a primary key declared for the table, the Jet database engine must read each of those 100,000 records to determine which ones match the specified condition. That's at least 150 megabytes of data being sent across the network. If the CustomerID is indexed, either explicitly or by declaring it a primary key, the Jet database engine needs to read only the index, which is a small fraction of that size, and can then quickly locate the correct record in the base table.
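To see why the index is so much cheaper, here's a toy model of the index-as-mini-table idea in Python (purely illustrative; real index structures such as B-trees are far more sophisticated): a sorted list of (key, pointer) pairs searched with a binary search, so the engine touches only the small index before fetching one record from the big table.

```python
# Toy model of an index: a sorted (key, pointer) list plus binary search.
import bisect

table = {  # "record pointer" -> full record (1500 bytes each, in theory)
    0: ("AAARDV", "Aardvark Ltd"),
    1: ("JONSCON", "Jones Construction"),
    2: ("SMITHCO", "Smith & Co"),
}

# The index holds only the key field plus a pointer, kept in key order.
index = sorted((rec[0], ptr) for ptr, rec in table.items())
keys = [k for k, _ in index]

def seek(customer_id):
    i = bisect.bisect_left(keys, customer_id)   # binary search on the index
    if i < len(keys) and keys[i] == customer_id:
        return table[index[i][1]]               # one fetch from the base table
    return None

print(seek("JONSCON"))   # ('JONSCON', 'Jones Construction')
```

With 100,000 records, the binary search examines roughly 17 keys rather than reading 100,000 full records, which is the whole performance story in miniature.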
The performance improvement from using indices can be amazing, but of course you don't get it for free. There's a certain amount of overhead involved in using indices: Every time you add or update a record, the database engine must update the indices on the table. Usually this overhead is negligible, but if you are using too many indices on any one table, it can start to impact performance. Taken to extremes, the time required to maintain the indices will exceed the time saved by using them.
In a two-tiered architecture, the database and the database engine are both located on a remote computer. They can be on the same computer or on different ones. In fact, the database can be spread across several physical computers; logically, the system will still be two-tiered. This architecture is possible only using SQL Server or another database server such as Oracle; it is not possible with Microsoft Jet.
At first glance, the difference between a networked database and a two-tiered system might not seem great. Use Microsoft Jet or SQL Server on a remote computer; what's the big deal? The big deal is that in a networked database all the processing is performed on the local workstation, but in a two-tiered system the processing is spread across two processors. The workstation is responsible for handling user interaction, and the remote computer handles the data access. SQL Server performs all the data manipulation, including query execution, and "serves" the results to the client workstation. For this reason, two-tiered database systems are better known as client/server systems.
To make the difference between a networked database and a two-tiered system clearer, take the following SQL statement, which we used when we talked about the importance of indices:
SELECT * FROM Customers WHERE CustomerID = "JONSCON"
In a networked database, the Jet database engine will read the index (assuming one is present), determine the correct record, and then fetch the record. The index and the record are both transported across the network. In a client/server system, the application will submit the statement to SQL Server and get the correct record in return. Only the record is transported across the network. (Of course, what actually happens in each situation is rather more complicated, but this simple model will serve.)
With a request like this, the performance of either architecture will be good enough that you might not see much difference between a networked database and a client/server database. But with complex queries and scores of users, using a client/server system significantly improves performance and responsiveness.
In addition to reducing network traffic, system responsiveness can also improve in a client/server system because while the server is busy calculating the results of a command, the workstation can be doing something else, like responding to additional user requests. The reverse is also true. While the workstation is busy responding to a user (or waiting around for the user to do something), the database server is free to process other requests. SQL Server is more sophisticated than Microsoft Jet in many ways, but this system responsiveness is essentially why client/server systems can support more users than a networked database. This is yet another case of the whole being greater than the sum of its parts.
To make a client/server system work, you must push as much data processing onto the server as possible. Whereas in a networked database it often makes sense to store queries locally, in a two-tiered client/server system they should remain on the server.
If you're linking to SQL Server through Access, you need to be careful about issuing commands that will execute locally. For example, a SELECT statement that contains a user-defined function will be executed by Access rather than being passed to SQL Server, because SQL Server doesn't support Access functions in SELECT statements (neither does Microsoft Jet, which is why a query containing a user-defined function can't be executed from Visual Basic, even if the query is stored in an .mdb file). If you force local evaluation, you'll lose all the benefits of having the database engine perform the data manipulation.
From an implementation standpoint, there are many more issues to consider in building client/server databases. Again, I refer you to one of the many excellent books available on the subject, some of which are listed in the Bibliography.
Spreading the processing load between two systems in a two-tiered system can, if implemented correctly, significantly improve application performance and responsiveness. Spreading the load across additional systems can provide similar benefits. Referring back to the four-layer code architecture shown in Figure 13-1, it's usually the Transaction Interface layer and External Access Interface layer components that are distributed to additional intermediate systems.
Unfortunately, the implementation complexity seems to increase exponentially, as well. Connectivity, security, process management: all these issues become infinitely more complex when you move to three or more logical tiers. Since the complexity of these systems often requires additional servers of various descriptions, such as the Microsoft Transaction Server, these systems are usually referred to as "n-tiered." (Apparently, with physical tiers, as with ex-husbands, one stops counting after the third.)
Fortunately, this implementation complexity is just that: an implementation issue. Your development team might decide to explore a new career in basket weaving, but an n-tiered architecture doesn't affect the design of the database very much. You must be especially rigorous in keeping the logical levels of the code architecture distinct, but a database schema that performs correctly in a two-tiered environment should scale to n-tiers without alteration.
Deploying a database across the Internet or an intranet is essentially a special form of n-tiered architecture. The specifics of the technologies are different: you'll use HTTP as a transfer protocol, and the user interface is more likely to be Internet Explorer than Access, but logically the architectures are very similar.
The most important difference between a database system deployed on the Internet and one deployed in a more traditional environment is that the Internet is stateless. Unlike an Internet application, in a typical client/server environment the application will request a username and password when it first starts up and then use this information to connect to SQL Server. Once the connection is established (assuming the username and password are accepted), the server usually maintains the connection for the duration of the session. As long as the connection is maintained, the server knows who the client is, and when the client makes a request, it can respond appropriately. All this "I know who you are" business is called state, and it's maintained by the database server.
When the database system is deployed across the Internet, however, the database server will no longer maintain state information. Each time the application makes a request from the database server, it must reestablish the connection and re-identify itself. Once the database server has complied with the request, it forgets all about the application that made it.
Detached architectures used to be restricted to the Internet and intranets, but ADO.NET enforces this model regardless of the physical architecture, so they're becoming difficult to avoid.
Under most circumstances, the slight overhead involved in making a new connection for each request has very little impact on the database system, and it certainly doesn't affect the database schema. But the stateless nature of the Internet has another implication that does affect the application and might also require changes to the database schema.
Most Internet applications aim for a thin client. This means that applications do as little processing as possible on the client, usually just the user interface. But consider the case of a query that returns a lot of records, more than can be reasonably displayed on a single screen. In a traditional application, the result set is cached, either on the client or the server.
But in an Internet application, the results can't be cached on the server side because the server wouldn't know where to send the next batch. If they're cached on the client side, the data-handling components (the Transaction Interface and External Access Interface layers) must reside on the client as well, and these components are certainly not "thin." They're decidedly pudgy, in fact.
ActiveX Data Objects (ADO) and the .NET Framework both provide a mechanism called paging for handling these situations. Paging allows you to return a specified number of records in the resultset. It's rather like the TOP N clause of a standard SQL SELECT statement, except that you have a "middle N" capability as well.
On the server side, paging results in the re-execution of the query each time a user requests a new page. For queries having a reasonably fast response time this represents no problem. For complex queries that are relatively slow to calculate, you've got big problems. An application that makes a single user wait a couple of minutes for a response is marginally acceptable. An application that makes several thousand users wait minutes every time they want to see the next five records is going to be pitched in the rubbish bin where it belongs.
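The re-execution pattern behind server-side paging can be sketched as follows, with sqlite3 standing in for the real server and LIMIT/OFFSET doing the "middle N" work (SQL Server's own paging syntax differs; the table and function names here are illustrative):

```python
# Server-side paging sketch: the query runs again for every page request.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Orders (OrderID INTEGER)")
conn.executemany("INSERT INTO Orders VALUES (?)",
                 [(i,) for i in range(1, 13)])   # twelve orders

def fetch_page(page, page_size=5):
    # The whole query is re-executed on every call; only the requested
    # slice of the result ever crosses the "network."
    return conn.execute(
        "SELECT OrderID FROM Orders ORDER BY OrderID LIMIT ? OFFSET ?",
        (page_size, (page - 1) * page_size)).fetchall()

print(fetch_page(2))   # [(6,), (7,), (8,), (9,), (10,)]
```

Notice that `fetch_page(2)` doesn't reuse anything from `fetch_page(1)`; if the underlying SELECT takes two minutes, every page costs two minutes, which is exactly the problem described above.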
If you're faced with a complex query in a detached application, you have a couple of options. The first, and in most cases preferable, option is to optimize the query to within an inch of its life. Create temporary tables, de-normalize the data, do whatever you need to get the response time down to an acceptable level.
If this isn't possible, you might have no choice but to create a fat client by moving the data-handling components out to the client side. This architecture allows you to cache the query results on the client side, effectively reproducing an environment similar to a networked database (with a few complications, just to keep you on your toes).
However, as a general rule, fat clients are more appropriate on an intranet than the Internet, Microsoft's Web Services Architecture notwithstanding. Many people have a perfectly reasonable aversion to downloading code components, and that aversion is far more likely to be a problem in an application that's publicly available than on an intranet you control. If it does arise, you can only hope that your content is sufficiently valuable to overcome their resistance.