GENETIC ALGORITHMS AND DATABASE INDEXING

Finding the best set of indexes

Joe Celko

Joe is a columnist for DBMS magazine and a member of the ANSI X3H2 Database Standards Committee. He can be reached through the DDJ offices.


Imagine you've just declared all the tables for a SQL database and now have to declare primary and secondary indexes. Most SQL implementations automatically create primary indexes (which assure the uniqueness of primary keys) from the table declarations. However, secondary indexes, which optimize update, delete, insert, and query search time in the database, must be explicitly created; they can also be unique, although they most often are not.

The easy way out is not to declare secondary indexes, because small test databases run fine without them. But when the test database grows into a large production system, you're hit with a big performance problem. At this point, the solution might seem to be to swing to the other extreme and index everything.

But while query performance will almost certainly improve, since the query optimizer can find an index whenever it needs one, the performance of inserts, updates, and deletes will suffer. This is because every time the database changes, so do the indexes. If an index is hashed, the hashing algorithm will also have to handle collisions unless it produces a relatively even distribution. This means that a change to the database costs one table access plus one or more accesses per index, which can slow performance in databases with millions of rows.

Just how many indexes are there to choose from? Given a table of (n) columns, there are far more than (n) possible indexes on it. An index can be built on more than one column, and the order of the columns within the index is important. The formula for the number of possible indexes is the sum of all the permutations of the (n) columns; see Example 1.
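
Example 1: The number of possible indexes on a table of (n) columns, where P(n,k) = n!/(n-k)! is the number of permutations of the (n) columns taken (k) at a time.

  P(n,1) + P(n,2) + ... + P(n,n)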

Now consider a table with three columns, T(a,b,c). The 15 possible indexes on table T are (a,b,c), (a,b), (a,c,b), (a,c), (a), (b,a,c), (b,a), (b,c,a), (b,c), (b), (c,a,b), (c,a), (c,b,a), (c,b), and (c); counting the option of no index at all, that makes 16 choices. The optimal solution is to declare only the indexes you absolutely need, balancing query time against change time. This means that, as always, you need to know something about the transactions that will run against the database.

If the job is all queries, with no changes, against only one column in a table, that column needs to be indexed; any other index would never be used by the queries or updates and would only take up space. Such a table is commonly called a "lookup table," and since the indexed column is probably its primary key, it's indexed automatically. (Note, though, that many databases have a column that is used extensively for access but isn't the primary key.) Likewise, if no queries or changes use a particular column, it doesn't need to be indexed. A secondary index is often used to establish a one-to-many or many-to-many relationship with another table.
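
To check that count, a few lines of Python will enumerate every possible index on a three-column table (a quick sketch, not part of any experiment described here):

  # Enumerate every possible index on a table: each index is an
  # ordered, nonempty sequence of distinct columns.
  from itertools import permutations

  def all_indexes(columns):
      result = []
      for k in range(1, len(columns) + 1):
          result.extend(permutations(columns, k))
      return result

  print(len(all_indexes(('a', 'b', 'c'))))  # 15 -- 16 choices counting "none at all"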

Real database problems lie somewhere between these two extremes.

Finding the Best Set of Indexes

To make life easier, let's consider only queries. Given a database schema and a set of queries, we want to find the best possible set of indexes no bigger than some limit, say (i).

Finding the optimal indexing arrangement is known to be NP-complete (see the references). This means that the work needed to find a guaranteed-optimal solution grows explosively as the number of items in the problem increases. As a rule of thumb, if the only way to solve a problem is to inspect all possible combinations of something, that problem will likely be difficult to crack.

This doesn't mean you can't optimize indexing for a particular database schema and set of input queries, but it does mean you can't write a program that will do it efficiently for all possible relational databases and query sets.

The query information is usually given as a statistical model of the expected inputs. For example, you might be told that 80 percent of the queries will use the primary key and 20 percent will use another column picked at random. This is about what you'd know in real-world situations, since most of the accesses will be done by production programs with embedded SQL, with only a small percentage of ad hoc queries.
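
A model like that is easy to turn into a query generator. Here is a minimal sketch in Python, assuming the primary key sits in position 0 of the chromosome:

  import random

  def random_query(n_columns):
      # 80 percent of queries use the primary key (position 0, by
      # assumption); the rest use one other column picked at random.
      query = [0] * n_columns
      if random.random() < 0.80:
          query[0] = 1
      else:
          query[random.randrange(1, n_columns)] = 1
      return query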

However, NP-complete problems often have efficient algorithms that come close to the optimal answer. Farshad Fotouhi and Carlos E. Galarce of the Wayne State University Computer Science Department, for instance, have proposed using genetic algorithms to search for near-optimal indexing.

The idea behind genetic algorithms is to mirror natural selection in order to evolve better solutions. You create a set of "chromosomes" that can be modified by a set of operators, based on feedback from their environment. A group of "genes" forms a chromosome, which is handled as a unit. Each generation, the fitness (the "survival quotient") of each chromosome is measured by some adaptive plan. The starting point is usually a random mixture of chromosomes, and the experiment is run for a fixed number of generations.
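
In outline, the whole process fits in a page of code. Here is a minimal sketch in Python; the population size, mutation rate, and fitness function are placeholders, not values from any particular experiment:

  import random

  def evolve(fitness, chrom_len, pop_size=20, generations=10,
             mutation_rate=0.05):
      # Start with a random mixture of chromosomes.
      pop = [[random.randint(0, 1) for _ in range(chrom_len)]
             for _ in range(pop_size)]
      for _ in range(generations):
          # Measure each chromosome and let the fitter half survive.
          pop.sort(key=fitness, reverse=True)
          survivors = pop[:pop_size // 2]
          # Refill the population by crossing over random pairs of
          # survivors, with an occasional mutation flipping one gene.
          children = []
          while len(survivors) + len(children) < pop_size:
              a, b = random.sample(survivors, 2)
              cut = random.randint(1, chrom_len - 1)
              child = a[:cut] + b[cut:]
              if random.random() < mutation_rate:
                  child[random.randrange(chrom_len)] ^= 1
              children.append(child)
          pop = survivors + children
      return max(pop, key=fitness)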

Fotouhi and Galarce's approach uses a single table of library-book information (classification, ISBN, title, subject, author). The chromosome is a binary vector with a gene for each of the five attributes: A 1 means the column is indexed; 0 means it's not. For example, the chromosome (01011) would mean that there are indexes on ISBN, subject, and author only. Notice that the ISBN is the primary key, yet no attempt is made to accept only chromosomes with the ISBN indexed. The idea is to let the genetic process find a solution without any help.

This same chromosome pattern can also be used to represent a type of query. A 1 in a "query chromosome" means that the corresponding column is used by the query; 0 means it's not. For example, the query "find I, THE JURY by Mickey Spillane" is the chromosome (00101), because the title and author are given.

This correspondence makes it simple to simulate query runs. The payoff formula is based on hitting or missing an index in a query. The optimal score is to ask for only indexed columns, which makes sense because there's a chance that a nonindexed column would require a sequential search of the table.
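
Fotouhi and Galarce's exact payoff formula isn't given here, so the following Python sketch is only an illustrative stand-in: it rewards each queried column that hits an index and charges a penalty for each queried column that misses (and so might force a sequential search).

  def payoff(index_chrom, query_chrom, scan_penalty=10):
      # Score one query against one index chromosome: +1 for each
      # queried column that is indexed, a penalty for each queried
      # column that is not.
      score = 0
      for indexed, queried in zip(index_chrom, query_chrom):
          if queried and indexed:
              score += 1
          elif queried:
              score -= scan_penalty
      return score

  # The Spillane query (00101) against indexes (01011):
  # author hits, title misses.
  print(payoff([0, 1, 0, 1, 1], [0, 0, 1, 0, 1]))  # prints -9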

Fotouhi and Galarce ran a series of random queries with a known statistical distribution against a test database of one million rows. The chromosomes with the highest scores survived that test run and were used to build the next one (the next "generation"). The performance of the system was measured in terms of average query-response time. The system leveled out in about ten generations with a 5-bit chromosome, but took longer with a 10-bit chromosome.

The Fotouhi-Galarce experiment was based on a single table, a rare occurrence in the real world. Tables are built based on a set of functional dependencies (FDs). An FD between two data items means that if I know the first value, I can determine the second. This is written "A -> B" and read "A determines B." For example, if I know a part's stock number, I can look up its weight in the inventory file.

The way I combine columns to make tables uses normal forms built from FDs. A first-normal-form (1NF) table is simply a table; you cannot avoid 1NF in SQL. A second-normal-form (2NF) table is a 1NF table in which every nonkey column depends on an entire key, not on just part of one. A third-normal-form (3NF) table is a 2NF table with no transitive dependencies. A transitive dependency occurs in a table with columns A, B, and C such that (A -> B) and (B -> C), which implies (A -> C).

There are higher normal forms, but most databases have to get to at least 3NF to work without anomalies. A quick example shows why this is important. Consider a table T(department, advisor, student) for faculty advisors at a college, where each student has one advisor and each advisor belongs to one department; that is, (student -> advisor) and (advisor -> department), which implies (student -> department) transitively. If Dr. Celko in computer science is deleted, so are all his students (deletion anomaly); we cannot record a department's advisor until he has at least one student (insertion anomaly); if Dr. Celko switches to the math department, so do all his students (update anomaly). What we needed was two tables, T1(advisor, department) and T2(student, advisor), to get rid of the transitive dependency.
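
In the pseudo-SQL notation of Example 3 (below), the repaired schema would look like this:

  CREATE TABLE T1
    (advisor, department, PRIMARY KEY (advisor));
  CREATE TABLE T2
    (student, advisor, PRIMARY KEY (student));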

The bad news is that it's possible to develop more than one 3NF schema from the same set of FDs.

An Example

Let's say an imaginary airline has a database for scheduling flights and pilots. Most of the relationships are obvious: Flights have only one departure time and one destination, and they can get a different pilot and be assigned to a different gate each day of the week.

The functional dependencies for the database are given in Example 2. Example 3 provides five possible answers, although there can be many more. The query chromosome structure would have six genes (day, destination, flight, gate, hour, and pilot). To build each possible 3NF schema, we first make tables whose primary keys come from the left-hand sides of the functional dependencies. Once the tables are defined, we apply queries against the whole database schema, not just one table as we did before.

Example 2: The functional dependencies for the database.

  flight -> destination
  flight -> hour
  (day, flight) -> gate
  (day, flight) -> pilot
  (day, hour, gate) -> destination
  (day, hour, gate) -> flight
  (day, hour, gate) -> pilot
  (day, hour, pilot) -> destination
  (day, hour, pilot) -> flight
  (day, hour, pilot) -> gate

Example 3: Pseudo-SQL notation for creating tables within a database schema. Type declarations and constraints are not shown, just the table names, column names, and primary keys.

  CREATE SCHEMA Normal_1;
  CREATE TABLE Departures
    (flight, destination, hour, PRIMARY KEY (flight));
  CREATE TABLE WeeklyRoster
    (day, hour, gate, flight, pilot,
     PRIMARY KEY (day, hour, gate));

  CREATE SCHEMA Normal_2;
  CREATE TABLE Departures
    (flight, destination, hour, PRIMARY KEY (flight));
  CREATE TABLE WeeklyRoster
    (day, hour, pilot, flight, gate,
     PRIMARY KEY (day, hour, pilot));

  CREATE SCHEMA Normal_3;
  CREATE TABLE Departures
    (flight, destination, hour, PRIMARY KEY (flight));
  CREATE TABLE GatePilotSchedule
    (day, flight, gate, pilot, PRIMARY KEY (day, flight));
  CREATE TABLE GateFlightSchedule
    (day, hour, gate, flight, PRIMARY KEY (day, hour, gate));
  CREATE TABLE PilotFlightSchedule
    (day, hour, pilot, flight, PRIMARY KEY (day, hour, pilot));

  CREATE SCHEMA Normal_4;
  CREATE TABLE Departures
    (destination, hour, flight, PRIMARY KEY (flight));
  CREATE TABLE GateFlightSchedule
    (day, flight, gate, PRIMARY KEY (day, flight));
  CREATE TABLE GatePilotSchedule
    (day, hour, gate, pilot, PRIMARY KEY (day, hour, gate));
  CREATE TABLE PilotFlightSchedule
    (day, hour, pilot, flight, PRIMARY KEY (day, hour, pilot));

  CREATE SCHEMA Normal_5;
  CREATE TABLE Departures
    (destination, hour, flight, PRIMARY KEY (flight));
  CREATE TABLE DutyRoster
    (day, flight, pilot, PRIMARY KEY (day, flight));
  CREATE TABLE GateFlightSchedule
    (day, hour, gate, flight, PRIMARY KEY (day, hour, gate));
  CREATE TABLE GatePilotSchedule
    (day, hour, pilot, gate, PRIMARY KEY (day, hour, pilot));

A table chromosome is made up of a subset of the ten original functional dependencies. Two rules have to be obeyed by the tables. First, no combination that violates the 3NF condition is allowed; it's a "lethal mutation" and is rejected at once. Second, all columns must be present in some table in the schema; this assumes that you won't put information into a database if you aren't going to look at it.
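
The 3NF test takes real work, but the column-coverage rule is only a few lines. A sketch in Python, with each table represented as a set of column names:

  # All six columns of the airline schema must appear somewhere.
  ALL_COLUMNS = {'day', 'destination', 'flight', 'gate', 'hour', 'pilot'}

  def covers_all_columns(schema):
      used = set()
      for table in schema:      # a schema is a list of tables
          used |= table
      return used == ALL_COLUMNS

  # Normal_1: Departures plus WeeklyRoster covers everything.
  print(covers_all_columns([{'flight', 'destination', 'hour'},
                            {'day', 'hour', 'gate', 'flight', 'pilot'}]))  # True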

The database schemas will be made up of more than one table, and we'll be mutating one or more tables at a time. The payoff function will have to consider joins between tables (an expensive operation), the number of tables accessed, and so forth.

A join is how SQL relates data in one table to that in another. A join query builds a result table, which is made up of columns from other tables. Given tables X1(x,b,c) and X2(x,d,e), the syntax in SQL for a simple join would look like this:

  SELECT X1.x, b, c, d, e
    FROM X1, X2
    WHERE (X1.x = X2.x);

This builds a result table with five columns by taking each row in X1 and concatenating it to each row in X2 (a Cartesian product), then keeping only those new rows for which the join condition (X1.x = X2.x) is true. The <table>.<column> notation is used when columns in different tables have the same name.
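
That procedural description translates directly into code. Here is the same join as a sketch in Python, with rows represented as dictionaries and a couple of made-up rows for data:

  X1 = [{'x': 1, 'b': 'b1', 'c': 'c1'},
        {'x': 2, 'b': 'b2', 'c': 'c2'}]
  X2 = [{'x': 1, 'd': 'd1', 'e': 'e1'},
        {'x': 3, 'd': 'd3', 'e': 'e3'}]

  # Cartesian product of X1 and X2, keeping only the rows that
  # satisfy the join condition (X1.x = X2.x).
  result = [(r1['x'], r1['b'], r1['c'], r2['d'], r2['e'])
            for r1 in X1
            for r2 in X2
            if r1['x'] == r2['x']]
  print(result)   # [(1, 'b1', 'c1', 'd1', 'e1')]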

Notice that, unlike in Xbase and other nonrelational database languages, indexes are never mentioned in the SQL SELECT statement. The SQL engine decides which indexes (if any) to use and builds a query plan behind the scenes. This means that indexes can be dropped and created without changing the queries, but you do have to recompile the query plans. Indexes can both help and hurt the performance of joins. Hopefully, the query optimizer will pick the fastest search, but it can be fooled: It might pick an index when a sequential read of the table would have been faster.

For example, if I want to use the flight to find both the day and the destination, I have to join TABLE Departures to TABLE GateFlightSchedule in the Normal_4 schema.

The operations on the schema will now be more complex than those we used for indexes on a single table. We have to allow tables to combine or split. The goal is to have the smallest number of tables used in the queries to avoid the cost of joins. Once the tables are determined for the set of queries, we can apply the index genetic algorithm to the tables.

Conclusion

To implement a system like this using existing facilities, start with the query optimizer: In most mainframe SQL packages, the query plan records the names of the indexes it uses. The system also tracks the CPU usage of each job on several different parameters--time, resources, working storage used, and so on. Most shops have repeated workloads that are close to identical in size and mix on weekly or monthly cycles. The statistics are fairly simple to gather; the real trick is deciding on a payoff function.

Every large-scale database package has a utility program to trap statistics--the number of accesses to a table, the resources used by a job, and so on. Next, have the indexes automatically appear or disappear as needed, a process that's easy in SQL using the commands DROP INDEX <index name>; and CREATE INDEX <index name> ON <table>(<column list>);, usually with vendor extensions on the CREATE INDEX statement. Remember to recompile the execution plans.
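
For example, to rebuild an index on the pilot column of the WeeklyRoster table (the index name is invented for illustration):

  DROP INDEX pilot_idx;
  CREATE INDEX pilot_idx ON WeeklyRoster (pilot);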

You can't reorganize the database schema itself this way, because existing queries would stop working. Of course, as the input changes, the old schema might no longer be the best possible, so you would have to run the genetic system on a regular basis, or whenever the performance of the system declined dramatically.

References

Comer, D. "The Difficulty of Optimum Index Selection." ACM Transactions on Database Systems (vol. 3, 1978).

Fotouhi, Farshad and Carlos E. Galarce. "Genetic Algorithms and the Search for Optimal Database Index Selection," in Lecture Notes in Computer Science #507. Berlin: Springer-Verlag, 1991.

Goldberg, D.E. Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley, 1989.

Piatetsky-Shapiro, G. "The Optimal Selection of Secondary Indices is NP-Complete." SIGMOD Record (vol. 13, no. 2, 1983).


Copyright © 1993, Dr. Dobb's Journal
