For large amounts of input, the linear access time of linked lists is prohibitive. In this chapter we look at a simple data structure for which the running time of most operations is O(log *n*) on average. We also sketch a conceptually simple modification to this data structure that guarantees the above time bound in the worst case and discuss a second modification that essentially gives an O(log *n*) running time per operation for a long sequence of instructions.

The data structure that we are referring to is known as a *binary search tree*. *Trees* in general are very useful abstractions in computer science, so we will discuss their use in other, more general applications. In this chapter, we will

See how trees are used to implement the file system of several popular operating systems.

See how trees can be used to evaluate arithmetic expressions.

A *tree* can be defined in several ways. One natural way to define a tree is recursively. A tree is a collection of nodes. The collection can be empty, which is sometimes denoted as Λ. Otherwise, a tree consists of a distinguished node *r*, called the *root*, and zero or more (sub)trees *T*_{1}, *T*_{2}, . . . , *T*_{k}, each of whose roots are connected by a directed edge to *r*.

typedef struct tree_node *tree_ptr;

struct tree_node
{
    element_type element;
    tree_ptr first_child;
    tree_ptr next_sibling;
};

There are many applications for trees. One of the popular uses is the directory structure in many common operating systems, including UNIX, VAX/VMS, and DOS. Figure 4.5 is a typical directory in the UNIX file system.

The root of this directory is */usr*. (The asterisk next to the name indicates that */usr* is itself a directory.) */usr* has three children, *mark*, *alex*, and *bill*, which are themselves directories. Thus, */usr* contains three directories and no regular files. The filename */usr/mark/book/ch1.r* is obtained by following the leftmost child three times. Each / after the first indicates an edge; the result is the full *pathname*. This hierarchical file system is very popular, because it allows users to organize their data logically. Furthermore, two files in different directories can share the same name, because they must have different paths from the root and thus have different pathnames. A directory in the UNIX file system is just a file with a list of all its children, so the directories are structured almost exactly in accordance with the type declaration above.^{*} Indeed, if the normal command to print a file is applied to a directory, then the names of the files in the directory can be seen in the output (along with other non-ASCII information).

void
list_directory ( Directory_or_file D )
{
    list_dir ( D, 0 );
}

void
list_dir ( Directory_or_file D, unsigned int depth )
{
/*1*/       if ( D is a legitimate entry)
            {
/*2*/           print_name ( depth, D );
/*3*/           if( D is a directory )
/*4*/               for each child, c, of D
/*5*/                   list_dir( c, depth+1 );
            }
}

This traversal strategy is known as a *preorder* traversal. In a preorder traversal, work at a node is performed before (*pre*) its children are processed. When this program is run, it is clear that line 2 is executed exactly once per node, since each name is output once. Since line 2 is executed at most once per node, line 3 must also be executed at most once per node. Furthermore, line 5 can be executed at most once for each child of each node. But over the whole tree, the total number of children is exactly one less than the number of nodes. Finally, the *for* loop iterates once per execution of line 5, plus once each time the loop ends. Each *for* loop terminates on a *NULL* pointer, but there is at most one of those per node. Thus, the total amount of work is constant per node. If there are *n* file names to be output, then the running time is *O*(*n*).
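As a concrete illustration, the first-child/next-sibling declaration and the preorder listing can be fleshed out as follows. This is a runnable sketch of ours, not the book's figure: *element_type* is specialized to a string, and the routine returns the number of names printed so the result can be checked.

```c
#include <assert.h>
#include <stdio.h>

/* A sketch of list_dir over the first-child/next-sibling
   representation. The string element type and the returned count
   are our additions, for illustration only. */
typedef struct tree_node *tree_ptr;
struct tree_node
{
    const char *element;
    tree_ptr first_child;
    tree_ptr next_sibling;
};

static unsigned
list_dir( tree_ptr D, unsigned depth )
{
    unsigned printed = 0;
    if( D != NULL )                          /* D is a legitimate entry */
    {
        printf( "%*s%s\n", 4 * (int)depth, "", D->element );
        printed = 1;
        /* for each child, c, of D: follow the sibling chain */
        for( tree_ptr c = D->first_child; c != NULL; c = c->next_sibling )
            printed += list_dir( c, depth + 1 );
    }
    return printed;
}
```

Each child list is a chain of *next_sibling* pointers, so "for each child" is a walk down that chain, and each name is printed indented by its depth.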

/usr
    mark
        book
            ch1.r
            ch2.r
            ch3.r
        course
            cop3530
                fall88
                    syl.r
                spr89
                    syl.r
                sum89
                    syl.r
        junk.c
    alex
        junk.c
    bill
        work
        course
            cop3212
                fall88
                    grades
                    prog1.r
                    prog2.r
                fall89
                    prog1.r
                    prog2.r
                    grades

Another common method of traversing a tree is the *postorder* traversal. In a postorder traversal, the work at a node is performed after (*post*) its children are evaluated. As an example, Figure 4.8 represents the same directory structure as before, with the numbers in parentheses representing the number of disk blocks taken up by each file.

unsigned int
size_directory( Directory_or_file D )
{
    unsigned int total_size;

/*1*/       total_size = 0;
/*2*/       if( D is a legitimate entry)
            {
/*3*/           total_size = file_size( D );
/*4*/           if( D is a directory )
/*5*/               for each child, c, of D
/*6*/                   total_size += size_directory( c );
            }
/*7*/       return( total_size );
}

            ch1.r 3
            ch2.r 2
            ch3.r 4
        book 10
                    syl.r 1
                fall88 2
                    syl.r 5
                spr89 6
                    syl.r 2
                sum89 3
            cop3530 12
        course 13
        junk.c 6
    mark 30
        junk.c 8
    alex 9
        work 1
                        grades 3
                        prog1.r 4
                        prog2.r 1
                    fall88 9
                        prog2.r 2
                        prog1.r 7
                        grades 9
                    fall89 19
                cop3212 29
            course 30
    bill 32
/usr 72

A binary tree is a tree in which no node can have more than two children.

Because a binary tree node has at most two children, we can keep direct pointers to them. The declaration of tree nodes is similar in structure to that for doubly linked lists, in that a node is a structure consisting of the *key* information plus two pointers (*left* and *right*) to other nodes (see the declaration below).

typedef struct tree_node *tree_ptr;

struct tree_node
{
    element_type element;
    tree_ptr left;
    tree_ptr right;
};

typedef tree_ptr TREE;

Figure 4.14 shows an example of an *expression tree*. The leaves of an expression tree are *operands*, such as constants or variable names, and the other nodes contain *operators*. This particular tree happens to be binary, because all of the operators are binary, and although this is the simplest case, it is possible for nodes to have more than two children. It is also possible for a node to have only one child, as is the case with the *unary minus* operator. We can evaluate an expression tree, *T*, by applying the operator at the root to the values obtained by recursively evaluating the left and right subtrees. In our example, the left subtree evaluates to *a* + (*b* * *c*) and the right subtree evaluates to ((*d* * *e*) + *f*) * *g*. The entire tree therefore represents (*a* + (*b* * *c*)) + (((*d* * *e*) + *f*) * *g*).

We can produce an (overly parenthesized) infix expression by recursively producing a parenthesized left expression, then printing out the operator at the root, and finally recursively producing a parenthesized right expression. This general strategy (left, node, right) is known as an *inorder* traversal; it is easy to remember because of the type of expression it produces.

An alternate traversal strategy is to recursively print out the left subtree, the right subtree, and then the operator. If we apply this strategy to our tree above, the output is a b c * + d e * f + g * +, which is easily seen to be the postfix representation of Section 3.3.3. This traversal strategy is generally known as a *postorder* traversal. We have seen this traversal strategy earlier in Section 4.1.

A third traversal strategy is to print out the operator first and then recursively print out the left and right subtrees. The resulting expression, + + a * b c * + * d e f g, is the less useful *prefix* notation and the traversal strategy is a *preorder* traversal, which we have also seen earlier in Section 4.1. We will return to these traversal strategies once again later in the chapter.
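To check these orders concretely, here is a small sketch that builds the example expression tree by hand and collects the postorder and preorder visits into strings. The buffer-building helpers and the single-character element type are our simplifications, not the book's code.

```c
#include <assert.h>
#include <string.h>

/* Node layout follows the binary tree declaration, with a one-char
   element for this sketch. */
typedef struct tree_node *tree_ptr;
struct tree_node { char element; tree_ptr left, right; };

/* Visit left subtree, right subtree, then the node (postorder). */
static void postorder( tree_ptr t, char *out )
{
    if( t == NULL ) return;
    postorder( t->left, out );
    postorder( t->right, out );
    strncat( out, &t->element, 1 );
}

/* Visit the node first, then left and right subtrees (preorder). */
static void preorder( tree_ptr t, char *out )
{
    if( t == NULL ) return;
    strncat( out, &t->element, 1 );
    preorder( t->left, out );
    preorder( t->right, out );
}
```

Run on the tree for (a + b*c) + ((d*e + f)*g), *postorder* yields the postfix string "abc*+de*f+g*+" and *preorder* yields the prefix string "++a*bc*+*defg", matching the expressions quoted above.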

As an example, suppose the input is

a b + c d e + * *

*For convenience, we will have the stack grow from left to right in the diagrams.

Now a '+' is read, so two trees are merged.

Continuing, a '*' is read, so we pop two tree pointers and form a new tree with a '*' as root.
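The stack algorithm described above can be written out as a short routine: read the postfix string left to right, push a one-node tree for each operand, and for each operator pop two trees and push a new tree with the operator as root. The fixed-size stack, the `new_node` helper, and the restriction to '+' and '*' are simplifying assumptions of ours.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct tree_node *tree_ptr;
struct tree_node { char element; tree_ptr left, right; };

static tree_ptr new_node( char x, tree_ptr l, tree_ptr r )
{
    tree_ptr t = malloc( sizeof( struct tree_node ) );
    t->element = x; t->left = l; t->right = r;
    return t;
}

static tree_ptr
build_expression_tree( const char *postfix )
{
    tree_ptr stack[64];                  /* fixed size for the sketch */
    int top = 0;
    for( ; *postfix; postfix++ )
    {
        if( *postfix == ' ' )            /* skip blanks in the input */
            continue;
        if( *postfix == '+' || *postfix == '*' )
        {   /* pop two trees; the first popped is the right child */
            tree_ptr right = stack[--top];
            tree_ptr left = stack[--top];
            stack[top++] = new_node( *postfix, left, right );
        }
        else                             /* operand: one-node tree */
            stack[top++] = new_node( *postfix, NULL, NULL );
    }
    return stack[--top];                 /* the completed tree */
}
```

On the input a b + c d e + * * this produces a root '*' whose left subtree is (a + b) and whose right subtree is c * (d + e), as in the figures.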

An important application of binary trees is their use in searching. Let us assume that each node in the tree is assigned a key value. In our examples, we will assume for simplicity that these are integers, although arbitrarily complex keys are allowed. We will also assume that all the keys are distinct, and deal with duplicates later.

typedef struct tree_node *tree_ptr;

struct tree_node
{
    element_type element;
    tree_ptr left;
    tree_ptr right;
};

typedef tree_ptr SEARCH_TREE;

This operation generally requires returning a pointer to the node in tree *T* that has key *x*, or *NULL* if there is no such node. The structure of the tree makes this simple. If *T* is *NULL*, then we can just return *NULL*. Otherwise, if the key stored at *T* is *x*, we can return *T*. Otherwise, we make a recursive call on a subtree of *T*, either left or right, depending on the relationship of *x* to the key stored in *T*. The code in Figure 4.18 is an implementation of this strategy.

SEARCH_TREE
make_null ( void )
{
    return NULL;
}

tree_ptr
find( element_type x, SEARCH_TREE T )
{
    if( T == NULL )
        return NULL;
    if( x<T->element )
        return( find( x, T->left ) );
    else
        if( x>T->element )
            return( find( x, T->right ) );
        else
            return T;
}

tree_ptr
find_min( SEARCH_TREE T )
{
    if( T == NULL )
        return NULL;
    else
        if( T->left == NULL )
            return( T );
        else
            return( find_min( T->left ) );
}

tree_ptr
find_max( SEARCH_TREE T )
{
    if( T != NULL )
        while( T->right != NULL )
            T = T->right;
    return T;
}

The insertion routine is conceptually simple. To insert *x* into tree *T*, proceed down the tree as you would with a *find*. If *x* is found, do nothing (or "update" something). Otherwise, insert *x* at the last spot on the path traversed. Figure 4.21 shows what happens. To insert 5, we traverse the tree as though a *find* were occurring. At the node with key 4, we need to go right, but there is no subtree, so 5 is not in the tree, and this is the correct spot.

Duplicates can be handled by keeping an extra field in the node record indicating the frequency of occurrence. This adds some extra space to the entire tree, but is better than putting duplicates in the tree (which tends to make the tree very deep). Of course this strategy does not work if the key is only part of a larger record. If that is the case, then we can keep all of the records that have the same key in an auxiliary data structure, such as a list or another search tree.
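A minimal sketch of the frequency-count idea, with the key type specialized to `int` and a hypothetical `count` field added to the node record; inserting an existing key just bumps the count rather than creating a new node.

```c
#include <assert.h>
#include <stdlib.h>

/* Our illustration of duplicate handling, not the book's figure. */
typedef struct tree_node *tree_ptr;
struct tree_node
{
    int element;
    unsigned count;                /* frequency of occurrence of key */
    tree_ptr left, right;
};

static tree_ptr insert( int x, tree_ptr T )
{
    if( T == NULL )
    {
        T = malloc( sizeof( struct tree_node ) );
        T->element = x; T->count = 1;
        T->left = T->right = NULL;
    }
    else if( x < T->element )
        T->left = insert( x, T->left );
    else if( x > T->element )
        T->right = insert( x, T->right );
    else
        T->count++;                /* duplicate: bump count, no new node */
    return T;
}
```

The shape of the tree, and hence its depth, is unaffected by how many times a key is inserted.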

Figure 4.22 shows the code for the insertion routine. Since *T* points to the root of the tree, and the root changes on the first insertion, *insert* is written as a function that returns a pointer to the root of the new tree. Lines 8 and 10 recursively insert and attach *x* into the appropriate subtree.

tree_ptr
insert( element_type x, SEARCH_TREE T )
{
/*1*/       if( T == NULL )
            { /* Create and return a one-node tree */
/*2*/           T = (SEARCH_TREE) malloc ( sizeof (struct tree_node) );
/*3*/           if( T == NULL )
/*4*/               fatal_error("Out of space!!!");
                else
                {
/*5*/               T->element = x;
/*6*/               T->left = T->right = NULL;
                }
            }
            else
/*7*/           if( x<T->element )
/*8*/               T->left = insert( x, T->left );
                else
/*9*/               if( x>T->element )
/*10*/                  T->right = insert( x, T->right );
                    /* else x is in the tree already. We'll do nothing */
/*11*/      return T; /* Don't forget this line!! */
}

As is common with many data structures, the hardest operation is deletion. Once we have found the node to be deleted, we need to consider several possibilities.

If the node is a leaf, it can be deleted immediately. If the node has one child, the node can be deleted after its parent adjusts a pointer to bypass the node (we will draw the pointer directions explicitly for clarity). See Figure 4.23. Notice that the deleted node is now unreferenced and can be disposed of only if a pointer to it has been saved.

The complicated case deals with a node with two children. The general strategy is to replace the key of this node with the smallest key of the right subtree (which is easily found) and recursively delete that node (which is now empty). Because the smallest node in the right subtree cannot have a left child, the second *delete* is an easy one. Figure 4.24 shows an initial tree and the result of a deletion. The node to be deleted is the left child of the root; the key value is 2. It is replaced with the smallest key in its right subtree (3), and then that node is deleted as before.

The code in Figure 4.25 performs deletion. It is inefficient, because it makes two passes down the tree to find and delete the smallest node in the right subtree when this is appropriate. It is easy to remove this inefficiency, by writing a special *delete_min* function, and we have left it in only for simplicity.

tree_ptr
delete( element_type x, SEARCH_TREE T )
{
    tree_ptr tmp_cell, child;

    if( T == NULL )
        error("Element not found");
    else
        if( x<T->element ) /* Go left */
            T->left = delete( x, T->left );
        else
            if( x>T->element ) /* Go right */
                T->right = delete( x, T->right );
            else /* Found element to be deleted */
                if( T->left && T->right ) /* Two children */
                { /* Replace with smallest in right subtree */
                    tmp_cell = find_min( T->right );
                    T->element = tmp_cell->element;
                    T->right = delete( T->element, T->right );
                }
                else /* One child */
                {
                    tmp_cell = T;
                    if( T->left == NULL ) /* Only a right child */
                        child = T->right;
                    if( T->right == NULL ) /* Only a left child */
                        child = T->left;
                    free( tmp_cell );
                    return child;
                }
    return T;
}

Intuitively, we expect that all of the operations of the previous section, except *make_null*, should take *O*(log *n*) time, because in constant time we descend a level in the tree, thus operating on a tree that is now roughly half as large. Indeed, the running time of all the operations, except *make_null*, is *O*(*d*), where *d* is the depth of the node containing the accessed key.

If *D*(*n*) is the internal path length of a tree with *n* nodes (the sum of the depths of all its nodes), then a tree with an *i*-node left subtree and an (*n* - *i* - 1)-node right subtree satisfies

D(n) = D(i) + D(n - i - 1) + n - 1

since each of the *n* - 1 nodes in the two subtrees is one level deeper when counted in the whole tree.

This recurrence will be encountered and solved in Chapter 7, obtaining an average value of *D*(*n*) = *O*(*n* log *n*). Thus, the expected depth of any node is *O*(log *n*). As an example, the randomly generated 500-node tree shown in Figure 4.26 has nodes at expected depth 9.98.

It is tempting to say immediately that this result implies that the average running time of all the operations discussed in the previous section is *O*(log *n*), but this is not entirely true. The reason is that because of deletions, it is not clear that all binary search trees are equally likely. In particular, the deletion algorithm described above favors making the left subtrees deeper than the right, because we are always replacing a deleted node with a node from the right subtree. The exact effect of this strategy is still unknown, but it seems only to be a theoretical novelty. It has been shown that if we alternate insertions and deletions Θ(*n*^{2}) times, then the trees will have an expected depth of Θ(√*n*). After a quarter-million random *insert/delete* pairs, the tree that was somewhat right-heavy in Figure 4.26 looks decidedly unbalanced (average depth = 12.51). See Figure 4.27.

If the input comes into a tree presorted, then a series of *inserts* will take quadratic time and give a very expensive implementation of a linked list, since the tree will consist only of nodes with no left children. One solution to the problem is to insist on an extra structural condition called *balance:* no node is allowed to get too deep.
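The presorted worst case is easy to demonstrate: inserting 1, 2, . . . , *n* in order produces a tree that is a single right path of height *n* - 1. A small sketch of ours, with `int`-keyed insert and height helpers:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct tree_node *tree_ptr;
struct tree_node { int element; tree_ptr left, right; };

static tree_ptr insert( int x, tree_ptr T )
{
    if( T == NULL )
    {
        T = malloc( sizeof( struct tree_node ) );
        T->element = x; T->left = T->right = NULL;
    }
    else if( x < T->element )
        T->left = insert( x, T->left );
    else if( x > T->element )
        T->right = insert( x, T->right );
    return T;
}

static int height( tree_ptr T )
{
    if( T == NULL )
        return -1;                 /* height of an empty tree */
    int hl = height( T->left ), hr = height( T->right );
    return 1 + ( hl > hr ? hl : hr );
}
```

Inserting 1 through 100 in order yields a tree of height 99: every node has an empty left subtree, so each *find* down the path costs as much as a linked-list scan.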

A second, newer, method is to forego the balance condition and allow the tree to be arbitrarily deep, but after every operation, a restructuring rule is applied that tends to make future operations efficient. These types of data structures are generally classified as *self-adjusting*. In the case of a binary search tree, we can no longer guarantee an *O*(log *n*) bound on any single operation, but can show that any *sequence* of *m* operations takes total time *O*(*m *log *n*) in the worst case. This is generally sufficient protection against a bad worst case. The data structure we will discuss is known as a *splay tree*; its analysis is fairly intricate and is discussed in Chapter 11.

An AVL (Adelson-Velskii and Landis) tree is a binary search tree with a *balance* condition. The balance condition must be easy to maintain, and it ensures that the depth of the tree is *O*(log *n*). The simplest idea is to require that the left and right subtrees have the same height. As Figure 4.28 shows, this idea does not force the tree to be shallow.

An AVL tree is identical to a binary search tree, except that for every node in the tree, the height of the left and right subtrees can differ by at most 1. (The height of an empty tree is defined to be -1.) In Figure 4.29 the tree on the left is an AVL tree, but the tree on the right is not. Height information is kept for each node (in the node structure). It is easy to show that the height of an AVL tree is at most roughly 1.44 log(*n* + 2) - .328, but in practice it is about log(*n* + 1) + 0.25 (although the latter claim has not been proven). As an example, the AVL tree of height 9 with the fewest nodes (143) is shown in Figure 4.30. This tree has as a left subtree an AVL tree of height 7 of minimum size. The right subtree is an AVL tree of height 8 of minimum size. This tells us that the minimum number of nodes, *N*(*h*), in an AVL tree of height *h* is given by *N*(*h*) = *N*(*h* -1) + *N*(*h* - 2) + 1. For *h* = 0, *N*(*h*) = 1. For *h* = 1, *N*(*h*) = 2. The function *N*(*h*) is closely related to the Fibonacci numbers, from which the bound claimed above on the height of an AVL tree follows.
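The recurrence for the minimum size is easy to tabulate; a small sketch of ours:

```c
#include <assert.h>

/* Minimum number of nodes in an AVL tree of height h:
   N(h) = N(h-1) + N(h-2) + 1, with N(0) = 1 and N(1) = 2. */
static unsigned long
min_avl_nodes( unsigned h )
{
    if( h == 0 )
        return 1;
    if( h == 1 )
        return 2;
    return min_avl_nodes( h - 1 ) + min_avl_nodes( h - 2 ) + 1;
}
```

For *h* = 9 this gives 143, matching the tree described above, with minimum-size subtrees of heights 7 and 8 contributing 54 and 88 nodes.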

Thus, all the tree operations can be performed in *O*(log *n*) time, except possibly insertion (we will assume lazy deletion). When we do an insertion, we need to update all the balancing information for the nodes on the path back to the root, but the reason that insertion is potentially difficult is that inserting a node could violate the AVL tree property. (For instance, inserting into the AVL tree in Figure 4.29 would destroy the balance condition at the node with key 8.) If this is the case, then the property has to be restored before the insertion step is considered over. It turns out that this can always be done with a simple modification to the tree, known as a rotation. We describe rotations in the following section.

The two trees in Figure 4.31 contain the same elements and are both binary search trees. First of all, in both trees *k*_{1} < *k*_{2}. Second, all elements in the subtree *X* are smaller than *k*_{1} in both trees. Third, all elements in subtree *Z* are larger than *k*_{2}. Finally, all elements in subtree *Y* are in between *k*_{1} and *k*_{2}. The conversion of one of the above trees to the other is known as a *rotation*. A rotation involves only a few pointer changes (we shall see exactly how many later), and changes the structure of the tree while preserving the search tree property.

The rotation does not have to be done at the root of a tree; it can be done at any node in the tree, since that node is the root of some subtree. It can transform either tree into the other. This gives a simple method to fix up an AVL tree if an insertion causes some node in an AVL tree to lose the balance property: Do a rotation at that node. The basic algorithm is to start at the node inserted and travel up the tree, updating the balance information at every node on the path. If we get to the root without having found any badly balanced nodes, we are done. Otherwise, we do a rotation at the first bad node found, adjust its balance, and are done (we do not have to continue going to the root). In many cases, this is sufficient to rebalance the tree. For instance, in Figure 4.32, after the insertion into the original AVL tree on the left, node 8 becomes unbalanced. Thus, we do a single rotation between 7 and 8, obtaining the tree on the right.

The algorithm described in the preceding paragraphs has one problem. There is a case where the rotation does not fix the tree. Continuing our example, suppose we insert keys 8 through 15 in reverse order. Inserting 15 is easy, since it does not destroy the balance property, but inserting 14 causes a height imbalance at node 7.

As the diagram shows, the single rotation has not fixed the height imbalance. The problem is that the height imbalance was caused by a node inserted into the tree containing the middle elements (tree *Y* in Fig. 4.31) at the same time as the other trees had identical heights. The case is easy to check for, and the solution is called a *double rotation*, which is similar to a single rotation but involves four subtrees instead of three. In Figure 4.33, the tree on the left is converted to the tree on the right. By the way, the effect is the same as rotating between *k*_{1} and *k*_{2} and then between *k*_{2} and *k*_{3}. There is a symmetric case, which is also shown (see Fig. 4.34).

Next we insert 13, which requires a double rotation. Here the double rotation is again a right-left double rotation that will involve 6, 14, and 7 and will restore the tree. In this case, *k*_{3} is the node with key 6, *k*_{1} is the node with key 14, and *k*_{2} is the node with key 7. Subtree *A* is the tree rooted at the node with key 5, subtree *B* is the empty subtree that was originally the left child of the node with key 7, subtree *C* is the tree rooted at the node with key 13, and finally, subtree *D* is the tree rooted at the node with key 15.

Insertion of 11 will require a single rotation:

Another efficiency issue concerns storage of the height information. Since all that is really required is the difference in height, which is guaranteed to be small, we could get by with two bits (to represent +1, 0, -1) if we really try. Doing so will avoid repetitive calculation of balance factors but results in some loss of clarity. The resulting code is somewhat more complicated than if the height were stored at each node. If a recursive routine is written, then speed is probably not the main consideration. In this case, the slight speed advantage obtained by storing balance factors hardly seems worth the loss of clarity and relative simplicity. Furthermore, since most machines will align this to at least an 8-bit boundary anyway, there is not likely to be any difference in the amount of space used. Eight bits will allow us to store absolute heights of up to 255. Since the tree is balanced, it is inconceivable that this would be insufficient (see the exercises).

With all this, we are ready to write the AVL routines. We will do only a partial job and leave the rest as an exercise. First, we need the declarations. These are given in Figure 4.35. We also need a quick function to return the height of a node. This function is necessary to handle the annoying case of a *NULL* pointer. This is shown in Figure 4.36. The basic insertion routine is easy to write, since it consists mostly of function calls (see Fig. 4.37).

typedef struct avl_node *avl_ptr;

struct avl_node
{
    element_type element;
    avl_ptr left;
    avl_ptr right;
    int height;
};

typedef avl_ptr SEARCH_TREE;

int
height( avl_ptr p )
{
    if( p == NULL )
        return -1;
    else
        return p->height;
}

For the trees in Figure 4.38, *s_rotate_left* converts the tree on the left to the tree on the right, returning a pointer to the new root. *s_rotate_right* is symmetric. The code is shown in Figure 4.39.

The last function we will write will perform the double rotation pictured in Figure 4.40, for which the code is shown in Figure 4.41.

Deletion in AVL trees is somewhat more complicated than insertion. Lazy deletion is probably the best strategy if deletions are relatively infrequent.

We now describe a relatively simple data structure, known as a *splay tree*, that guarantees that any *m* consecutive tree operations take at most *O*(*m *log *n*) time. Although this guarantee does not preclude the possibility that any *single* operation might take *O*(*n*) time, and thus the bound is not as strong as an *O*(log *n*) worst-case bound per operation, the net effect is the same: There are no bad input sequences. Generally, when a sequence of *m* operations has total worst-case running time of *O*(*m f*(*n*)), we say that the *amortized* running time is *O*(*f*(*n*)). Thus, a splay tree has *O*(log *n*) amortized cost per operation. Over a long sequence of operations, some may take more, some less.

SEARCH_TREE
insert( element_type x, SEARCH_TREE T )
{
    return insert1( x, T, NULL );
}

SEARCH_TREE
insert1( element_type x, SEARCH_TREE T, avl_ptr parent )
{
    avl_ptr rotated_tree;

    if( T == NULL )
    { /* Create and return a one-node tree */
        T = (SEARCH_TREE) malloc ( sizeof (struct avl_node) );
        if( T == NULL )
            fatal_error("Out of space!!!");
        else
        {
            T->element = x; T->height = 0;
            T->left = T->right = NULL;
        }
    }
    else
    {
        if( x<T->element )
        {
            T->left = insert1( x, T->left, T );
            if( ( height( T->left ) - height( T->right ) ) == 2 )
            {
                if( x<T->left->element )
                    rotated_tree = s_rotate_left( T );
                else
                    rotated_tree = d_rotate_left( T );
                if( parent->left == T )
                    parent->left = rotated_tree;
                else
                    parent->right = rotated_tree;
            }
            else
                T->height = max( height(T->left), height(T->right) ) + 1;
        }
        else
            /* Symmetric case for right subtree */;
        /* Else x is in the tree already. We'll do nothing */
    }
    return T;
}

/* This function can be called only if k2 has a left child. */
/* Perform a rotate between a node (k2) and its left child. */
/* Update heights. */
/* Then return new root. */

avl_ptr
s_rotate_left( avl_ptr k2 )
{
    avl_ptr k1;

    k1 = k2->left;
    k2->left = k1->right;
    k1->right = k2;
    k2->height = max( height(k2->left), height(k2->right) ) + 1;
    k1->height = max( height(k1->left), k2->height ) + 1;
    return k1; /* New root */
}

/* This function can be called only if k3 has a left child */
/* and k3's left child has a right child. */
/* Do the left-right double rotation. Update heights. */

avl_ptr
d_rotate_left( avl_ptr k3 )
{
    /* rotate between k1 and k2 */
    k3->left = s_rotate_right( k3->left );
    /* rotate between k3 and k2 */
    return( s_rotate_left( k3 ) );
}

The basic idea of the splay tree is that after a node is accessed, it is pushed to the root by a series of AVL tree rotations. Notice that if a node is deep, there are many nodes on the path that are also relatively deep, and by restructuring we can make future accesses cheaper on all these nodes. Thus, if the node is unduly deep, then we want this restructuring to have the side effect of balancing the tree (to some extent). Besides giving a good time bound in theory, this method is likely to have practical utility, because in many applications when a node is accessed, it is likely to be accessed again in the near future. Studies have shown that this happens much more often than one would expect. Splay trees also do not require the maintenance of height or balance information, thus saving space and simplifying the code to some extent (especially when careful implementations are written).

4.5.1. A Simple Idea (That Does Not Work)

One way of performing the restructuring described above is to perform single rotations, bottom up. This means that we rotate every node on the access path with its parent. As an example, consider what happens after an access (a *find*) on *k*_{1} in the following tree.

Then, we rotate between *k*_{1} and *k*_{3}, obtaining the next tree.

Then two more rotations are performed until we reach the root.

The splaying strategy is similar to the rotation idea above, except that we are a little more selective about how rotations are performed. We will still rotate bottom up along the access path. Let *x* be a (nonroot) node on the access path at which we are rotating. If the parent of *x* is the root of the tree, we merely rotate *x* and the root. This is the last rotation along the access path. Otherwise, *x* has both a parent (*p*) and a grandparent (*g*), and there are two cases, plus symmetries, to consider. The first case is the *zig-zag* case (see Fig. 4.42). Here *x* is a right child and *p* is a left child (or vice versa). If this is the case, we perform a double rotation, exactly like an AVL double rotation. Otherwise, we have a *zig-zig* case: *x* and *p* are either both left children or both right children. In that case, we transform the tree on the left of Figure 4.43 to the tree on the right.

As an example, consider the tree from the last example, with a *find* on *k*_{1}:

To see the difference that splaying makes over simple rotation, consider again the effect of inserting keys 1, 2, 3, . . . , *n* into an initially empty tree. This takes a total of *O*(*n*), as before, and yields the same tree as simple rotations. Figure 4.44 shows the result of splaying at the node with key 1. The difference is that after an access of the node with key 1, which takes *n* -1 units, the access on the node with key 2 will only take about *n*/2 units instead of *n* - 2 units; there are no nodes quite as deep as before.

An access on the node with key 2 will bring nodes to within *n*/4 of the root, and this is repeated until the depth becomes roughly log *n* (an example with *n* = 7 is too small to see the effect well). Figures 4.45 to 4.53 show the result of accessing keys 1 through 9 in a 32-node tree that originally contains only left children. Thus we do not get the same bad behavior from splay trees that is prevalent in the simple rotation strategy. (Actually, this turns out to be a very good case. A rather complicated proof shows that for this example, the *n* accesses take a total of *O*(*n*) time).

These figures show off the fundamental and crucial property of splay trees. When access paths are long, thus leading to a longer-than-normal search time, the rotations tend to be good for future operations. When accesses are cheap, the rotations are not as good and can be bad. The extreme case is the initial tree formed by the insertions. All the insertions were constant-time operations leading to a bad initial tree. At that point in time, we had a very bad tree, but we were running ahead of schedule and had the compensation of less total running time. Then a couple of really horrible accesses left a nearly balanced tree, but the cost was that we had to give back some of the time that had been saved. The main theorem, which we will prove in Chapter 11, is that we never fall behind a pace of *O*(log *n*) per operation: We are always on schedule, even though there are occasionally bad operations.

The type declarations (Fig. 4.54) are simple to understand. The splaying routine (Fig. 4.55) takes as argument the last node on the accessed path and makes it the new root. The routines *single_rotate* and *double_rotate* choose the correct type of rotation. We provide the code for *single_rotate* in Figure 4.56.

The rotation routines are similar to the AVL rotations, except that the parent pointers must be maintained. Some sample routines are in the figures that follow. Since *zig* rotations always make *x* the new root, we know that *x* will have no parent after the operation. The code for this is in Figure 4.57.

*Zig-zigs* and *Zig-zags* are similar. We will write the one routine to perform the *zig-zig* splay when both *x* and *p* are left children. One way to do this is to write a *single_rotate* routine that includes pointer changes for the parent, and then implement the complex rotations with two single rotations. This is the way we coded the AVL routines. We have taken a different approach in Figure 4.58 to show the diversity of styles available. See Figure 4.59. You should try to code the other cases yourself; it will be excellent pointer manipulation practice.

typedef struct splay_node *splay_ptr;

struct splay_node
{
    element_type element;
    splay_ptr left;
    splay_ptr right;
    splay_ptr parent;
};

typedef splay_ptr SEARCH_TREE;

void
splay( splay_ptr current )
{
    splay_ptr father;

    father = current->parent;
    while( father != NULL )
    {
        if( father->parent == NULL )
            single_rotate( current );
        else
            double_rotate( current );
        father = current->parent;
    }
}

void
single_rotate( splay_ptr x )
{
    if( x->parent->left == x )
        zig_left( x );
    else
        zig_right( x );
}

void

zig_left( splay_ptr x )

{

splay ptr p, B;

p = x->parent;

B = x->right;

x->right = p; /* x's new right child is p*/

x->parent = NULL; /* x will now be a root */

if( B != NULL )

B->parent = p;

p->left = B;

p->parent = x;

}

We can perform deletion by accessing the node to be deleted. This puts the node at the root. If it is deleted, we get two subtrees *T _{L}* and *T _{R}* (left and right). If we find the largest element in *T _{L}* (which is easy), then this element is rotated to the root of *T _{L}*, and *T _{L}* will now have a root with no right child. We can finish the deletion by making *T _{R}* the right child.

The analysis of splay trees is difficult, because it must take into account the ever-changing structure of the tree. On the other hand, splay trees are much simpler to program than AVL trees, since there are fewer cases to consider and no balance information to maintain. Our splay tree code may look complicated, but as pointed out before, it can be simplified; it is probably much simpler than a nonrecursive AVL implementation. Some empirical evidence suggests that this translates into faster code in practice, although the case for this is far from complete. Finally, we point out that there are several variations of splay trees that can perform even better in practice.

void
zig_zig_left( splay_ptr x )
{
    splay_ptr p, g, B, C, ggp;

    p = x->parent;
    g = p->parent;
    B = x->right;
    C = p->right;
    ggp = g->parent;
    x->right = p;    /* x's new right child is p */
    p->parent = x;
    p->right = g;    /* p's new right child is g */
    g->parent = p;
    if( B != NULL )  /* p's new left child is subtree B */
        B->parent = p;
    p->left = B;
    if( C != NULL )  /* g's new left child is subtree C */
        C->parent = g;
    g->left = C;
    x->parent = ggp; /* connect to rest of the tree */
    if( ggp != NULL )
        if( ggp->left == g )
            ggp->left = x;
        else
            ggp->right = x;
}

Because of the ordering information in a binary search tree, it is simple to list all the keys in sorted order. The recursive procedure in Figure 4.60 does this.

void
print_tree( SEARCH_TREE T )
{
    if( T != NULL )
    {
        print_tree( T->left );
        print_element( T->element );
        print_tree( T->right );
    }
}

Sometimes we need to process both subtrees first before we can process a node. For instance, to compute the height of a node, we need to know the heights of the subtrees first. The code in Figure 4.61 computes this. Since it is always a good idea to check the special cases (and this is crucial when recursion is involved), notice that the routine will declare the height of a leaf to be zero, which is correct. This general order of traversal, which we have also seen before, is known as a *postorder* traversal. Again, the total running time is *O*(*n*), because constant work is performed at each node.

The common idea in all of these routines is that you handle the NULL case first, and then the rest. Notice the lack of extraneous variables. These routines pass only the tree, and do not declare or pass any extra variables. The more compact the code, the less likely that a silly bug will turn up. A fourth, less often used, traversal (which we have not seen yet) is *level-order* traversal. In a level-order traversal, all nodes at depth *d* are processed before any node at depth *d* + 1. Level-order traversal differs from the other traversals in that it is not done recursively; a queue is used, instead of the implied stack of recursion.

int
height( TREE T )
{
    if( T == NULL )
        return -1;
    else
        return ( max( height(T->left), height(T->right) ) + 1 );
}

Although all of the search trees we have seen so far are binary, there is a popular search tree that is not binary. This tree is known as a *B-tree*.

A B-tree of order *m* is a tree with the following structural properties:

The root is either a leaf or has between 2 and *m* children.

All nonleaf nodes (except the root) have between ⌈*m*/2⌉ and *m* children.

All leaves are at the same depth.

The tree in Figure 4.62 is an example of a B-tree of order 4.

A B-tree of order 4 is more popularly known as a 2-3-4 tree, and a B-tree of order 3 is known as a 2-3 tree. We will describe the operation of B-trees by using the special case of 2-3 trees. Our starting point is the 2-3 tree that follows.

We have seen uses of trees in operating systems, compiler design, and searching. Expression trees are a small example of a more general structure known as a *parse tree*, which is a central data structure in compiler design. Parse trees are not binary, but are relatively simple extensions of expression trees (although the algorithms to build them are not quite so simple).

B-trees are balanced *m*-way (as opposed to 2-way or binary) trees, which are well suited for disks; a special case is the 2-3 tree, which is another common method of implementing balanced search trees.

A final note: By inserting elements into a search tree and then performing an inorder traversal, we obtain the elements in sorted order. This gives an *O*(*n* log *n*) algorithm to sort, which is a worst-case bound if any sophisticated search tree is used. We shall see better ways in Chapter 7, but none that have a lower time bound.

*Questions 4.1 to 4.3 refer to the tree in Figure 4.63.*

4.1 For the tree in Figure 4.63 :

4.2 For each node in the tree of Figure 4.63 :

4.3 What is the depth of the tree in Figure 4.63?

4.4 Show that in a binary tree of *n* nodes, there are *n* + 1 NULL pointers representing children.

4.5 Show that the maximum number of nodes in a binary tree of height *h* is 2^{*h*+1} - 1.

4.6 A *full* *node* is a node with two children. Prove that the number of full nodes plus one is equal to the number of leaves in a binary tree.

4.7 Suppose a binary tree has leaves *l*_{1}, *l*_{2}, . . . , *l _{m}* at depths *d*_{1}, *d*_{2}, . . . , *d _{m}*, respectively. Prove that Σ_{*i*=1}^{*m*} 2^{-*d_i*} ≤ 1, and determine when the equality is true.

4.8 Give the prefix, infix, and postfix expressions corresponding to the tree in Figure 4.64.

4.9 a. Show the result of inserting 3, 1, 4, 6, 9, 2, 5, 7 into an initially empty binary search tree.

b. Show the result of deleting the root.

4.10 Write routines to implement the basic binary search tree operations.

4.11 Binary search trees can be implemented with cursors, using a strategy similar to a cursor linked list implementation. Write the basic binary search tree routines using a cursor implementation.

4.12 Suppose you want to perform an experiment to verify the problems that can be caused by random *insert/delete* pairs. Here is a strategy that is not perfectly random, but close enough. You build a tree with *n* elements by inserting *n* elements chosen at random from the range 1 to *m* = α*n*. You then perform *n*^{2} pairs of insertions followed by deletions. Assume the existence of a routine, *rand_int*(*a*,*b*), which returns a uniform random integer between *a* and *b* inclusive.

c. What is a good choice of α? Why?

4.13 Write a program to evaluate empirically the following strategies for deleting nodes with two children:

a. Replace with the largest node, *X*, in *T _{L}* and recursively delete *X*.

4.14 ** Prove that the depth of a random binary search tree (depth of the deepest node) is *O*(log *n*), on average.

4.15 *a. Give a precise expression for the minimum number of nodes in an AVL tree of height *h*.

b. What is the minimum number of nodes in an AVL tree of height 15?

4.16 Show the result of inserting 2, 1, 4, 5, 9, 3, 6, 7 into an initially empty AVL tree.

4.17 * Keys 1, 2, . . . , 2^{*k*} - 1 are inserted in order into an initially empty AVL tree. Prove that the resulting tree is perfectly balanced.

4.18 Write the remaining procedures to implement AVL single and double rotations.

4.19 Write a nonrecursive function to insert into an AVL tree.

4.20 * How can you implement (nonlazy) deletion in AVL trees?

4.21 a. How many bits are required per node to store the height of a node in an *n*-node AVL tree?

b. What is the smallest AVL tree that overflows an 8-bit height counter?

4.22 Write the functions to perform the double rotation without the inefficiency of doing two single rotations.

4.23 Show the result of accessing the keys 3, 9, 1, 5 in order in the splay tree in Figure 4.65.

4.24 Show the result of deleting the element with key 6 in the resulting splay tree for the previous exercise.

4.25 Nodes 1 through *n* = 1024 form a splay tree of left children.

a. What is the internal path length of the tree (exactly)?

*c. If the sequence of successive finds is continued, when is the internal path length minimized?

4.26 a. Show that if all nodes in a splay tree are accessed in sequential order, the resulting tree consists of a chain of left children.

4.27 Write a program to perform random operations on splay trees. Count the total number of rotations performed over the sequence. How does the running time compare to AVL trees and unbalanced binary search trees?

4.28 Write efficient functions that take only a pointer to a binary tree, *T*, and compute

c. the number of full nodes in *T*

What is the running time of your routines?

4.29 Write a function to generate an *n*-node random binary search tree with distinct keys 1 through *n*. What is the running time of your routine?

4.30 Write a function to generate the AVL tree of height *h* with fewest nodes. What is the running time of your function?

4.31 Write a function to generate a perfectly balanced binary search tree of height *h* with keys 1 through 2* ^{h}*+1 - 1. What is the running time of your function?

4.32 Write a function that takes as input a binary search tree, *T*, and two keys, *k*_{1} and *k*_{2}, which are ordered so that *k*_{1} ≤ *k*_{2}, and prints all elements *x* in the tree such that *k*_{1} ≤ *key*(*x*) ≤ *k*_{2}. Do not assume any information about the type of keys except that they can be ordered (consistently). Your program should run in *O*(*K* + log *n*) average time, where *K* is the number of keys printed. Bound the running time of your algorithm.

4.33 The larger binary trees in this chapter were generated automatically by a program. This was done by assigning an (*x, y*) coordinate to each tree node, drawing a circle around each coordinate (this is hard to see in some pictures), and connecting each node to its parent. Assume you have a binary search tree stored in memory (perhaps generated by one of the routines above) and that each node has two extra fields to store the coordinates.

4.34 Write a general-purpose tree-drawing program that will convert a tree into the following graph-assembler instructions:

4.35 Write a routine to list out the nodes of a binary tree in *level-order*. List the root, then nodes at depth 1, followed by nodes at depth 2, and so on. You must do this in linear time. Prove your time bound.

4.36 a. Show the result of inserting the following keys into an initially empty 2-3 tree: 3, 1, 4, 5, 9, 2, 6, 8, 7, 0.

b. Show the result of deleting 0 and then 9 from the 2-3 tree created in part (a).

4.37 *a. Write a routine to perform insertion into a B-tree.

4.38 A *B\*-tree* of order *m* is a B-tree in which each interior node has between 2*m*/3 and *m* children. Describe a method to perform insertion into a B*-tree.

4.39 Show how the tree in Figure 4.66 is represented using a child/sibling pointer implementation.

4.40 Write a procedure to traverse a tree stored with child/sibling links.

4.41 Two binary trees are similar if they are both empty or both nonempty and have similar left and right subtrees. Write a function to decide whether two binary trees are similar. What is the running time of your program?

4.42 Two trees, *T*_{1} and *T*_{2}, are *isomorphic* if *T*_{1} can be transformed into *T*_{2} by swapping left and right children of (some of the) nodes in *T*_{1}. For instance, the two trees in Figure 4.67 are isomorphic because they are the same if the children of *A*, *B*, and *G*, but not the other nodes, are swapped.

a. Give a polynomial time algorithm to decide if two trees are isomorphic.

*b. What is the running time of your program (there is a linear solution)?

4.43 *a. Show that via AVL single rotations, any binary search tree *T*_{1} can be transformed into another search tree *T*_{2} (with the same keys).

*b. Give an algorithm to perform this transformation using *O*(*n* log *n*) rotations on average.

**c. Show that this transformation can be done with *O*(*n*) rotations, worst-case.

4.44 Suppose we want to add the operation *find_kth* to our repertoire. The operation *find_kth*(*T*,*i*) returns the element in tree *T* with the *i*th smallest key. Assume all elements have distinct keys. Explain how to modify the binary search tree to support this operation in *O*(log *n*) average time, without sacrificing the time bounds of any other operation.

4.45 Since a binary search tree with *n* nodes has *n* + 1 NULL pointers, half the space allocated in a binary search tree for pointer information is wasted. Suppose that if a node has a NULL left child, we make its left child point to its inorder predecessor, and if a node has a NULL right child, we make its right child point to its inorder successor. This is known as a *threaded tree*, and the extra pointers are called *threads*.

a. How can we distinguish threads from real children pointers?

c. What is the advantage of using threaded trees?

4.46 A binary search tree presupposes that searching is based on only one key per record. Suppose we would like to be able to perform searching based on either of two keys, *key*_{1} or *key*_{2}.

b. An alternative method is a *2-d tree*. A 2-d tree is similar to a binary search tree, except that branching at even levels is done with respect to *key*_{1}, and branching at odd levels is done with *key*_{2}. Figure 4.68 shows a 2-d tree, with the first and last names as keys, for post-WWII presidents. The presidents' names were inserted chronologically (Truman, Eisenhower, Kennedy, Johnson, Nixon, Ford, Carter, Reagan, Bush). Write a routine to perform insertion into a 2-d tree.

c. Write an efficient procedure that prints all records in the tree that simultaneously satisfy the constraints *low*_{1} ≤ *key*_{1} ≤ *high*_{1} and *low*_{2} ≤ *key*_{2} ≤ *high*_{2}.

d. Show how to extend the 2-d tree to handle more than two search keys. The resulting strategy is known as a *k*-d tree.

More information on binary search trees, and in particular the mathematical properties of trees, can be found in the two books by Knuth [23] and [24].

Several papers deal with the lack of balance caused by biased deletion algorithms in binary search trees. Hibbard's paper [20] proposed the original deletion algorithm and established that one deletion preserves the randomness of the trees. A complete analysis has been performed only for trees with three [21] and four nodes [5]. Eppinger's paper [15] provided early empirical evidence of nonrandomness, and the papers by Culberson and Munro [11], [12] provide some analytical evidence (but not a complete proof for the general case of intermixed insertions and deletions).

AVL trees were proposed by Adelson-Velskii and Landis [1]. Simulation results for AVL trees, and variants in which the height imbalance is allowed to be at most *k* for various values of *k*, are presented in [22]. A deletion algorithm for AVL trees can be found in [24]. Analysis of the average depth of AVL trees is incomplete, but some results are contained in [25].

[3] and [9] considered self-adjusting trees like the type in Section 4.5.1. Splay trees are described in [29].

B-trees first appeared in [6]. The implementation described in the original paper allows data to be stored in internal nodes as well as leaves. The data structure we have described is sometimes known as a B^{+}-tree. A survey of the different types of B-trees is presented in [10]. Empirical results for the various schemes are reported in [18]. Analysis of 2-3 trees and B-trees can be found in [4], [14], and [33].

Exercise 4.14 is deceptively difficult. A solution can be found in [16]. Exercise 4.26 is from [32]. Information on B*-trees, described in Exercise 4.38, can be found in [13]. Exercise 4.42 is from [2]. A solution to Exercise 4.43 using 2*n* - 6 rotations is given in [30]. Using threads, à la Exercise 4.45, was first proposed in [28]. *k*-d trees were first proposed in [7]. Their major drawback is that both deletion and balancing are difficult. [8] discusses *k*-d trees and other methods used for multidimensional searching.

Other popular balanced search trees are red-black trees [19] and weight-balanced trees [27]. More balanced tree schemes can be found in the books [17], [26], and [31].

1. G. M. Adelson-Velskii and E. M. Landis, "An Algorithm for the Organization of Information," *Soviet Math. Doklady* 3 (1962), 1259-1263.

2. A. V. Aho, J. E. Hopcroft, and J. D. Ullman, *The Design and Analysis of Computer Algorithms*, Addison-Wesley, Reading, MA, 1974.

3. B. Allen and J. I. Munro, "Self Organizing Search Trees," *Journal of the ACM*, 25 (1978), 526-535.

4. R. A. Baeza-Yates, "Expected Behaviour of B^{+}-trees under Random Insertions," *Acta Informatica* 26 (1989), 439-471.

5. R. A. Baeza-Yates, "A Trivial Algorithm Whose Analysis Isn't: A Continuation," *BIT* 29 (1989), 88-113.

6. R. Bayer and E. M. McCreight, "Organization and Maintenance of Large Ordered Indices," *Acta Informatica* 1 (1972), 173-189.

7. J. L. Bentley, "Multidimensional Binary Search Trees Used for Associative Searching," *Communications of the ACM* 18 (1975), 509-517.

8. J. L. Bentley and J. H. Friedman, "Data Structures for Range Searching," *Computing Surveys* 11 (1979), 397-409.

9. J. R. Bitner, "Heuristics that Dynamically Organize Data Structures," *SIAM Journal on Computing* 8 (1979), 82-110.

10. D. Comer, "The Ubiquitous B-tree," *Computing Surveys* 11 (1979), 121-137.

11. J. Culberson and J. I. Munro, "Explaining the Behavior of Binary Search Trees under Prolonged Updates: A Model and Simulations," *Computer Journal* 32 (1989), 68-75.

12. J. Culberson and J. I. Munro, "Analysis of the Standard Deletion Algorithms in Exact Fit Domain Binary Search Trees," *Algorithmica* 5 (1990), 295-311.

13. K. Culik, T. Ottman, and D. Wood, "Dense Multiway Trees," *ACM Transactions on Database Systems* 6 (1981), 486-512.

14. B. Eisenbarth, N. Ziviani, G. H. Gonnet, K. Mehlhorn, and D. Wood, "The Theory of Fringe Analysis and its Application to 2-3 Trees and B-trees," *Information and Control* 55 (1982), 125-174.

15. J. L. Eppinger, "An Empirical Study of Insertion and Deletion in Binary Search Trees," *Communications of the ACM* 26 (1983), 663-669.

16. P. Flajolet and A. Odlyzko, "The Average Height of Binary Trees and Other Simple Trees," *Journal of Computer and System Sciences* 25 (1982), 171-213.

17. G. H. Gonnet and R. Baeza-Yates, *Handbook of Algorithms and Data Structures*, second edition, Addison-Wesley, Reading, MA, 1991.

18. E. Gudes and S. Tsur, "Experiments with B-tree Reorganization," *Proceedings of ACM SIGMOD Symposium on Management of Data* (1980), 200-206.

19. L. J. Guibas and R. Sedgewick, "A Dichromatic Framework for Balanced Trees," *Proceedings of the Nineteenth Annual IEEE Symposium on Foundations of Computer Science* (1978), 8-21.

20. T. H. Hibbard, "Some Combinatorial Properties of Certain Trees with Applications to Searching and Sorting," *Journal of the ACM* 9 (1962), 13-28.

21. A. T. Jonassen and D. E. Knuth, "A Trivial Algorithm Whose Analysis Isn't," *Journal of Computer and System Sciences* 16 (1978), 301-322.

22. P. L. Karlton, S. H. Fuller, R. E. Scroggs, and E. B. Kaehler, "Performance of Height Balanced Trees," *Communications of the ACM* 19 (1976), 23-28.

23. D. E. Knuth, *The Art of Computer Programming: Volume 1: Fundamental Algorithms*, second edition, Addison-Wesley, Reading, MA, 1973.

24. D. E. Knuth, *The Art of Computer Programming: Volume 3: Sorting and Searching*, second printing, Addison-Wesley, Reading, MA, 1975.

25. K. Mehlhorn, "A Partial Analysis of Height-Balanced Trees under Random Insertions and Deletions," *SIAM Journal on Computing* 11 (1982), 748-760.

26. K. Mehlhorn, *Data Structures and Algorithms 1: Sorting and Searching*, Springer-Verlag, Berlin, 1984.

27. J. Nievergelt and E. M. Reingold, "Binary Search Trees of Bounded Balance," *SIAM Journal on Computing* 2 (1973), 33-43.

28. A. J. Perlis and C. Thornton, "Symbol Manipulation in Threaded Lists," *Communications of the ACM* 3 (1960), 195-204.

29. D. D. Sleator and R. E. Tarjan, "Self-adjusting Binary Search Trees," *Journal of the ACM* 32 (1985), 652-686.

30. D. D. Sleator, R. E. Tarjan, and W. P. Thurston, "Rotation Distance, Triangulations, and Hyperbolic Geometry," *Journal of the AMS* (1988), 647-682.

31. H. F. Smith, *Data Structures: Form and Function*, Harcourt Brace Jovanovich, 1987.

32. R. E. Tarjan, "Sequential Access in Splay Trees Takes Linear Time," *Combinatorica* 5 (1985), 367-378.

33. A. C. Yao, "On Random 2-3 trees," *Acta Informatica* 9 (1978), 159-170.