*Considers how collections of information may be stored to allow*
- *efficient searches*
- *efficient traversals*

*Develops efficient in-memory sorting algorithms*
- *to illustrate the principles of good design in algorithm development*
- *to show how sorting is used to facilitate searching*

*Worst-case and average times are developed for*
- *linear search*
- *binary search*
- *maximum entry sort*
- *bubble sort*
- *insertion sort*
- *heapsort*
- *quicksort*

*Illustrates the use of simulation in analyzing the execution time of a program*

So far in this book, arrays, lists, binary trees, and trees--the basic data structures--have been introduced and applied to a variety of problems. Selecting the appropriate data structure requires knowing what operations are to be performed on it. The frequency with which these operations are to be performed, the characteristics of the data to be stored, and the average and worst-case time and storage requirements all combine to determine this choice. Other considerations, such as clarity and adaptability, may also serve as criteria for deciding which data structures to use. It is rarely possible to pick a data structure that is best for all criteria. Instead, trade-offs must be made.

In the case study of family relationships in Section 6.5, one of the important operations was a search of the nametable data base. At that time no decisions were made as to how the nametable data base would be structured to facilitate the search. This chapter and the next will show that many implementations are possible and that the appropriate choice is determined by the kinds of operations to be performed on the data base. Nametable might be organized in many ways. It can be implemented as an array, a list, a binary search tree, a balanced binary search tree, or a hash table. The techniques used to create and search these structures are introduced in this and the next chapter.

Searching and sorting masses of data to retrieve specific information and organize it for manipulation and presentation are nothing new. These basic operations were performed and studied before the advent of computers. Volumes have been written on these topics, yet they are still being researched. Even the general public and the press are evidently interested in searching, as you can see from the following letter and response:

"Why is it that whenever I lose anything, it is always in the last place I look?"

"Because when you find it you stop looking--and that's the last place you looked."

Normally, the objects stored, searched for, and sorted are records. In sorting, one field of the records, called the *sort key*, is chosen as the basis for the sort. Thus, in a payroll application, an employee name field might be the key for an alphabetical ordering, while a salary field might serve as the basis for a sort to determine an ordering by amount of earnings. At times the sort key can be a combination of fields. As an example, we may desire to sort with name as the primary key and zip code as the secondary key. For simplicity it will be assumed in this chapter that the objects being sorted are integer numbers. However, any variable type could have been assumed. For sorting purposes, the only requirement is that, given any two objects, it is possible to tell which precedes the other.

Searching normally involves comparing records to a given *search key* value. The goal is to determine whether there exists among the records of the data base a record with key equal to that of the search key. If the key value can be matched, then a pointer to the record with the matching value must be returned. The basic principle for shortening search times is to organize the data base so that the search can be focused quickly on the part of the data base that will contain the desired record, if it is present.

The chapter concludes with a summary of the strengths and weaknesses of each approach introduced. Knuth [1973b] and Wirth [1976] discuss many searching and sorting algorithms that are not dealt with in this text. Knuth also contains mathematical analyses of many of these algorithms.

Three simple searches are applicable when records are stored in arrays: (1) linear search, (2) binary search, and (3) interpolation search. You probably have all used these methods in everyday activities. Understanding their limitations and how they work will help you see the need for the more complex structures and algorithms of later sections.

A *linear search* is so named because it proceeds in sequence, linearly through an array. Suppose data in Figure 8.1 is an array of integers. The task is to determine whether the number stored in the variable key is in the array, and if so, what its index is.

One obvious method would be to traverse the data array, checking each element of data to see if it contains the same value as key. If an element currently being processed did contain the same value as key, its index would simply be returned, and the search would be complete. Arriving at the end of data without having found the key value would mean that no element with that value is stored in the array. This procedure is easily implemented using a loop structure (as in the following function). The worst-case time required for its execution will be of the order *n*, written *O*(*n*), where *n* is the length of the data array. In particular, if the value being sought is in element *i* of the array, it will take *O*(*i*) time to find it. The linear search is implemented as follows:

    #define MAX 100

    typedef struct
    {
        int key;
    } record;
    typedef record recordarray[MAX];

    linearsearch(data,n,key,pfound,ploc)
    /* Searches the n records stored in
       the array data for the search key.
       Found will be true only if a record
       is present with key field equal to
       the search key value and then loc
       will contain its array index.
    */
    recordarray data;
    int n,key,*pfound,*ploc;
    {
        *ploc = 1;
        while ((*ploc <= n) && (data[*ploc].key != key))
            (*ploc)++;
        *pfound = (*ploc < n + 1);
    }

A special but common case is when the search is always for a key known to be stored in the array, and each key stored is distinct and equally likely to be the search key. This is like standing in front of a locked door with a bunch of keys. You need to find the key that opens the door from among *n* keys in arbitrary order on your key-ring. The average time for such a search would be *O*(*n*/2). That is, on the average, you would look at half the keys, or half the elements of the array, before finding the right one.

Thus the procedure called *linear search* is so named, not only because it proceeds linearly through an array, but also because its processing time increases linearly with *n*; when *n* is doubled, the time required to complete the search is doubled.

The loop in linearsearch makes two tests on each pass: one to guard against running off the end of the array, and one to compare keys. A programming trick eliminates the first test: store the search key itself as a sentinel just beyond the last record, so the loop is guaranteed to stop even when the key is absent.

    #define MAX 100

    typedef struct
    {
        int key;
    } record;
    typedef record recordarray[MAX];

    linearsearch(data,n,key,pfound,ploc)
    /* Searches the n records stored in
       the array data for the search key.
       Found will be true only if a record
       is present with key field equal to
       the search key value and then loc
       will contain its array index.
    */
    recordarray data;
    int n,key,*pfound,*ploc;
    {
        *ploc = 1;
        data[n + 1].key = key;   /* sentinel; assumes room at data[n + 1] */
        while (data[*ploc].key != key)
            (*ploc)++;
        *pfound = (*ploc < n + 1);
    }

Another way to save time in a linear search is to order the records in the array. One way to do this is to store them in decreasing order of key value. The linear search can then be modified by terminating the search as soon as a record is reached whose key value is less than the search key. The desired record cannot be in the array past that point. Although this results in extra comparisons for successful searches, time is saved overall when unsuccessful searches occur frequently enough.

It is also possible, if the frequency with which records will be retrieved is known, to store the records in decreasing order of retrieval frequency. The most frequently retrieved record will then be at the top of the array. This approach significantly improves search efficiency when the desired record is likely to be near the top of the array. In fact, if the frequencies decrease rapidly enough, the average search time for the linear search will be a small constant.

Certainly no one does a linear search to look for a phone number in the telephone directory. Imagine trying to do so in the directory for New York City or any other large city! But there are ways to carry out the search so that the name is found in a few seconds, even from among a few million names. The structure of the telephone directory is the clue to the solution, which takes advantage of the alphabetical ordering of the entries in the phone book.

Assume that the numbers stored in data are arranged in decreasing order, as in Figure 8.2. They are said to be *sorted* in decreasing order. Since a telephone book is sorted in alphabetical order, one of the things most people do by habit when using the telephone directory is to flip the directory open toward the front, middle, or back depending on the location in the alphabet of the name being sought. If the page opened to has names farther down in the alphabet, the search name cannot be on that page or anywhere in the directory to its right; it can only be to the left of the selected page. This process of elimination is behind the search technique called the *binary search*.

The basic structure of the binary search algorithm is a *loop*, as shown in the following function.

    #define TRUE 1
    #define FALSE 0
    #define MAX 100

    typedef struct
    {
        int key;
    } record;
    typedef record recordarray[MAX];

    binarysearch(data,n,key,pfound,ploc)
    /* Searches the n records stored in order by
       key field value (largest first) in the array
       data for the search key. Found will be true
       only if a record is present with key field
       equal to the search key value and then loc
       will contain its array index.
    */
    recordarray data;
    int n,key,*pfound,*ploc;
    {
        int top,mid,bottom;
        top = 1;
        bottom = n;
        *pfound = FALSE;
        *ploc = 0;
        while ((top <= bottom) && !(*pfound))
        {
            mid = (top + bottom)/2;
            if (data[mid].key == key)
            {
                *pfound = TRUE;
                *ploc = mid;
            }
            else if (data[mid].key < key)
                bottom = mid - 1;
            else
                top = mid + 1;
        }
    }

Top and bottom are initialized to 1 and n. They point to the first and last of the consecutive elements left to be searched, all others having been eliminated from consideration. On each pass through the loop, mid is set to the middle element of that sequence, and key is tested against data[mid]. If they are equal, the program exits the loop and the search is done, since key has been found at element mid. If not, top and bottom must be updated to reflect the new sequence of elements to be searched. This updating must set bottom to mid - 1 whenever key is greater than data[mid], and top to mid + 1 otherwise. The test for continuing the loop involves checking whether key was found, and whether any elements are left to be searched between top and bottom. The search is terminated as soon as the desired entry is found or as soon as it is concluded that the entry is not present. Loc will be zero if the key value was not found and will index its location in data if found.

How much time is required by the binary search algorithm in the worst case? Each pass through the loop requires at most a constant amount of work. The time to execute the algorithm will then be determined by the number of passes through the loop, plus, of course, a constant time for initialization and finalization. What is the greatest number of loop iterations executed?

Sometimes, telephone directories, like dictionaries and encyclopedias, have tabs so that users can flip open the book to the beginning of the section containing entries with a specific first letter and gauge whether to search toward the front, middle, or back pages of the section. The tabs have been inserted or interpolated within the data to guide the search. This represents a generalization of the binary search called the *interpolation search*. The topic will not be pursued in detail, but note that this technique allows the elimination of greater fractions of the remaining names than does a binary search. Not surprisingly, this tends to reduce the execution time.

When *n* is large, the binary search takes much less time, in the worst case, than the linear search. This is because it makes lg *n* rather than *n* comparisons, going through its loop body lg *n* rather than *n* times. However, the steps performed in the loop body of the binary search are more complex and more time-consuming than the steps performed in the loop body for the linear search. Hence it is possible, for smaller values of *n*, that the linear search will be faster. Similarly, the interpolation search will need to perform more complex steps in its loop body, even though it may need to loop fewer times than the binary search.

We now turn our attention to three simple sorting methods applicable to records stored in arrays: (1) maximum entry sort, (2) bubble sort, and (3) insertion sort. They are also applicable to records stored in lists. Understanding their limitations and how they work sets the stage for the more complex structures and algorithms of the next section.

    #define MAX 100

    typedef struct
    {
        int key;
    } record;
    typedef record recordarray[MAX];

    maxloc(data,next,n)
    /* Returns the array index of the
       largest entry in data between the
       entries indexed by next and n.
    */
    recordarray data;
    int next,n;
    {
        int largest,i,tmaxloc;
        tmaxloc = next;
        largest = data[next].key;
        for (i = next + 1; i <= n; i++)
            if (data[i].key > largest)
            {
                tmaxloc = i;
                largest = data[i].key;
            }
        return(tmaxloc);
    }

    maxentrysort(data,n)
    /* Sorts the n records stored in array data
       in descending order.
    */
    recordarray data;
    int n;
    {
        int next,loc;
        for (next = 1; next <= n - 1; next++)
        {
            loc = maxloc(data,next,n);
            interchange(data,next,loc);
        }
    }

For the simple record used here for illustration, the interchange function is as follows.

    interchange(data,i,j)
    /* Interchanges the entries of
       array data indexed by i and j.
    */
    recordarray data;
    int i,j;
    {
        record tempdata;
        tempdata.key = data[i].key;
        data[i].key = data[j].key;
        data[j].key = tempdata.key;
    }

The *bubble sort* gets its name from the way smaller entries "sink" to the bottom of the array during the sort while larger entries "bubble up" to the top. To carry out the bubble sort on the array shown in Figure 8.4(a), traverse the array, comparing two adjacent elements, starting with the first two, and ending with the last two. Whenever the two elements compared are out of order, interchange them.

In general, after the *k*th pass, the *k* smallest entries will be in order in the bottom *k* elements of the array. Of course, more than the *k* smallest may be in order at this point, but only *k* can be guaranteed. Hence the array will be in sorted order after at most *n* - 1 passes. The procedure can therefore be modified to compare down to only the [*n* - *k*]th element on the *k*th pass. The bubble sort is implemented with the following function.

    #define TRUE 1
    #define FALSE 0
    #define MAX 100

    typedef struct
    {
        int key;
    } record;
    typedef record recordarray[MAX];

    bubblesort(data,n)
    /* Sorts the n records stored in array data
       in descending order.
    */
    recordarray data;
    int n;
    {
        int i,done;
        done = FALSE;
        while (!done)
        {
            done = TRUE;
            for (i = 1; i <= n - 1; i++)
                if (data[i + 1].key > data[i].key)
                {
                    interchange(data,i,i + 1);
                    done = FALSE;
                }
        }
    }

If the array were initially given in sorted order, only one pass would be required, and the time would be proportional to *n*. If the array were initially given in *reverse* sorted order, then *n* - 1 passes would be required, with (*n* - *k*) interchanges and comparisons on the *k*th pass. This is the worst case, and its time is determined by the total number of interchanges and comparisons, (*n* - 1) + (*n* - 2) + ··· + 1, or [*n*(*n* - 1)]/2. The worst-case time for the bubble sort is *O*(*n*^{2}), just as for the maximum entry sort. However, the maximum entry sort will always take time *O*(*n*^{2}), since the number of comparisons it makes is not dependent on the data. The time taken by the bubble sort thus varies from *O*(*n*) for sorted data to *O*(*n*^{2}) for reverse sorted data. The more ordered the initial data, the shorter the time.

The next sort to be considered is the insertion sort. An *insertion sort* works by assuming that all entries in the array are already in sorted order and inserting the next entry in its proper place in the order. Figure 8.5 presents an example of an insertion sort.

    #define TRUE 1
    #define FALSE 0
    #define MAX 100

    typedef struct
    {
        int key;
    } record;
    typedef record recordarray[MAX];

    insertionsort(data,n)
    /* Sorts the n records stored in array data
       in descending order.
    */
    recordarray data;
    int n;
    {
        int k,i,inserted;
        for (k = 2; k <= n; k++)
        {
            i = k;
            inserted = FALSE;
            while ((i > 1) && !inserted)
                if (data[i - 1].key < data[i].key)
                {
                    interchange(data,i,i - 1);
                    i--;
                }
                else
                    inserted = TRUE;
        }
    }

Within the loop, process the kth record by inserting it properly among the top k elements. This may be accomplished by another loop, in which record k is compared to each consecutive predecessor until it does not exceed the current predecessor record. It is inserted at that point in the array. As an exercise, you should modify the insertionsort program to incorporate the programming trick of Section 8.2.2. In Table 8.1 (see Section 8.7), this is referred to as the *modified* insertion sort.

If the initial array were given in sorted order, the insertion of the *k*th record would require one comparison and no insertions. The total time would be *O*(*n*). If the initial array were given in reverse sorted order, the insertion of the *k*th record would require (*k* - 1) comparisons, and the *k*th record would have to be inserted at the top of the array. To accomplish this, each of the (*k* - 1) records must be moved down one element. The total time will be the time for the 1 + 2 + ··· + (*n* - 1) comparisons, plus the time for the 1 + 2 + ··· + (*n* - 1) shifts down. Consequently, the total time is *O*(*n*^{2}). This represents the worst case. Again we see that the time can vary with *n*^{2} for large *n*. The insertion sort, like the bubble sort, performs better the closer the initial data is to being sorted. If each of the *n*! possible initial orderings of an array with *n* distinct entries is equally likely to occur, then the average time for all three of the sorts considered so far is also *O*(*n*^{2}).

The next question is, can we improve the insertion sort? We could try to improve it by changing how the search is done for the correct location of the *k*th entry. In effect, the insertion sort traversed *linearly* through records (*k* - 1) to 1 to insert the *k*th record. Instead, we could do a *binary search* of the consecutive elements 1 to (*k* - 1). We will then know where to insert the *k*th record. This will reduce the 1 + 2 + ··· + (*n* - 1) comparisons to lg 2 + lg 3 + ··· + lg (*n* - 1), which is about *n* lg *n*. It does not, however, alleviate the need for the 1 + 2 + ··· + (*n* - 1) shifts due to the insertions, so the time is still *O*(*n*^{2}).

In this section and the next, two less obvious but faster sorts are considered. These two sorts, heapsort and quicksort, were developed from different points of view and as a result have different advantages.

Heapsorts use the concept of a heap. A *heap* is a binary tree with the property that all records stored in nodes are in sorted order along the path from each node back to the root. The tree of Figure 8.7 is not a heap.

To sort, output 536 first, and then remove it from the tree. Now suppose that somehow the records can be rearranged in the tree to create a new heap--a process called *reheaping*. The result might look like Figure 8.8(b). This tree has one less node than the original. Clearly, this process of removing the root can be repeated, outputting it next in the sorted output, and reheaping. When all the records have been output in sorted order, the remaining tree will be the null binary tree (the heap will be empty), and a sort will have been obtained. The algorithm is stated simply:

1. Create a heap from the records.
2. While the heap is not empty
   a. output the record at the root of the heap,
   b. remove it from the heap, and
   c. reheap.

Consider the heap of Figure 8.8(a) with one new record as in Figure 8.8(c). Notice that all paths are in order, except for the path from the new record at node 21 to the root. This is what prevents the tree from being a heap. To make it a heap, the insertion sort idea may be useful. Insert the new record at its proper place along the already sorted path from its predecessor node to the root, and allow 425 to work its way up along the path as far as it can go to reach its proper place. This can be done by comparing 425 to its predecessor and interchanging if it exceeds its predecessor's value. In this case, it does exceed that value, so interchange to obtain the tree shown in Figure 8.9(a).

This gives us the means to create a heap: insert the records one at a time, letting each new record work its way up as just described.

In order to analyze the time required, recall that the number of nodes *n* and the depth *d* of a complete binary tree are related by

*d* = ⌈lg (*n* + 1)⌉

One way to implement the heap creation algorithm is to make use of a function, or routine, that works on a binary tree whose left and right subtrees are already heaps. The function does the "shifting" down of the root record until it becomes the root of a subtree that remains a heap, or until it reaches a terminal node. This "shift" routine need not do the interchanging along the way. Instead it can determine the final position of the root being shifted down, moving each record along that path to its predecessor, finally placing the root where it belongs.

These tests can be avoided by always putting "plus infinity" at the root of the initial complete binary tree for the first algorithm, and "negative infinity" at the subtrees of all terminal nodes for the second algorithm. This is similar in effect to adding the search key to the end of the array in the linear search algorithm. *Plus infinity* and *negative infinity* simply denote key values guaranteed to compare, respectively, greater than and less than any key that can actually occur.

Think of step 1 of the heap sorting algorithm as phase I, in which a heap is created. The execution of step 2 until the records are sorted is phase II. In phase II, the root element of the heap is removed and output. The program must then reheap. This may be done in two ways.

We have been dealing with a *max heap*, in which the largest record is at the top. A *min heap*, with the smallest record at the top, is dealt with similarly. The only difference is that interchanging occurs when a successor's key is smaller, rather than larger.

2. Ptr points to a terminal node or to a node whose successors are nodes among last+1 to n.

Copy copies the contents of the record pointed to by its second parameter into the record pointed to by its first parameter. When shift determines that ptr is not the final position, it moves the record at ptr to its predecessor node. This is why the original record at root is saved. When the final position is found, the saved record is placed there. The initial heap and the situation after the first three entries have been output are shown in Figure 8.13 for the tree of Figure 8.7.

The implementation for heapsort follows.

    #define MAX 100

    typedef struct
    {
        int key;
    } record;
    typedef record recordarray[MAX];

    heapsort(data,n)
    /* Sorts the n records stored in array data
       in descending order.
    */
    recordarray data;
    int n;
    {
        int root,last;
        root = n/2;
        last = n;
        while (root > 0)
        {
            shift(data,root,last);
            root--;
        }
        root = 1;
        while (last > 1)
        {
            interchange(data,1,last);
            last--;
            shift(data,root,last);
        }
    }

    shift(data,root,last)
    /* Makes the subtree of root into
       a heap (min heap). The left and
       right subtrees of root must be
       heaps. All subtrees are stored
       in array data in sequential
       representation. Only the nodes
       between root and last are
       considered to be in the subtrees.
    */
    recordarray data;
    int root,last;
    {
        int ptr,succ;
        int keyvalue;
        record recordvalue;
        ptr = root;
        succ = 2 * ptr;
        if (succ < last)
            if (data[succ + 1].key < data[succ].key)
                succ++;
        /* at this point succ points to the smallest of ptr's successors */
        keyvalue = data[root].key;
        copy(&recordvalue,&data[root]);
        /* the purpose of the while loop is to determine the proper place
           for the original root record; ptr will point to that location
           when the loop is exited */
        while ((succ <= last) && (data[succ].key < keyvalue))
        {
            copy(&data[ptr],&data[succ]);
            ptr = succ;
            succ = 2 * ptr;
            if (succ < last)
                if (data[succ + 1].key < data[succ].key)
                    succ++;
        }
        copy(&data[ptr],&recordvalue);
    }

As a final example of a sorting method, we consider an algorithm called *quicksort*. This algorithm is appropriately named, since its average execution time is very short. Its worst-case execution time is slow, but the worst case should occur rarely.

Thus data is to be separated into two components to be sorted, an upper array and a lower array. We then have two options for proceeding. One is to *combine* or *merge* the resultant two sorted component arrays to achieve a final sorted solution. This method is the *merge sort*, which is not pursued here but is applied to the sorting of files in Chapter 11. The second approach, which is taken here, is to ensure that the resultant two sorted component arrays require no merging: once each component is sorted in place, the entire array is sorted.

Note that the fifth entry, 13, occupies the same position in the sorted array as in the initial array. This is because all entries in the upper component are no less, and all entries in the lower component are no greater, than 13. The key value 13 actually occupied its proper position in the sorted version of data before the two components were sorted. Hence it need not be included in either component and may serve as a separator or *partitioning entry* between the two components to be sorted.

The procedure quicksort may now be written, recursively, as follows.

    quicksort(i,j)
    /* Sorts the records stored in array data
       between positions indexed by i and j in
       descending order.
    */
    int i,j;
    {
        int p;
        if (i < j)
        {
            p = partition(i,j);
            quicksort(i,p-1);
            quicksort(p+1,j);
        }
    }

To sort a global array of n records, quicksort(1,n) is invoked. The crux of the algorithm is the partition function, which will be discussed shortly.

Every time a call is made to quicksort, two new obligations are incurred, and a current obligation is postponed. For instance, the initial call to quicksort generates the initial obligation: sort the n entries of data. During the execution of this call, partition determines a partitioning entry, and two new obligations are incurred. These are represented by the two recursive calls to quicksort, one for each component. The current obligation must then be postponed to carry out these new obligations. Each recursive call also generates two new obligations, requiring the execution of the current recursive call to be postponed.

As we have seen previously, what can be done recursively can also be done iteratively. Quicksort can be written nonrecursively by using a stack to store one of the new obligations, while the other can be processed immediately. It turns out that the choice of which of the two obligations to stack, and which to deal with immediately, is critical to the algorithm's storage requirements. The partition function does not necessarily split an array into two equal parts, as evidenced by our earlier example. Selecting the larger of the two components to stack results in a worst-case stack depth of *O *(lg *n*). An arbitrary choice instead leads to a worst-case stack depth of *O* (*n*). The difference in storage needed for the stack can be significant.

    quicksort(i,j)
    /* Sorts the records stored in array data
       between positions indexed by i and j in
       descending order.
    */
    int i,j;
    {
        int p,top,bottom;
        stack s;
        setstack(&s);
        push(j,&s);
        push(i,&s);
        while (!empty(&s))
        {
            pop(&s,&top);
            pop(&s,&bottom);
            while (top < bottom)
            {
                p = partition(top,bottom);
                if ((p - top) > (bottom - p))
                {
                    push(p-1,&s);
                    push(top,&s);
                    top = p + 1;
                }
                else
                {
                    push(bottom,&s);
                    push(p+1,&s);
                    bottom = p - 1;
                }
            }
        }
    }

We must now return to the partitioning task on which the efficiency of both versions of quicksort depends. To determine a partitioning entry, and a corresponding rearrangement, the first array entry is selected as a basis. This is a somewhat arbitrary selection and will be discussed more fully later. In general, this first array entry will not be in its proper sorted position. It will need to be shifted, and the array rearranged, so that all larger entries appear in the upper component and all smaller entries appear in the lower component.

A slight modification makes partition more efficient. Instead of *interchanging* the basis with the larger or smaller entry that has been found, simply *move* the larger or smaller entry to the position pointed to by the current stationary pointer. This is the position that would have been occupied by the basis (13 in the example). However, the basis does not now appear in the array but is saved in a temporary variable, saverecord. For the example, this would generate the sequence shown in Figure 8.17. When upper and lower meet, the record stored in saverecord must be copied into data[upper]. This version of partition is faster because a move is carried out much faster than an interchange, and many such operations are required as the algorithm executes. It is therefore the algorithm implemented below.

    partition(i,j)
    /* Rearranges the entries in array
       data between positions indexed by
       i and j and returns an array index
       which guarantees that all entries
       above it are larger than the entry
       it indexes, and all entries below it
       are smaller.
    */
    int i,j;
    {
        int upper,lower;
        record saverecord;
        upper = i;
        lower = j;
        copy(&saverecord,&data[i]);
        while (upper != lower)
        {
            while ((upper < lower) && (saverecord.key >= data[lower].key))
                lower--;
            if (upper != lower)
                copy(&data[upper],&data[lower]);
            while ((upper < lower) && (saverecord.key <= data[upper].key))
                upper++;
            if (upper != lower)
                copy(&data[lower],&data[upper]);
        }
        copy(&data[upper],&saverecord);
        return(upper);
    }

It is important to recognize that whenever partition (i,j) is invoked, it processes only entries between i and j of the data array and deals with each of these entries exactly once. Each entry is compared with saverecord, and perhaps copied into data [upper] or data [lower]. Consequently, a call to partition (i,j) takes time *O*(*j* - *i* + 1), since there are *j* - *i* + 1 entries between i and j.

We now want to analyze the nonrecursive version of quicksort to determine its time and storage requirements. To provide some insight into its behavior, we will simulate its application to our initial array, focusing on the changes that occur in this array and in the stack during execution.

Although sorting has been extensively researched, a new generalization of the quicksort, *distributive partitioning*, has recently been discovered [Dobosiewicz, 1976]. It appears to be even faster than quicksort. Recall that the quicksort was based on the idea of rearranging the initial array so that it could be split into two component arrays to be sorted independently. Each of these arrays was then recursively sorted in the same way. *Distributive partitioning* carries this idea to its logical limit, by splitting the initial array into *n* components.

It is often difficult, if not impossible, to analyze an algorithm mathematically and determine its time and storage requirements. Simulation is a useful basic tool for determining the behavior of an algorithm for this purpose. To illustrate, an algorithm for a simulation of the heapsort is constructed in this section.

The storage required by the heapsort algorithm in the implementation is *O*(*n*). A reasonable measure for the time required is given by the number of interchanges and comparisons necessary to produce a sort. Assume that a procedure treesort has been written, following one of the detailed implementations of heapsort discussed previously. It has parameters n, a, sa, int1, int2, comp1, and comp2. These represent, respectively, the number of records, the input array of records to be sorted, the output array of sorted records, the number of interchanges required in phases I and II, and the number of comparisons required in phases I and II.


------------------------------------------------
Input Class                       Min  Ave  Max
------------------------------------------------
"Nearly sorted"         int1       *    *    *
                        int2       *    *    *
                        comp1      *    *    *
                        comp2      *    *    *
"Random"                int1       *    *    *
                        int2       *    *    *
                        comp1      *    *    *
                        comp2      *    *    *
"Nearly reverse         int1       *    *    *
 sorted"                int2       *    *    *
                        comp1      *    *    *
                        comp2      *    *    *
------------------------------------------------

b. Initialize statistics tables.

d. Output the statistics tables for the current value of n and for each input class.

Task (i) may be carried out as follows:

a. Generate a "random" sample for a of size n and call treesort(n, a, sa, int1, int2, comp1, comp2).

b. Update the statistics table for "random" input.

e. Update the statistics tables for "nearly sorted" input.

f. Call treesort(n, ra, sa, int1, int2, comp1, comp2).

g. Update the statistics table for "nearly reverse sorted" input.

Assume that a function rng is available that returns a real number greater than 0.0 and less than 1.0. Whenever a sequence of these numbers is generated by consecutive calls to rng, the numbers appear, from a statistical point of view, to have been selected independently. Also assume that the returned value will lie in any segment of length *l* between 0 and 1 with probability *l*. For example, if a segment has length 0.25 (say, from 0.33 to 0.58), then the returned value will lie in that segment 25 percent of the time. The random number generator ranf of FORTRAN is an example of such a function.

To generate a *nearly sorted* or *nearly reverse sorted* input, start from the sorted (or reverse sorted) array and perform 0.2*n* random interchanges, selecting each as follows. Call rng twice; multiply each returned value by n, take the integer part, and add 1. Alternatively, evaluate rand() % (high - low + 1) + low twice, with high set to n and low set to 1. Either method yields two integers, i and j, between 1 and n. Interchange sa[i] with sa[j].
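A minimal sketch of this input-generation scheme, using the C library's rand() in place of the FORTRAN ranf (the function names here are ours, and arrays are 1-indexed as in the text):

```c
/* Illustrative input generators for the simulation. */
#include <stdlib.h>

/* Fill sa[1..n] with "random" keys between 1 and n (duplicates possible). */
void random_input(int sa[], int n)
{
    int i;
    for (i = 1; i <= n; i++)
        sa[i] = rand() % n + 1;
}

/* Perturb a sorted (or reverse sorted) array sa[1..n] with 0.2n random
   interchanges, producing "nearly sorted" or "nearly reverse sorted" input. */
void perturb(int sa[], int n)
{
    int k, i, j, t;
    for (k = 0; k < n / 5; k++) {
        i = rand() % n + 1;    /* rand()%(high-low+1)+low with low = 1, high = n */
        j = rand() % n + 1;
        t = sa[i]; sa[i] = sa[j]; sa[j] = t;
    }
}
```

Since perturb only interchanges existing entries, the perturbed array remains a permutation of the original keys.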

Table 8.1 gives simulation results (actual execution times, measured in milliseconds). It should be studied carefully. Note that the *O*(*n*²) and *O*(*n* lg *n*) sorts are easily identified. The best- and worst-case inputs are also evident, as is the relatively uniform behavior of heapsort.

We have consistently viewed the record as the basic unit of data. Collections of records usually appear in problems, and their processing typically reduces to traversing through the records of the collection, inserting or deleting records, randomly accessing records, accessing records in sorted order, or searching the collection for a given record. The basic data structures are available for storage of the collection, and each supports some operations well at the expense of others. Arrays are excellent for random access and traversal but poor for general insertions and deletions. Lists support traversal, insertions, and deletions well but are poor for random access.

(In each sort's five columns the input is, left to right: sorted, half
sorted, random, half reverse sorted, and reverse sorted.)

-----------------------------------------------------------------------------------------------------------
  n            Bubble Sort                     Insertion Sort               Modified Insertion Sort
-----------------------------------------------------------------------------------------------------------
 10      1     2     6     8     9        1     2     5     7     7       0     2     4     6     9
 20      1     8    23    33    36        1     5    16    28    30       1     6    17    25    30
 40      2    31   101   138   148        2    16    67   110   121       5    16    65   107   120
100      4   217   646   863   923        7    89   412   678   761       6    87   402   660   744
200      8   966  2653  3463  3724       12   364  1657  2706  3058      11   354  1612  2639  2993
400     17  3933 11277 14732 15858       24  1583  7645 12489 14595      20  1373  6352 10581 12157
-----------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------
  n             Heapsort                    Iterative Quicksort             Recursive Quicksort
-----------------------------------------------------------------------------------------------------------
 10      9     9     8     8     7       10     9     9     8    10        5     5     3     3     4
 20     18    19    18    18    15       21    19    15    17    23       13    11    10    11    14
 40     43    43    41    40    37       51    47    33    38    55       35    31    21    25    36
100    130   126   122   119   115      196   171    94   106   202      156   133    65    74   158
200    291   282   277   264   256      626   521   202   231   633      581   484   149   169   602
400    640   624   610   587   572     2175  1750   432   492  2195     2042  1644   318   372  2056
-----------------------------------------------------------------------------------------------------------

(In each sort's five columns the input is, left to right: sorted, half
sorted, random, half reverse sorted, and reverse sorted.)

* These results were produced by Sharon Doherty.

It is important to recognize that in this chapter we have been considering *comparative* searches and sorts. This means that we have assumed no special knowledge of the key values involved and have used only comparisons between key values in each search or sort algorithm. It is possible to prove that any such search or sort must take worst-case time at least proportional to lg *n* or *n* lg *n*, respectively.

Other kinds of search and sort algorithms are possible. Some may take significantly less time. For example, suppose you must sort *n* distinct keys whose values are known to be integers between 1 and 1,000. Simply store the record with key value *i* as the *i*th record in an array of records. This places the records in the array in increasing sorted order and takes *O*(*n*) time. Sorts that place records in groups, depending on their *actual* key values, and then attempt to sort further within the groups, are called *distributive sorts*. These are not considered in this text. We will, however, consider a very important noncomparative search in Section 9.3.
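The placement idea can be sketched as follows, assuming *n* distinct integer keys between 1 and 1,000 (array names are ours, and 0 is used to mark empty slots):

```c
/* Minimal sketch of sorting by placement: the record with key i goes
   into slot i.  Keys are distinct integers in 1..MAXKEY. */
#define MAXKEY 1000

/* in[0..n-1] holds n distinct keys in 1..MAXKEY; out[] must have
   MAXKEY + 1 slots.  After the call, out[i] is key i or 0 (empty). */
void placement_sort(const int in[], int n, int out[])
{
    int i;
    for (i = 1; i <= MAXKEY; i++)
        out[i] = 0;                 /* mark every slot empty */
    for (i = 0; i < n; i++)
        out[in[i]] = in[i];         /* key i lands in slot i */
}
```

Scanning the nonzero slots from left to right then yields the keys in increasing order; the whole process takes *O*(*n*) time plus a constant *O*(1,000) for the slot scan.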

As searching as this chapter may have been, it is hoped that it didn't leave you out of sorts!

**1. a.** If all comparisons and assignment statements take the same amount of time to execute, compare the execution times of the first linear search procedure and the procedure revised for greater efficiency.

**2.** When you look up a book by author in the library, what kind of a search are you doing?

**3.** In a *ternary search* of a sorted array, the array is divided into three "nearly" equal parts and searched. Key values are examined "near" 1/3 and "near" 2/3 and compared with the search key to determine in which third to search further. Determine the worst-case time for this search and compare it to the worst-case time for a binary search.

**4.** Suppose a sorted array stores 1,000 integers between 0 and 1,000,000. A pointer array p indicates where the least integer greater than 1,000 × (i - 1) but not greater than 1,000 × i appears in the array; p[i] contains that pointer for i = 1, 2, . . . , 1,000. If no such integer is in the array, then p[i] = 0. Write a function to do an interpolation search, coupled with a binary search.

**5.** Assuming all comparisons and assignments take the same time to execute, do the following.

**b.** For each value of *n*, determine which of the three sorts has least worst-case time.

**6.** What input order produces the worst-case time for

**7.** Write a function to carry out the bubble sort when the integers are stored in a chain.

**8.** Write a function to carry out the insertion sort when the integers are stored in a chain.

**9.** Modify the insertionsort function so that it does a binary search for the proper insertion position, and then does the insertion.

**10.** Modify the bubble sort function so that it does not deal with the last i elements after the ith pass.

**11.** Suppose input is given in the order

16, 18, 22, 15, 4, 50, 17, 31, 4, 90, 6, 25

Construct the heap produced by

**a.** The first heap creation algorithm of Section 8.4.

**b.** The second heap creation algorithm of Section 8.4.

**c.** How many comparisons and interchanges were required in Exercise 11(a) and Exercise 11(b)?

**12.** For each heap of Exercise 11(a) and 11(b), produce

**c.** How many comparisons and interchanges were required in Exercises 12(a) and 12(b)?

**13.** Why does the pointer array of Section 8.3.6 fail to help?

**14.** Show the entries of the data array, for the input of Exercise 11, when the heapsort function is applied after the first three elements have been output.

**15.** Modify the heapsort function so that it counts the number of interchanges and comparisons in each phase and returns their values as parameters.

**16.** Modify the heapsort function so it sorts a data array of pointers to records instead of records themselves.

**17.** What input sequence will cause the second heap creation algorithm to take worst-case time? Calculate the number of comparisons and interchanges it requires.

**18.** Determine the number of comparisons and copies required by quicksort for an array of *n* records, which is (a) in sorted order; (b) in reverse sorted order; (c) in order so that partition always produces two component arrays whose lengths differ by at most 1.

**19.** Modify partition so that it selects an entry at random to be used as the basis entry.

**20.** Modify partition so that it selects the median of the first, middle, and last entries as a basis entry.

**21.** Explain why the implementation given for quicksort requires a stack depth *O*(lg *n*).

**22.** Modify quicksort so that it sorts all component arrays of length at most 20 by using insertionsort.

**23.** Generating *n* integers at random to fill the array a in Section 8.6 was really only an approximation of a random input, because duplicate values might occur. While the effect of this duplication may be negligible for the simulation results, a "true" random input can be generated. Fill array a as follows: For i from 1 to n, set a[i] to i. Then, for i from 1 to n, pick an integer int at random between i and n, and interchange a[i] and a[int].

**a.** Convince yourself that this will generate a "true" random input.

**b.** Write a procedure to generate this "true" random input.

**24.** Consider the "perfect shuffle" example of Chapter 3. When 2*n* - 1 is a prime number *p*, then the number of shuffles required is *p* - 1. The initial configuration is in sorted order, and so is the final configuration after *p* - 1 shuffles. The configurations produced along the way represent different permutations of the initial input. With successive reshufflings, the configurations seem more and more like a "random" arrangement, and then less and less like a "random" arrangement. Take *p* = 101 and use the one hundred configurations, produced by the hundred shuffles required to reproduce the original, as inputs to the heapsort algorithm. Output the statistics table based on one hundred samples of an input of size 101.

**1.** Write and run the heapsort function of Section 8.4.7. A main program should read in examples for the input array a. For each example, n and the array a should be printed, and heapsort should be called to sort a. When heapsort returns, the output array sa should be printed, along with the values of int1, int2, comp1, and comp2. At least two examples for small n should be worked through by hand to test the correctness of heapsort.

**2.** Write an efficient recursive function minheaps(t, count, flag) to return in count the number of subtrees of the binary tree pointed to by t that are minheaps, and to return in flag *true* if t points to a minheap and *false* otherwise.