
Data Structures and Algorithms for External Storage

We begin this chapter by considering the differences in access characteristics between main memory and external storage devices such as disks. We then present several algorithms for sorting files of externally stored data. We conclude the chapter with a discussion of data structures and algorithms, such as indexed files and B-trees, that are well suited for storing and retrieving information on secondary storage devices.

11.1 A Model of External Computation

In the algorithms discussed so far, we have assumed that the amount of input data is sufficiently small to fit in main memory at the same time. But what if we want to sort all the employees of the government by length of service or store all the information in the nation's tax returns? In such problems the amount of data to be processed exceeds the capacity of the main memory. Most large computer systems have on-line external storage devices, such as disks or mass storage devices, on which vast quantities of data can be stored. These on-line storage devices, however, have access characteristics that are quite different from those of main memory. A number of data structures and algorithms have been developed to utilize these devices more effectively. This chapter discusses data structures and algorithms for sorting and retrieving information stored in secondary memory.

       Pascal, and some other languages, provide the file data type, which is intended to represent data stored in secondary memory. Even if the language being used does not have a file data type, the operating system undoubtedly supports the notion of files in secondary memory. Whether we are talking about Pascal files or files maintained by the operating system directly, we are faced with limitations on how files may be accessed. The operating system divides secondary memory into equal-sized blocks. The size of blocks varies among operating systems, but 512 to 4096 bytes is typical.

       We may regard a file as stored in a linked list of blocks, although more typically the operating system uses a tree-like arrangement, where the blocks holding the file are leaves of the tree, and interior nodes each hold pointers to many blocks of the file. If, for example, 4 bytes suffice to hold the address of a block, and blocks are 4096 bytes long, then a root block can hold pointers to up to 1024 blocks. Thus, files of up to 1024 blocks, i.e., about four million bytes, could be represented by a root block and blocks holding the file. Files of up to 2^20 blocks, or 2^32 bytes, could be represented by a root block pointing to 1024 blocks at an intermediate level, each of which points to 1024 leaf blocks holding a part of the file, and so on.
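
       For concreteness, the following small Pascal sketch (ours, not part of the text) reproduces this arithmetic for 4096-byte blocks and 4-byte block addresses; the constant names are invented for the example.

  program blockarithmetic;
  { a sketch of the arithmetic above: one root block holds 1024 pointers,
    so one level of pointers reaches about four million bytes, and two
    levels reach 2^20 blocks, that is, 2^32 bytes }
  const
      blocksize = 4096;   { bytes per block }
      addrsize = 4;       { bytes per block address }
  var
      perblock: longint;  { pointers that fit in one block }
  begin
      perblock := blocksize div addrsize;
      writeln('pointers per block : ', perblock);              { 1024 }
      writeln('bytes, one level   : ', perblock * blocksize);  { 4194304 }
      writeln('blocks, two levels : ', perblock * perblock)    { 1048576 }
  end.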

       The basic operation on files is to bring a single block to a buffer in main memory; a buffer is simply a reserved area of main memory whose size is the same as the size of a block. A typical operating system facilitates reading the blocks in the order in which they appear in the list of blocks that holds the file. That is, we initially read the first block into the buffer for that file, then replace it by the second block, which is written into the same buffer, and so on.

       We can now see the rationale behind the rules for reading Pascal files. Each file is stored in a sequence of blocks, with a whole number of records in each block. (Space may be wasted, as we avoid having one record split across block boundaries.) The read-cursor always points to one of the records in the block that is currently in the buffer. When that cursor must move to a record not in the buffer, it is time to read the next block of the file.

       Similarly, we can view the Pascal file-writing process as one of creating a file in a buffer. As records are "written" into the file, they are placed in the buffer for that file, in the position immediately following any previously placed records. When the buffer cannot hold another complete record, the buffer is copied into an available block of secondary storage and that block is appended to the end of the list of blocks for that file. We can now regard the buffer as empty and write more records into it.

The Cost Measure for Secondary Storage Operations

It is the nature of secondary storage devices such as disks that the time to find a block and read it into main memory is large compared with the time to process the data in that block in simple ways. For example, suppose we have a block of 1000 integers on a disk rotating at 1000 rpm. The time to position the head over the track holding this block (seek time) plus the time spent waiting for the block to come around to the head (latency time) might average 100 milliseconds. The process of writing a block into a particular place on secondary storage takes a similar amount of time. However, the machine could typically do 100,000 instructions in those 100 milliseconds. This is more than enough time to do simple processing on the thousand integers once they are in main memory, such as summing them or finding their maximum. It might even be sufficient time to quicksort the integers.

       When evaluating the running time of algorithms that operate on data stored as files, we are therefore forced to consider as of primary importance the number of times we read a block into main memory or write a block onto secondary storage. We call such an operation a block access. We assume the size of blocks is fixed by the operating system, so we cannot appear to make an algorithm run faster by increasing the block size, thereby decreasing the number of block accesses. As a consequence, the figure of merit for algorithms dealing with external storage will be the number of block accesses. We begin our study of algorithms for external storage by looking at external sorting.

11.2 External Sorting

Sorting data organized as files, or more generally, sorting data stored in secondary memory, is called "external" sorting. Our study of external sorting begins with the assumption that the data are stored on a Pascal file. We show how a "merge sorting" algorithm can sort a file of n records with only O(log n) passes through the file; that figure is substantially better than the O(n) passes needed by the algorithms studied in Chapter 8. Then we consider how utilization of certain powers of the operating system to control the reading and writing of blocks at appropriate times can speed up sorting by reducing the time that the computer is idle, waiting for a block to be read into or written out of main memory.

Merge Sorting

The essential idea behind merge sort is that we organize a file into progressively larger runs, that is, sequences of records r1, . . . , rk, where the key of ri is no greater than the key of ri+1 for 1 ≤ i < k. We say a file r1, . . . , rm of records is organized into runs of length k if for all i ≥ 1 such that ki ≤ m, the sequence rk(i-1)+1, rk(i-1)+2, . . . , rki is a run of length k, and furthermore, if m is not divisible by k, say m = pk + q where 0 < q < k, then the sequence of records rm-q+1, rm-q+2, . . . , rm, called the tail, is a run of length q. For example, the sequence of integers shown in Fig. 11.1 is organized into runs of length 3 as shown. Note that the tail is of length less than 3, but consists of records in sorted order, namely 5, 12.

Fig. 11.1. File with runs of length three.

       The basic step of a merge sort on files is to begin with two files, say f1 and f2, organized into runs of length k. Assume that

  1. the numbers of runs, including tails, on f1 and f2 differ by at most one,
  2. at most one of f1 and f2 has a tail, and
  3. the one with a tail has at least as many runs as the other.

Then it is a simple process to read one run from each of f1 and f2, merge the runs and append the resulting run of length 2k onto one of two files g1 and g2, which are being organized into runs of length 2k. By alternating between g1 and g2, we can arrange that these files are not only organized into runs of length 2k, but satisfy (1), (2), and (3) above. To see that (2) and (3) are satisfied it helps to observe that the tail among the runs of f1 and f2 gets merged into (or perhaps is) the last run created.

       We begin by dividing all n of our records into two files f1 and f2, as evenly as possible. Any file can be regarded as organized into runs of length 1. Then we can merge the runs of length 1 and distribute them into files g1 and g2, organized into runs of length 2. We make f1 and f2 empty, and merge g1 and g2 into f1 and f2, which will then be organized into runs of length 4. Then we merge f1 and f2 to create g1 and g2 organized into runs of length 8, and so on.

       After i passes of this nature, we have two files consisting of runs of length 2^i. If 2^i ≥ n, then one of the two files will be empty and the other will contain a single run of length n, i.e., it will be sorted. As 2^i ≥ n when i ≥ log n, we see that [log n] passes suffice. Each pass requires the reading of two files and the writing of two files, all of length about n/2. The total number of blocks read or written on a pass is thus about 2n/b, where b is the number of records that fit on one block. The number of block reads and writes for the entire sorting process is thus O((n log n)/b), or put another way, the amount of reading and writing is about the same as that required by making O(log n) passes through the data stored on a single file. This figure is a large improvement over the O(n) passes required by many of the sorting algorithms discussed in Chapter 8.

       Figure 11.2 shows the merge process in Pascal. We read two files organized into runs of length k and write two files organized into runs of length 2k. We leave to the reader the specification of an algorithm, following the ideas above, that uses the procedure merge of Fig. 11.2 [log n] times to sort a file of n records.

 
  procedure merge ( k: integer; { the input run length }
         var f1, f2, g1, g2: file of recordtype );

  var
      outswitch: boolean; { tells if writing g1 (true) or g2 (false) }
      winner: integer; { selects file with smaller key in current record }
      used: array [1..2] of integer; { used[j] tells how many
         records have been read so far from the current run of file fj }
      fin: array [1..2] of boolean; { fin[j] is true if we have
         finished the run from fj - either we have read k records,
         or reached the end of file fj }
      current: array [1..2] of recordtype; { the current records
         from the two files }

   procedure getrecord ( i: integer ); { advance file fi, but
          not beyond the end of the file or the end of the run.
          Set fin[i] if end of file or run found }
     begin
          if (used[i] = k) or
               (i = 1) and eof(f1) or
               (i = 2) and eof(f2) then fin[i] := true
          else begin
               used[i] := used[i] + 1;
               if i = 1 then read(f1, current[1])
               else read(f2, current[2])
          end
      end; { getrecord }

   begin { merge }
      outswitch := true; { first merged run goes to g1 }
      rewrite(g1); rewrite(g2);
      reset(f1); reset(f2);
      while not eof(f1) or not eof(f2) do begin { merge two files }
          { initialize }
          used[1] := 0; used[2] := 0;
          fin[1] := false; fin[2] := false;
          getrecord(1); getrecord(2);
          while not fin[1] or not fin[2] do begin { merge two runs }
              { select winner }
              if fin[1] then winner := 2
                  { f2 wins by "default" - run from f1 exhausted }
              else if fin[2] then winner := 1
                  { f1 wins by default }
              else { neither run exhausted }
                  if current[1].key < current[2].key then winner := 1
                  else winner := 2;
              { write winning record }
              if outswitch then write(g1, current[winner])
              else write(g2, current[winner]);
              { advance winning file }
              getrecord(winner)
          end;
          { we have finished merging two runs - switch output
              file and repeat }
          outswitch := not outswitch
      end
   end; { merge }

Fig. 11.2. The procedure merge.

Notice that the procedure merge, of Fig. 11.2, is not required ever to have a complete run in memory; it reads and writes a record at a time. It is our desire not to store whole runs in main memory that forces us to have two input files. Otherwise, we could read two runs at a time from one file.
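
       The text leaves the top-level sorting procedure as an exercise, but its shape is easy to suggest. The sketch below is ours, not Fig. 11.2; it assumes the n records have already been divided as evenly as possible between f1 and f2, as described above, and that recordtype is declared in the enclosing program. The roles of the two pairs of files alternate, the run length doubles on each pass, and the process stops when the second output file of a pass receives no runs, at which point the other output file holds a single sorted run.

  procedure mergesort ( var f1, f2, g1, g2: file of recordtype );
  { a sketch, not from the text; calls the procedure merge of Fig. 11.2 }
  var
      k: integer;       { current input run length }
      ftog: boolean;    { true if this pass reads f1, f2 and writes g1, g2 }
      done: boolean;
  begin
      k := 1;
      ftog := true;
      repeat
          if ftog then begin
              merge(k, f1, f2, g1, g2);
              reset(g2); done := eof(g2)   { only one run was produced }
          end
          else begin
              merge(k, g1, g2, f1, f2);
              reset(f2); done := eof(f2)
          end;
          k := 2 * k;
          ftog := not ftog
      until done
      { the sorted file is g1 if ftog is now false, and f1 otherwise }
  end; { mergesort }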

Example 11.1. Suppose we have the list of 23 numbers shown divided into two files in Fig. 11.3(a). We begin by merging runs of length 1 to create the two files of Fig. 11.3(b). For example, the first runs of length 1 are 28 and 31; these are merged by picking 28 and then 31. The next two runs of length one, 3 and 5, are merged to form a run of length two, and that run is placed on the second file of Fig. 11.3(b). The runs are separated in Fig. 11.3(b) by vertical lines that are not part of the file. Notice that the second file in Fig. 11.3(b) has a tail of length one, the record 22, while the first file has no tail.

       We go from Fig. 11.3(b) to (c) by merging runs of length two. For example, the two runs 28, 31 and 3, 5 are merged to form 3, 5, 28, 31 in Fig. 11.3(c). By the time we get to runs of length 16 in Fig. 11.3(e), one file has one complete run and the other has only a tail, of length 7. At the last stage, where the files are ostensibly organized as runs of length 32, we in fact have one file with a tail only, of length 23, and the second file is empty. The single run of length 23 is, of course, the desired sorted order.

Speeding Up Merge Sort

We have, for the sake of a simple example, shown the merge sort process as starting from runs of length one. We shall save considerable time if we begin with a pass that, for some appropriate k, reads groups of k records into main memory, sorts them, say by quicksort, and writes them out as a run of length k.

       For example, if we have a million records, it would take 20 passes through the data to sort starting with runs of length one. If, however, we can hold 10,000 records at once in main memory, we can, in one pass, read 100 groups of 10,000 records, sort each group, and be left with 100 runs of length 10,000 distributed evenly between two files. Seven more merging passes result in a file organized as one run of length 10,000 x 2^7 = 1,280,000, which is more than a million and means that the data are sorted.
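
       A tiny Pascal sketch (ours, not from the text) of this calculation simply doubles the run length until it reaches or exceeds the number of records.

  program countpasses;
  var
      n, runlen: longint;
      passes: integer;
  begin
      n := 1000000;       { one million records }
      runlen := 10000;    { length of the initial runs made by an in-memory sort }
      passes := 0;
      while runlen < n do begin
          runlen := 2 * runlen;   { each merging pass doubles the run length }
          passes := passes + 1
      end;
      writeln('merging passes needed: ', passes)   { prints 7 }
  end.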

Minimizing Elapsed Time

In modern time-shared computer systems, one is generally not charged for the time one's program spends waiting for blocks of a file to be read in, as must happen during the merge sort process. However, the fact is that the elapsed time of such a sorting process is greater, often substantially so, than the time spent computing with data found in the main memory. If we are sorting really large files, where the whole operation takes hours, the elapsed time becomes important, even if we do not "pay" for the time, and we should consider how the merging process might be performed in minimum elapsed time.

       As was mentioned, it is typical that the time to read data from a disk or tape is greater than the time spent doing simple computations with that data,

Fig. 11.3. Merge-sorting a list.

such as merging. We should therefore expect that if there is only one channel over which data can be transferred into or out of main memory, then this channel will form a bottleneck; the data channel will be busy all the time, and the total elapsed time equals the time spent moving data into and out of main memory. That is, all the computation will be done almost as soon as the data becomes available, while additional data are being read or written.

       Even in this simple environment, we must exercise a bit of care to make sure that we are done in the minimum amount of time. To see what can go wrong, suppose we read the two input files f1 and f2 one block at a time, alternately. The files are organized into runs of some length much larger than the size of a block, so to merge two runs we must read many blocks from each file. However, suppose it happens that all the records in the run from file f1 happen to precede all the records from file f2. Then as we read blocks alternately, all the blocks from f2 have to remain in memory. There may not be space to hold all these blocks in main memory, and even if there is, we must, after reading all the blocks of the run, wait while we copy and write the whole run from f2.

       To avoid these problems, we consider the keys of the last records in the last blocks read from f1 and f2, say keys k1 and k2, respectively. If either run is exhausted, we naturally read next from the other file. However, if a run is not exhausted, we next read a block from f1 if k1<k2, and we read from f2 otherwise. That is, we determine which of the two runs will first have all its records currently in main memory selected, and we replenish the supply of records for that run first. If selection of records proceeds faster than reading, we know that when we have read the last block of the two runs, there cannot be more than two "blockfuls" of records left to merge; perhaps the records will be distributed over three blocks, but never more.
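
       The selection rule amounts to a one-line test. The Pascal fragment below is our sketch, not part of the text; keytype is assumed to be declared elsewhere, and the caller is assumed to keep track of whether each run has been read completely.

  function nextinput ( k1, k2: keytype;
         run1done, run2done: boolean ): integer;
  { returns 1 or 2, the file from which to read the next block; k1 and k2
    are the keys of the last records now in main memory from the runs on
    f1 and f2 }
  begin
      if run1done then nextinput := 2
      else if run2done then nextinput := 1
      else if k1 < k2 then nextinput := 1   { run 1 will run out of records first }
      else nextinput := 2
  end; { nextinput }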

Multiway Merge

If reading and writing between main and secondary memory is the bottleneck, perhaps we could save time if we had more than one data channel. Suppose, for example, that we have 2m disk units, each with its own channel. We could place m files, f1, f2, . . .,fm on m of the disk units, say organized as runs of length k. We can read m runs, one from each file, and merge them into one run of length mk. This run is placed on one of m output files g1, g2, . . . ,gm, each getting a run in turn.

       The merging process in main memory can be carried out in O(logm) steps per record if we organize the m candidate records, that is, the currently smallest unselected records from each file, into a partially ordered tree or other data structure that supports the priority queue operations INSERT and DELETEMIN in logarithmic time. To select from the priority queue the record with the smallest key, we perform DELETEMIN, and then INSERT into the priority queue the next record from the file of the winner, as a replacement for the selected record.

       If we have n records, and the length of runs is multiplied by m with each pass, then after i passes runs will be of length m^i. If m^i ≥ n, that is, after i = logm n passes, the entire list will be sorted. As logm n = log2 n / log2 m, we save by a factor of log2 m in the number of times we read each record. Moreover, if m is the number of disk units used for input files, and m are used for output, we can process data m times as fast as if we had only one disk unit for input and one for output, or 2m times as fast as if both input and output files were stored on one disk unit. Unfortunately, increasing m indefinitely does not speed processing by a factor of log m. The reason is that for large enough m, the time to do the merging in main memory, which is actually increasing as log m, will exceed the time to read or write the data. At that point, further increases in m actually increase the elapsed time, as computation in main memory becomes the bottleneck.
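
       To make the O(log m) selection step concrete, here is a self-contained Pascal sketch (ours, not from the text). The m sorted runs are held in in-memory arrays that stand in for the m input files, and the m candidate records are kept in a partially ordered tree stored in an array, exactly as a priority queue would hold them; in a real external sort, takefrom would instead read the next record of a run from its disk unit.

  program multiwaymerge;
  const
      m = 4;              { number of input runs }
      runlen = 5;         { length of each run in this small example }
      sentinel = maxint;  { stands for "run exhausted" }
  type
      heapentry = record
          key: integer;   { current candidate key }
          src: integer    { the run it came from }
      end;
  var
      run: array [1..m, 1..runlen] of integer;
      next: array [1..m] of integer;    { position of next unread record in each run }
      heap: array [1..m] of heapentry;  { the m candidate records }
      heapsize, i, j: integer;
      e: heapentry;

  function takefrom ( s: integer ): integer;
  { next key of run s, or sentinel if run s is exhausted }
  begin
      if next[s] > runlen then takefrom := sentinel
      else begin
          takefrom := run[s, next[s]];
          next[s] := next[s] + 1
      end
  end;

  procedure pushdown ( i: integer );
  { restore the partially ordered tree property below position i }
  var c: integer;
      t: heapentry;
      finished: boolean;
  begin
      finished := false;
      while (2 * i <= heapsize) and not finished do begin
          c := 2 * i;                                          { left child }
          if c < heapsize then
              if heap[c+1].key < heap[c].key then c := c + 1;  { smaller child }
          if heap[c].key < heap[i].key then begin
              t := heap[i]; heap[i] := heap[c]; heap[c] := t;
              i := c
          end
          else finished := true
      end
  end;

  begin
      { build the example: run s holds s, s+m, s+2m, ..., so each is sorted }
      for i := 1 to m do begin
          for j := 1 to runlen do run[i, j] := i + m * (j - 1);
          next[i] := 1
      end;
      { INSERT the first record of each run into the priority queue }
      heapsize := m;
      for i := 1 to m do begin
          heap[i].key := takefrom(i);
          heap[i].src := i
      end;
      for i := m div 2 downto 1 do pushdown(i);
      { repeatedly DELETEMIN, then replace the winner by the next
        record from the same run }
      for i := 1 to m * runlen do begin
          e := heap[1];
          write(e.key, ' ');   { the keys appear in sorted order, 1 through 20 }
          heap[1].key := takefrom(e.src);
          heap[1].src := e.src;
          pushdown(1)
      end;
      writeln
  end.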

Polyphase Sorting

We can perform an m-way merge sort with only m + 1 files, as an alternative to the 2m-file strategy described above. A sequence of passes is made, merging runs from m of the files into longer runs on the remaining file. The insights needed are the following:

  1. In one pass, when runs from each of m files are merged into runs of the m + 1st file, we need not use all the runs on each of the m input files. Rather, each file, when it becomes the output file, is filled with runs of a certain length. It uses some of these runs to help fill each of the other m files when it is their turn to be the output file.
  2. Each pass produces runs of a different length. Since the runs from each of the files loaded on the previous m passes contribute to the runs of the current pass, the run length on one pass is the sum of the run lengths produced on the previous m passes. (If fewer than m passes have taken place, regard passes prior to the first as having produced runs of length 1.)

       Such a merge-sorting process is called a polyphase sort. The exact calculation of the numbers of passes needed, as a function of m and n (the number of records), and calculation of the initial distribution of the runs into m files are left as exercises. However, we shall give one example here to suggest the general case.

Example 11.2. If m = 2, we start with two files f1 and f2, organized into runs of length 1. Records from f1 and f2 are merged to make runs of length 2 on the third file, f3. Just enough runs are merged to empty f1. We then merge the remaining runs of length 1 from f2 with an equal number of runs of length 2 from f3. These yield runs of length 3, and are placed on f1. Then we merge runs of length 2 on f3 with runs of length 3 on f1. These runs, of length 5, are placed on f2, which was emptied at the previous pass.

       The run length sequence 1, 1, 2, 3, 5, 8, 13, 21, . . . is the Fibonacci sequence. This sequence is generated by the recurrence relation F0 = F1 = 1, and Fi = Fi-1 + Fi-2, for i ≥ 2. Note that the ratio of consecutive Fibonacci numbers Fi+1/Fi approaches the "golden ratio" (√5 + 1)/2 = 1.618 . . . as i gets large.

       It turns out that in order to keep this process going, until the list is sorted, the initial numbers of records on f1 and f2 must be two consecutive Fibonacci numbers. For example, Fig. 11.4 shows what happens when we start with n = 34 records (34 is the Fibonacci number F8) distributed 13 on f1 and 21 on f2. (13 and 21 are F6 and F7, so the ratio F7/F6 is very close to 1.618. It is 1.615, in fact). The status of a file is represented in Fig. 11.4 as a(b), meaning it has a runs of length b.

Fig. 11.4. Example of polyphase sorting.
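
       The bookkeeping of Fig. 11.4 is easy to simulate. In the sketch below (ours, not from the text) each file is described only by a pair (number of runs, run length), the a(b) notation of the example; on each pass we merge as many runs as the smaller of the two input files holds and write them onto the file that is currently empty.

  program polyphase;
  type
      status = record
          runs, len: integer   { "runs" runs, each of length "len" }
      end;
  var
      f: array [1..3] of status;
      src1, src2, dest, t, pass: integer;
  begin
      { initial distribution: 13 and 21 (consecutive Fibonacci numbers)
        runs of length 1, with f3 empty }
      f[1].runs := 13; f[1].len := 1;
      f[2].runs := 21; f[2].len := 1;
      f[3].runs := 0;  f[3].len := 0;
      src1 := 1; src2 := 2; dest := 3;
      pass := 0;
      while f[src1].runs + f[src2].runs > 1 do begin
          pass := pass + 1;
          { let src1 be the input file with fewer runs; it will be emptied }
          if f[src1].runs > f[src2].runs then begin
              t := src1; src1 := src2; src2 := t
          end;
          f[dest].runs := f[src1].runs;
          f[dest].len := f[src1].len + f[src2].len;
          f[src2].runs := f[src2].runs - f[src1].runs;
          f[src1].runs := 0; f[src1].len := 0;
          writeln('pass ', pass, ': f1 = ', f[1].runs, '(', f[1].len, ')',
                  '  f2 = ', f[2].runs, '(', f[2].len, ')',
                  '  f3 = ', f[3].runs, '(', f[3].len, ')');
          { the file just emptied becomes the output file of the next pass }
          t := src1; src1 := dest; dest := t
      end
      { seven passes; the last line printed shows a single run of length 34 }
  end.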

When Input/Output Speed is not a Bottleneck

When reading files is the bottleneck, the next block read must be carefully chosen. As we have mentioned, the situation to avoid is one where we have to store many blocks of one run because that run happened to have records with high keys, which get selected after most or all of the records of the other run. The trick to avoiding this problem is to determine quickly which run will first exhaust those of its records currently in main memory, by comparing the last such records in each file.

       When the time taken to read data into main memory is comparable to, or less than, the time taken to process the data, it becomes even more critical that we select the input file from which to read a block carefully, since there is no hope of building a reserve of records in main memory in case the merge process suddenly starts taking more records from one run than the other. The "trick" mentioned above helps us in a variety of situations, as we shall see.

       We consider the case where merging, rather than reading or writing, is a bottleneck for two reasons.

  1. As we have seen, if we have many disk or tape units available, we may speed input/output sufficiently that the time to do the merge exceeds input time or output time.
  2. Higher speed channels may yet become commercially available.

       We therefore shall consider a simple model of the problem that might be encountered when merging becomes the bottleneck in a merge sort performed on data stored in secondary memory. Specifically, we assume that

  1. We are merging runs that are much larger than blocks.
  2. There are two input files and two output files. The input files are stored on one disk (or other device connected to main memory by a single channel) and the output files are on another similar unit with one channel.
  3. The times to
     (a) read a block,
     (b) write a block, and
     (c) select enough of the lowest-keyed records from among those of two runs presently in main memory to fill a block
     are all the same.

       Under these assumptions, we can consider a class of merging strategies where several input buffers (space to hold a block) are allocated in main memory. At all times some of these buffers will hold the unselected records from the two input runs, and one of them will be in the process of being read into from one of the input files. Two other buffers will hold output, that is, the selected records in their properly merged order. At all times, one of these buffers is in the process of being written into one of the output files and the other is being filled from records selected from the input buffers.

       A stage consists of doing the following (possibly at the same time):

  1. reading an input block into an input buffer,
  2. filling one of the output buffers with selected records, that is, records with the smallest keys among those currently held in the input buffers, and
  3. writing the other output buffer into one of the two output files being formed.

By our assumptions, (1), (2), and (3) all take the same time. For maximum efficiency, we must do them in parallel. We can do so, unless (2), the selection of records with the smallest keys, includes some of the records currently being read in. We must thus devise a strategy for selecting buffers to be read so that at the beginning of each stage the b unselected records with smallest keys are already in input buffers, where b is the number of records that fills a block or buffer.

       The conditions under which merging can be done in parallel with reading are simple. Let k1 and k2 be the highest keys among unselected records in main memory from the first and second runs, respectively. Then there must be in main memory at least b unselected records whose keys do not exceed min(k1, k2). We shall first show how to do the merge with six buffers, three used for each file, then show that four buffers suffice if we share them between the two files.

A Six-Input Buffer Scheme

Our first scheme is represented by the picture in Fig. 11.5. The two output buffers are not shown. There are three buffers for each file; each buffer has capacity b records. The shaded area represents available records, and keys are in increasing order clockwise around the circles. At all times, the total number of unselected records is 4b (unless fewer than that number of records remain from the runs being merged). Initially, we read the first two blocks from each run into buffers. As there are always 4b records available, and at most 3b can be from one file, we know there are b records at least from each file. If k1 and k2 are the largest available keys from the two runs, there must be b records with keys equal to or less than k1 and b with keys equal to or less than k2. Thus, there are b records with keys equal to or less than min(k1, k2).

Fig. 11.5. A six-input buffer merge.

       The question of which file to read from next is trivial. Usually, since two buffers will be partially full, as in Fig. 11.5, there will be only one empty buffer, and this must be filled. If it happens, as in the beginning, that each run has two completely filled and one empty buffer, pick either empty buffer to fill. Note that our proof that we could not exhaust a run [b records with keys equal to or less than min(k1, k2) exist] depended only on the fact that 4b records were present.

       As an aside, the arrows in Fig. 11.5 represent pointers to the first (lowest-keyed) available records from the two runs. In Pascal, we could represent such a pointer by two integers. The first, in the range 1..3, represents the buffer pointed to, and the second, in the range 1..b, represents the record within the buffer. Alternatively, we could let the buffers be the first, middle, and last thirds of one array and use one integer in the range 1..3b. In other languages, where pointers can point to elements of arrays, a pointer of type ^ recordtype would be preferred.

A Four-Buffer Scheme

Figure 11.6 suggests a four-buffer scheme. At the beginning of each stage, 2b records are available. Two of the input buffers are assigned to one of the files; B1 and B2 in Fig. 11.6 are assigned to file one. One of these buffers will be partially full (empty in the extreme case) and the other full. A third buffer is assigned to the other file, as B3 is assigned to file two in Fig. 11.6. It is partially full (completely full in the extreme case). The fourth buffer is uncommitted, and will be filled from one of the files during the stage.

Fig. 11.6. A four-input buffer merge.

       We shall maintain, of course, the property that allows us to merge in parallel with reading; at least b of the records in Fig. 11.6 must have keys equal to or less than min(k1, k2), where k1 and k2 are the keys of the last available records from the two files, as indicated in Fig. 11.6. We call a configuration that obeys the property safe. Initially, we read one block from each run (this is the extreme case, where B1 is empty and B3 is full in Fig. 11.6), so that the initial configuration is safe. We must, on the assumption that Fig. 11.6 is safe, show that the configuration will be safe after the next stage is complete.

       If k1<k2, we choose to fill B4 with the next block from file one, and otherwise, we fill it from file two. Suppose first that k1<k2. Since B1 and B3 in Fig. 11.6 have together exactly b records, we must, during the next stage, exhaust B1, else we would exhaust B3 and contradict the safety of Fig. 11.6. Thus after a stage the configuration looks like Fig. 11.7(a).

Fig. 11.7. Configuration after one stage.

       To see that Fig. 11.7(a) is safe, consider two cases. First, if k3, the last key in the newly-read block B4, is less than k2, then as B4 is full, there are surely b records equal to or less than min(k3, k2), and the configuration is safe. If k2 ≤ k3, then since k1 < k2 was assumed (else we would have filled B4 from file two), the b records in B2 and B3 have keys equal to or less than min(k2, k3) = k2.

       Now let us consider the case where k1 ≥ k2 in Fig. 11.6. Then we choose to read the next block from file two. Figure 11.7(b) shows the resulting situation. As in the case k1 < k2, we can argue that B1 must exhaust, which is why we show file one as having only buffer B2 in Fig. 11.7(b). The proof that Fig. 11.7(b) is safe is just like the proof for Fig. 11.7(a).

       Note that, as in the six buffer scheme, we do not read a file past the end of a run. However, if there is no need to read in a block from one of the present runs, we can read a block from the next run on that file. In this way we have the opportunity to read one block from each of the next runs, and we are ready to begin merging runs as soon as we have selected the last records of the previous run.

11.3 Storing Information in Files

In this section we consider data structures and algorithms for storing and retrieving information in externally stored files. We shall view a file as a sequence of records, where each record consists of the same sequence of fields. Fields can be either fixed length, having a predetermined number of bytes, or variable length, having an arbitrary size. Files with fixed length records are commonly used in database management systems to store highly structured data. Files with variable length records are typically used to store textual information; they are not available in Pascal. In this section we shall assume fixed-length fields; the fixed-length techniques can be modified simply to work for variable-length records.

       The operations on files we shall consider are the following.

  1. INSERT a particular record into a particular file.
  2. DELETE from a particular file all records having a designated value in each of a designated set of fields.
  3. MODIFY all records in a particular file by setting to designated values certain fields in those records that have a designated value in each of another set of fields.
  4. RETRIEVE all records having designated values in each of a designated set of fields.

Example 11.3. For example, suppose we have a file whose records consist of three fields: name, address, and phone. We might ask to retrieve all records with phone = "555-1234", to insert the record ("Fred Jones", "12 Apple St.", "555-1234"), or to delete all records with name = "Fred Jones" and address = "12 Apple St." As another example, we might wish to modify all records with name = "Fred Jones" by setting the phone field to "555-1234".

       To a great extent we can view operations on files as if the files were sets of records and the operations were those discussed in Chapters 4 and 5. There are two important differences, however. First, when we talk of files on external storage devices, we must use the cost measure discussed in Section 11.1 when evaluating strategies for organizing the files. That is, we assume that files are stored in some number of physical blocks, and the cost of an operation is the number of blocks that we must read into main memory or write from main memory onto external storage.

       The second difference is that records, being concrete data types of most programming languages, can be expected to have pointers to them, while the abstract elements of a set would not normally have "pointers" to them. In particular, database systems frequently make use of pointers to records when organizing data. The consequence of such pointers is that records frequently must be regarded as pinned; they cannot be moved around in storage because of the possibility that a pointer from some unknown place would fail to point to the record after it was moved.

       A simple way to represent pointers to records is the following. Each block has a physical address, which is the location on the external storage device of the beginning of the block. It is the job of the file system to keep track of physical addresses. One way to represent record addresses is to use the physical address of the block holding the record together with an offset, giving the number of bytes in the block preceding the beginning of the record. These physical address--offset pairs can then be stored in fields of type "pointer to record."
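
       In Pascal-like notation, such a pointer might be declared as below; this is our sketch, and the physical block address is shown simply as an integer supplied by the file system.

  type
      recordaddress = record
          block: integer;    { physical address of the block, as given by the file system }
          offset: integer    { number of bytes in the block preceding the record }
      end;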

A Simple Organization

The simplest, and also least efficient, way to implement the above file operations is to use the file reading and writing primitives such as found in Pascal. In this "organization" (which is really a "lack of organization"), records can be stored in any order. Retrieving a record with specified values in certain fields is achieved by scanning the file and looking at each record to see if it has the specified values. An insertion into a file can be performed by appending the record to the end of the file.

       For modification of records, scan the file and check each record to see if it matches the designated fields, and if so, make the required changes to the record. A deletion operation works almost the same way, but when we find a record whose fields match the values required for the deletion to take place, we must find a way to delete the record. One possibility is to shift all subsequent records one position forward in their blocks, and move the first record of each subsequent block into the last position of the previous block of the file. However, this approach will not work if records are pinned, because a pointer to the ith record in the file would then point to the i + 1st record.

       If records are pinned, we must use a somewhat different approach. We mark deleted records in some way, but we do not move records to fill their space, nor do we ever insert a new record into their space. Thus, the record becomes deleted logically from the file, but its space is still used for the file. This is necessary so that if we ever follow a pointer to a deleted record, we shall discover that the record pointed to was deleted and take some appropriate action, such as making the pointer NIL so it will not be followed again. Two ways to mark records as deleted are:

  1. Replace the record by some value that could never be the value of a "real" record, and when following a pointer, assume the record is deleted if it has that value.
  2. Let each record have a deletion bit, a single bit that is 1 in records that have been deleted and 0 otherwise, as in the declaration sketched below.
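
       The second convention can be reflected directly in the record declaration. In the sketch below (ours, not from the text) the field widths for the name-address-phone file of Example 11.3 are invented for illustration, and a boolean field stands in for the single deletion bit.

  type
      namerecord = record
          deleted: boolean;                        { the deletion bit }
          name: packed array [1..20] of char;
          address: packed array [1..30] of char;
          phone: packed array [1..12] of char
      end;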

Speeding Up File Operations

The obvious disadvantage of a sequential file is that file operations are slow. Each operation requires us to read the entire file, and some blocks may have to be rewritten as well. Fortunately, there are file organizations that allow us to access a record by reading into main memory only a small fraction of the entire file.

       To make such organizations possible, we assume each record of a file has a key, a set of fields that uniquely identifies each record. For example, the name field of the name-address-phone file might be considered a key by itself. That is, we might assume that two records with the same name field value cannot exist simultaneously in the file. Retrieval of a record, given values for its key fields, is a common operation, and one that is made especially easy by many common file organizations.

       Another element we need for fast file operations is the ability to access blocks directly, rather than running sequentially through the blocks holding a file. Many of the data structures we use for fast file operations will use pointers to the blocks themselves, which are the physical addresses of the blocks, as described above. Unfortunately, we cannot write in Pascal, or in many other languages, programs that deal with data on the level of physical blocks and their addresses; such operations are normally done by file system commands. However, we shall briefly give an informal description of how organizations that make use of direct block access work.

Hashed Files

Hashing is a common technique used to provide fast access to information stored on secondary files. The basic idea is similar to open hashing discussed in Section 4.7. We divide the records of a file among buckets, each consisting of a linked list of one or more blocks of external storage. The organization is similar to that portrayed in Fig. 4.10. There is a bucket table containing B pointers, one for each bucket. Each pointer in the bucket table is the physical address of the first block of the linked-list of blocks for that bucket.

       The buckets are numbered 0, 1, . . . , B - 1. A hash function h maps each key value into one of the integers 0 through B - 1. If x is a key, h(x) is the number of the bucket that contains the record with key x, if such a record is present at all. The blocks making up each bucket are chained together in a linked list. Thus, the header of the ith block of a bucket contains a pointer to the physical address of the i + 1st block. The last block of a bucket contains a NIL pointer in its header.

       This arrangement is illustrated in Fig. 11.8. The major difference between Figs. 11.8 and 4.10 is that here, elements stored in one block of a bucket do not need to be chained by pointers; only the blocks need to be chained.

Fig. 11.8. Hashing with buckets consisting of chained blocks.

       If the size of the bucket table is small, it can be kept in main memory. Otherwise, it can be stored sequentially on as many blocks as necessary. To look for the record with key x, we compute h(x), and find the block of the bucket table containing the pointer to the first block of bucket h(x). We then read the blocks of bucket h(x) successively, until we find a block that contains the record with key x. If we exhaust all blocks in the linked list for bucket h(x), we conclude that x is not the key of any record.
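
       Since Pascal programs cannot manipulate physical block addresses, the following self-contained sketch (ours, not from the text) simulates the organization in main memory: the chained blocks of the buckets live in an array, chain pointers are array indices, and index 0 plays the role of NIL. The constants and the tiny example data are invented.

  program hashedlookup;
  const
      B = 5;              { number of buckets }
      blocksize = 2;      { records per block; only keys are stored here }
      maxblocks = 20;
      nilptr = 0;
  type
      keytype = integer;
      block = record
          count: integer;                          { records actually present }
          key: array [1..blocksize] of keytype;
          next: integer                            { "physical address" of the next block in the bucket }
      end;
  var
      bucket: array [0..B-1] of integer;   { the bucket table }
      blocks: array [1..maxblocks] of block;
      i: integer;

  function h ( x: keytype ): integer;
  begin
      h := x mod B
  end;

  function lookup ( x: keytype ): boolean;
  { true if a record with key x is in the file }
  var p, i: integer;
      found: boolean;
  begin
      found := false;
      p := bucket[h(x)];
      while (p <> nilptr) and not found do begin
          for i := 1 to blocks[p].count do
              if blocks[p].key[i] = x then found := true;
          p := blocks[p].next
      end;
      lookup := found
  end;

  begin
      { a tiny example: keys 7 and 12 both hash to bucket 2 and share one block }
      for i := 0 to B - 1 do bucket[i] := nilptr;
      bucket[2] := 1;
      blocks[1].count := 2;
      blocks[1].key[1] := 7; blocks[1].key[2] := 12;
      blocks[1].next := nilptr;
      writeln(lookup(12));   { TRUE }
      writeln(lookup(9))     { FALSE }
  end.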

       This structure is quite efficient if the operation is one that specifies values for the fields in the key, such as retrieving the record with a specified key value or inserting a record (which, naturally, specifies the key value for that record). The average number of block accesses required for an operation that specifies the key of a record is roughly the average number of blocks in a bucket, which is n/(bB) if n is the number of records, a block holds b records, and B is the number of buckets. Thus, on the average, operations based on keys are B times faster with this organization than with the unorganized file. Unfortunately, operations not based on keys are not speeded up, as we must examine essentially all the buckets during these other operations. The only general way to speed up operations not based on keys seems to be the use of secondary indices, which are discussed at the end of this section.

       To insert a record with key value x, we first check to see if there is already a record with key x. If there is, we report an error, since we assume that the key uniquely identifies each record. If there is no record with key x, we insert the new record in the first block in the chain for bucket h(x) into which the record can fit. If the record cannot fit into any existing block in the chain for bucket h(x), we call upon the file system to find a new block into which the record is placed. This new block is then added to the end of the chain for bucket h(x).

       To delete a record with key x, we first locate the record, and then set its deletion bit. Another possible deletion strategy (which cannot be used if the records are pinned) is to replace the deleted record with the last record in the chain for h(x). If the removal of the last record makes the last block in the chain for h(x) empty, we can then return the empty block to the file system for later re-use.

       A well-designed hashed-access file organization requires only a few block accesses for each file operation. If our hash function is good, and the number of buckets is roughly equal to the number of records in the file divided by the number of records that can fit on one block, then the average bucket consists of one block. Excluding the number of block accesses to search the bucket table, a typical retrieval based on keys will then take one block access, and a typical insertion, deletion, or modification will take two block accesses. If the average number of records per bucket greatly exceeds the number that will fit on one block, we can periodically reorganize the hash table by doubling the number of buckets and splitting each bucket into two. The ideas were covered at the end of Section 4.8.

Indexed Files

Another common way to organize a file of records is to maintain the file sorted by key values. We can then search the file as we would a dictionary or telephone directory, scanning only the first name or word on each page. To facilitate the search we can create a second file, called a sparse index, which consists of pairs (x, b), where x is a key value and b is the physical address of the block in which the first record has key value x. This sparse index is maintained sorted by key values.

Example 11.4. In Fig. 11.9 we see a file and its sparse index file. We assume that three records of the main file, or three pairs of the index file, fit on one block. Only the key values, assumed to be single integers, are shown for records in the main file.

       To retrieve a record with a given key x, we first search the index file for a pair (x, b). What we actually look for is the largest z such that z ≤ x and there is a pair (z, b) in the index file. Then key x appears in block b if it is present in the main file at all.

       There are several strategies that can be used to search the index file. The simplest is linear search. We read the index file from the beginning until we encounter a pair (x, b), or until we encounter the first pair (y, b) where y > x. In the latter case, the preceding pair (z, b') must have z < x, and if the record with key x is anywhere, it is in block b'.

       Linear search is only suitable for small index files. A faster method is

Fig. 11.9. A main file and its sparse index.

binary search. Assume the index file is stored on blocks b1, b2, . . . , bn. To search for key value x, we take the middle block b[n/2] and compare x with the key value y in the first pair in that block. If x < y, we repeat the search on blocks b1, b2, . . . , b[n/2]-1. If x ≥ y, but x is less than the first key of block b[n/2]+1 (or if n = 1, so there is no such block), we use linear search to see if x matches the first component of an index pair on block b[n/2]. Otherwise, we repeat the search on blocks b[n/2]+1, b[n/2]+2, . . . , bn. With binary search we need examine only [log2(n + 1)] blocks of the index file.
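
       The search for the largest z with z ≤ x is a standard binary search. In the sketch below (ours, not from the text) the index pairs sit in a single in-memory array; in practice the same comparisons are made block by block, as just described. The key values and block numbers are invented for the example.

  program sparseindexsearch;
  const
      n = 5;   { number of index pairs }
  type
      indexpair = record
          key, block: integer
      end;
  var
      index: array [1..n] of indexpair;
      i: integer;

  function findblock ( x: integer ): integer;
  { binary search for the pair (z, b) with the largest z such that z <= x;
    returns b, or 0 if x is smaller than every key in the index }
  var lo, hi, mid, answer: integer;
  begin
      lo := 1; hi := n; answer := 0;
      while lo <= hi do begin
          mid := (lo + hi) div 2;
          if index[mid].key <= x then begin
              answer := index[mid].block;   { a candidate; look for a larger z }
              lo := mid + 1
          end
          else hi := mid - 1
      end;
      findblock := answer
  end;

  begin
      { the first key on each of five main-file blocks, numbered 1..5 }
      for i := 1 to n do index[i].block := i;
      index[1].key := 3;  index[2].key := 10; index[3].key := 25;
      index[4].key := 40; index[5].key := 61;
      writeln(findblock(26));   { prints 3: key 26, if present, is in block 3 }
      writeln(findblock(1))     { prints 0: 1 is smaller than every key }
  end.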

       To initialize an indexed file, we sort the records by their key values, and then distribute the records to blocks, in that order. We may choose to pack as many as will fit into each block. Alternatively, we may prefer to leave space for a few extra records that may be inserted later. The advantage is that then insertions are less likely to overflow the block into which the insertion takes place, with the resultant requirement that adjacent blocks be accessed. After partitioning the records into blocks in one of these ways, we create the index file by scanning each block in turn and finding the first key on each block. Like the main file, some room for growth may be left on the blocks holding the index file.

       Suppose we have a sorted file of records that are stored on blocks B1, B2, . . . ,Bm. To insert a new record into this sorted file, we use the index file to determine which block Bi should contain the new record. If the new record will fit in Bi, we place it there, in the correct sorted order. We then adjust the index file, if the new record becomes the first record on Bi.

       If the new record cannot fit in Bi, a variety of strategies are possible. Perhaps the simplest is to go to block Bi+1, which can be found through the index file, to see if the last record of Bi can be moved to the beginning of Bi+1. If so, this last record is moved to Bi+1, and the new record can then be inserted in the proper position in Bi. The index file entry for Bi+1, and possibly for Bi, must be adjusted appropriately.

       If Bi+1 is also full, or if Bi is the last block (i = m), then a new block is obtained from the file system. The new record is inserted in this new block, and the new block is to follow block Bi in the order. We now use this same procedure to insert a record for the new block in the index file.

Unsorted Files with a Dense Index

Another way to organize a file of records is to maintain the file in random order and have another file, called a dense index, to help locate records. The dense index consists of pairs (x, p), where p is a pointer to the record with key x in the main file. These pairs are sorted by key value, so a structure like the sparse index mentioned above, or the B-tree mentioned in the next section, could be used to help find keys in the dense index.

       With this organization we use the dense index to find the location in the main file of a record with a given key. To insert a new record, we keep track of the last block of the main file and insert the new record there, getting a new block from the file system if the last block is full. We also insert a pointer to that record in the dense index file. To delete a record, we simply set the deletion bit in the record and delete the corresponding entry in the dense index (perhaps by setting a deletion bit there also).

Secondary Indices

While the hashed and indexed structures speed up operations based on keys substantially, none of them help when the operation involves a search for records given values for fields other than the key fields. If we wish to find the records with designated values in some set of fields F1, . . . , Fk, we need a secondary index on those fields. A secondary index is a file consisting of pairs (v, p), where v is a list of values, one for each of the fields F1, . . . , Fk, and p is a pointer to a record. There may be more than one pair with a given v, and each associated pointer is intended to indicate a record of the main file that has v as the list of values for the fields F1, . . . , Fk.

       To retrieve records given values for the fields F1, . . . ,Fk, we look in the secondary index for a record or records with that list of values. The secondary index itself can be organized in any of the ways discussed for the organization of files by key value. That is, we pretend that v is a key for (v, p).

       For example, a hashed organization does not really depend on keys being unique, although if there were very many records with the same "key" value, then records might distribute themselves into buckets quite nonuniformly, with the effect that hashing would not speed up access very much. In the extreme, say there were only two values for the fields of a secondary index. Then all but two buckets would be empty, and the hash table would only speed up operations by a factor of two at most, no matter how many buckets there were. Similarly, a sparse index does not require that keys be unique, but if they are not, then there may be two or more blocks of the main file that have the same lowest "key" value, and all such blocks will be searched when we are looking for records with that value.

       With either the hashed or sparse index organization for our secondary index file, we may wish to save space by bunching all records with the same value. That is, the pairs (v, p1), (v, p2), . . . ,(v, pm) can be replaced by v followed by the list p1, p2, . . . ,pm.

       One might naturally wonder whether the best response time to random operations would not be obtained if we created a secondary index for each field, or even for all subsets of the fields. Unfortunately, there is a penalty to be paid for each secondary index we choose to create. First, there is the space needed to store the secondary index, and that may or may not be a problem, depending on whether space is at a premium.

       In addition, each secondary index we create slows down all insertions and all deletions. The reason is that when we insert a record, we must also insert an entry for that record into each secondary index, so the secondary indices will continue to represent the file accurately. Updating a secondary index takes at least two block accesses, since we must read and write one block. However, it may take considerably more than two block accesses, since we have to find that block, and any organization we use for the secondary index file will require a few extra accesses, on the average, to find any block. Similar remarks hold for each deletion. The conclusion we draw is that selection of secondary indices requires judgment, as we must determine which sets of fields will be specified in operations sufficiently frequently that the time to be saved by having that secondary index available more than balances the cost of updating that index on each insertion and deletion.

11.4 External Search Trees

The tree data structures presented in Chapter 5 to represent sets can also be used to represent external files. The B-tree, a generalization of the 2-3 tree discussed in Chapter 5, is particularly well suited for external storage, and it has become a standard organization for indices in database systems. This section presents the basic techniques for retrieving, inserting, and deleting information in B-trees.

Multiway Search Trees

An m-ary search tree is a generalization of a binary search tree in which each node has at most m children. Generalizing the binary search tree property, we require that if n1 and n2 are two children of some node, and n1 is to the left of n2, then the elements descending from n1 are all less than the elements descending from n2. The operations of MEMBER, INSERT, and DELETE for an m-ary search tree are implemented by a natural generalization of those operations for binary search trees, as discussed in Section 5.1.

       However, we are interested here in the storage of records in files, where the files are stored in blocks of external storage. The correct adaptation of the multiway tree idea is to think of the nodes as physical blocks. An interior node holds pointers to its m children and also holds the m-1 key values that separate the descendants of the children. Leaf nodes are also blocks; these blocks hold the records of the file.

       If we used a binary search tree of n nodes to represent an externally stored file, then it would require log2 n block accesses to retrieve a record from the file, on the average. If instead, we use an m-ary search tree to represent the file, it would take an average of only logm n block accesses to retrieve a record. For n = 10^6, the binary search tree would require about 20 block accesses, while a 128-way search tree would take only 3 block accesses.

       We cannot make m arbitrarily large, because the larger m is, the larger the block size must be. Moreover, it takes longer to read and process a larger block, so there is an optimum value for m to minimize the amount of time needed to search an external m-ary search tree. In practice a value close to the minimum is obtained for a broad range of m's. (See Exercise 11.18).

B-trees

A B-tree is a special kind of balanced m-ary tree that allows us to retrieve, insert, and delete records from an external file with a guaranteed worst-case performance. It is a generalization of the 2-3 tree discussed in Section 5.4. Formally, a B-tree of order m is an m-ary search tree with the following properties:

  1. The root is either a leaf or has at least two children.
  2. Each node, except for the root and the leaves, has between [m/2] and m children.
  3. Each path from the root to a leaf has the same length.

Note that every 2-3 tree is a B-tree of order 3. Figure 11.10 shows a B-tree of order 5, in which we assume that at most three records fit in a leaf block.

Fig. 11.10. B-tree of order 5.

       We can view a B-tree as a hierarchical index in which each node occupies a block in external storage. The root of the B-tree is the first level index. Each non-leaf node in the B-tree is of the form

(p0, k1, p1, k2, p2, . . . ,kn, pn)

where pi is a pointer to the ith child of the node, 0 ≤ i ≤ n, and ki is a key, 1 ≤ i ≤ n. The keys within a node are in sorted order, so k1 < k2 < . . . < kn. All keys in the subtree pointed to by p0 are less than k1. For 1 ≤ i < n, all keys in the subtree pointed to by pi have values greater than or equal to ki and less than ki+1. All keys in the subtree pointed to by pn are greater than or equal to kn.

       There are several ways to organize the leaves. Here we shall assume that the main file is stored only in the leaves. Each leaf is assumed to occupy one block.

Retrieval

To retrieve a record r with key value x, we trace the path from the root to the leaf that contains r, if it exists in the file. We trace this path by successively fetching interior nodes (p0, k1, p1, . . . , kn, pn) from external storage into main memory and finding the position of x relative to the keys k1, k2, . . . , kn. If ki ≤ x < ki+1, we next fetch the node pointed to by pi and repeat the process. If x < k1, we use p0 to fetch the next node; if x ≥ kn, we use pn. When this process brings us to a leaf, we search for the record with key value x. If the number of entries in a node is small, we can use linear search within the node; otherwise, it would pay to use binary search.
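
       The retrieval path can be sketched in Pascal if we let ordinary pointers stand in for physical block addresses. The declarations below are ours, not from the text, and assume m and keytype are declared elsewhere (for instance, m = 5 and keytype = integer to match Fig. 11.10); the arrays are allocated one position larger than strictly needed, which keeps the index arithmetic simple.

  type
      nodeptr = ^btreenode;
      btreenode = record
          isleaf: boolean;
          nkeys: integer;                   { n in the text; at most m - 1 }
          key: array [1..m] of keytype;     { k1 < k2 < ... < kn }
          child: array [0..m] of nodeptr    { p0, p1, ..., pn }
          { a leaf block would instead hold records of the main file }
      end;

  function findleaf ( root: nodeptr; x: keytype ): nodeptr;
  { returns the leaf in which a record with key x must lie, if it exists }
  var p: nodeptr;
      i: integer;
  begin
      p := root;
      while not p^.isleaf do begin
          { find the largest i with ki <= x; i = 0 means x < k1 }
          i := 0;
          while (i < p^.nkeys) and (p^.key[i+1] <= x) do
              i := i + 1;
          p := p^.child[i]
      end;
      findleaf := p
  end; { findleaf }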

Insertion

Insertion into a B-tree is similar to insertion into a 2-3 tree. To insert a record r with key value x into a B-tree, we first apply the lookup procedure to locate the leaf L at which r should belong. If there is room for r in L, we insert r into L in the proper sorted order. In this case no modifications need to be made to the ancestors of L.

       If there is no room for r in L, we ask the file system for a new block L' and move the last half of the records from L to L', inserting r into its proper place in L or L'. Let node P be the parent of node L. P is known, since the lookup procedure traced a path from the root to L through P. We now apply the insertion procedure recursively to place in P a key k' and a pointer l' to L'; k' and l' are inserted immediately after the key and pointer for L. The value of k' is the smallest key value in L'.

       If P already has m pointers, insertion of k' and l' into P will cause P to be split and require an insertion of a key and pointer into the parent of P. The effects of this insertion can ripple up through the ancestors of node L back to the root, along the path that was traced by the original lookup procedure. It may even be necessary to split the root, in which case we create a new root with the two halves of the old root as its two children. This is the only situation in which a node may have fewer than m/2 children.

Deletion

To delete a record r with key value x, we first find the leaf L containing r. We then remove r from L, if it exists. If r is the first record in L, we then go to P, the parent of L, to set the key value in P's entry for L to be the new first key value of L. However, if L is the first child of P, the first key of L is not recorded in P, but rather will appear in one of the ancestors of P, specifically, the lowest ancestor A such that L is not the leftmost descendant of A. Therefore, we must propagate the change in the lowest key value of L backwards along the path from the root to L.

       If L becomes empty after deletion, we give L back to the file system. We now adjust the keys and pointers in P to reflect the removal of L. If the number of children of P is now less than m/2, we examine the node P′ immediately to the left (or the right) of P at the same level in the tree. If P′ has at least ⌈m/2⌉ + 1 children, we distribute the keys and pointers in P and P′ evenly between P and P′, keeping the sorted order, of course, so that both nodes will have at least ⌈m/2⌉ children. We then modify the key values for P and P′ in the parent of P, and, if necessary, recursively ripple the effects of this change to as many ancestors of P as are affected.

       If P′ has exactly ⌈m/2⌉ children, we combine P and P′ into a single node with 2⌈m/2⌉ - 1 children (this is at most m children). We must then remove the key and pointer to P′ from the parent of P′. This deletion can be done with a recursive application of the deletion procedure.
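
       The combining step can be sketched in Pascal as follows, using the earlier declarations and assuming, for illustration, that P′ (called Q below) is the sibling immediately to the right of P. The caller remains responsible for deleting the entry for P′ from the parent, the recursive step just mentioned, and for returning the block of P′ to the file system.

    procedure combine(var P, Q: node; sepkey: keytype);
    { fold the children of Q, the right sibling of P, into P.  P has fallen
      below ceil(m/2) children and Q has exactly ceil(m/2), so the result has
      at most m children.  sepkey is the key for Q in their common parent; it
      becomes the key for Q's first child inside P. }
    var i: integer;
    begin
        P.child[P.nchildren + 1] := Q.child[1];
        P.key[P.nchildren + 1] := sepkey;
        for i := 2 to Q.nchildren do begin
            P.child[P.nchildren + i] := Q.child[i];
            P.key[P.nchildren + i] := Q.key[i]
        end;
        P.nchildren := P.nchildren + Q.nchildren
    end;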

       If the effects of the deletion ripple all the way back to the root, we may have to combine the only two children of the root. In this case the resulting combined node becomes the new root, and the old root can be returned to the file system. The height of the B-tree has now been reduced by one.

Example 11.5. Consider the B-tree of order 5 in Fig. 11.10. Inserting the record with key value 23 into this tree produces the B-tree in Fig. 11.11. To insert 23, we must split the block containing 22, 23, 24, and 26, since we assume that at most three records fit in one block. The two smaller records stay in that block, and the two larger are placed in a new block. A pointer-key pair for the new node must be inserted into the parent, which then splits because it cannot hold six pointers. The root receives the pointer-key pair for the new node, but the root does not split because it has excess capacity.

       Removing record 10 from the B-tree of Fig. 11.11 results in the B-tree of Fig. 11.12. Here, the block containing 10 is discarded. Its parent now has only two children, and the right sibling of the parent has the minimum number, three. Thus we combine the parent with its sibling, making one node with five children.

Fig. 11.11. B-tree after insertion.

Fig. 11.12. B-tree after deletion.

Time Analysis of B-tree Operations

Suppose we have a file with n records organized into a B-tree of order m. If each leaf contains b records on the average, then the tree has about ⌈n/b⌉ leaves. The longest possible paths in such a tree will occur if each interior node has the fewest children possible, that is, m/2 children. In this case there will be about 2⌈n/b⌉/m parents of leaves, 4⌈n/b⌉/m^2 parents of parents of leaves, and so on.

       If there are j nodes along the path from the root to a leaf, then 2^(j-1)⌈n/b⌉/m^(j-1) ≥ 1, or else there would be fewer than one node at the root's level. Therefore, ⌈n/b⌉ ≥ (m/2)^(j-1), and j ≤ 1 + log_{m/2}⌈n/b⌉. For example, if n = 10^6, b = 10, and m = 100, then j ≤ 3.9. Note that b is not the maximum number of records we can put in a block, but an average or expected number. However, by redistributing records among neighboring blocks whenever one gets less than half full, we can ensure that b is at least half the maximum value. Also note that we have assumed in the above that each interior node has the minimum possible number of children. In practice, the average interior node will have more than the minimum, and the above analysis is therefore conservative.
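
       The bound on j is easy to evaluate mechanically. The short program below merely carries out this arithmetic for the sample figures n = 10^6, b = 10, and m = 100 used above; it assumes an integer type of at least 32 bits.

    program pathbound(output);
    { evaluates the bound j <= 1 + log_{m/2} ceil(n/b) derived above }
    const
        n = 1000000;    { records in the file }
        b = 10;         { average records per leaf block }
        m = 100;        { order of the B-tree }
    var
        leaves: integer;
        j: real;
    begin
        leaves := (n + b - 1) div b;            { ceil(n/b) = 100000 leaves }
        j := 1.0 + ln(leaves) / ln(m / 2);      { log to base m/2, by change of base }
        writeln('nodes on a root-to-leaf path: at most ', j:4:1)    { prints about 3.9 }
    end.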

       For an insertion or deletion, j block accesses are needed to locate the appropriate leaf. The exact number of additional block accesses that are needed to accomplish the insertion or deletion, and to ripple its effects through the tree, is difficult to compute. Most of the time only one block, the leaf storing the record of interest, needs to be rewritten. Thus, 2 + log_{m/2}⌈n/b⌉ can be taken as the approximate number of block accesses for an insertion or deletion.

Comparison of Methods

We have discussed hashing, sparse indices, and B-trees as possible methods for organizing external files. It is interesting to compare, for each method, the number of block accesses involved in a file operation.

       Hashing is frequently the fastest of the three methods, requiring two block accesses on average for each operation (excluding the block accesses required to search the bucket table), if the number of buckets is sufficiently large that the typical bucket uses only one block. With hashing, however, we cannot easily access the records in sorted order.

       A sparse index on a file of n records allows the file operations to be done in about 2 + log(n/(bb′)) block accesses using binary search; here b is the number of records that fit on a block, and b′ is the number of key-pointer pairs that fit on a block of the index file. B-trees allow file operations in about 2 + log_{m/2}⌈n/b⌉ block accesses, where m, the maximum degree of the interior nodes, is approximately b′. Both sparse indices and B-trees allow records to be accessed in sorted order.
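
       To make the comparison concrete, both estimates can be evaluated for the same illustrative figures used earlier: n = 10^6 records, b = 10 records per data block, b′ = 100 key-pointer pairs per index block, and m = b′. The program below is only a sketch of that arithmetic; the parameter values are assumptions for illustration, not measurements, and the logarithm in the sparse-index estimate is taken to base 2, as is appropriate for binary search.

    program accesscount(output);
    { evaluates 2 + log_2(n/(b*b')) for a sparse index and
      2 + log_{m/2} ceil(n/b) for a B-tree; assumes 32-bit integers }
    const
        n = 1000000;       { records in the file }
        b = 10;            { records per data block }
        bprime = 100;      { key-pointer pairs per index block }
        m = 100;           { B-tree order, roughly bprime }
    var
        sparse, btree: real;
    begin
        sparse := 2.0 + ln(n / (b * bprime)) / ln(2.0);
        btree := 2.0 + ln((n + b - 1) div b) / ln(m / 2);
        writeln('sparse index: about ', sparse:4:1, ' block accesses');   { about 12.0 }
        writeln('B-tree:       about ', btree:4:1, ' block accesses')     { about 4.9 }
    end.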

       All of these methods are remarkably good compared to the obvious sequential scan of a file. The timing differences among them, however, are small and difficult to determine analytically, especially considering that the relevant parameters such as the expected file length and the occupancy rates of blocks are hard to predict in advance.

       It appears that the B-tree is becoming increasingly popular as a means of accessing files in database systems. Part of the reason lies in its ability to handle queries asking for records with keys in a certain range (which benefit from the fact that the records appear in sorted order in the main file). The sparse index also handles such queries efficiently, but is almost sure to be less efficient than the B-tree. Intuitively, the reason B-trees are superior to sparse indices is that we can view a B-tree as a sparse index on a sparse index on a sparse index, and so on. (Rarely, however, do we need more than three levels of indices.)

       B-trees also perform relatively well when used as secondary indices, where "keys" do not really define a unique record. Even if the records with a given value for the designated fields of a secondary index extend over many blocks, we can read them all with a number of block accesses that is just equal to the number of blocks holding these records plus the number of their ancestors in the B-tree. In comparison, if these records plus another group of similar size happen to hash to the same bucket, then retrieval of either group from a hash table would require a number of block accesses about double the number of blocks on which either group would fit. There are possibly other reasons for favoring the B-tree, such as their performance when several processes are accessing the structure simultaneously, that are beyond the scope of this book.

Exercises

11.1 Write a program concatenate that takes a sequence of file names as arguments and writes the contents of the files in turn onto the standard output, thereby concatenating the files.
11.2 Write a program include that copies its input to its output except when it encounters a line of the form #include filename, in which case it is to replace this line with the contents of the named file. Note that included files may also contain #include statements.
11.3 How does your program for Exercise 11.2 behave when a file includes itself?
11.4 Write a program compare that will compare two files record-by-record to determine whether the two files are identical.
*11.5 Rewrite the file comparison program of Exercise 11.4 using the LCS algorithm of Section 5.6 to find the longest common subsequence of records in both files.
11.6 Write a program find that takes two arguments consisting of a pattern string and a file name, and prints all lines of the file containing the pattern string as a substring. For example, if the pattern string is "ufa" and the file is a word list, then find prints all words containing the trigram "ufa."
11.7 Write a program that reads a file and writes on its standard output the records of the file in sorted order.
11.8 What are the primitives Pascal provides for dealing with external files? How would you improve them?
*11.9 Suppose we use a three-file polyphase sort, where at the ith phase we create a file with rᵢ runs of length lᵢ. At the nth phase we want one run on one of the files and none on the other two. Explain why each of the following must be true:
  1. lᵢ = lᵢ₋₁ + lᵢ₋₂ for i ≥ 1, where l₀ and l₋₁ are taken to be the lengths of runs on the two initially occupied files.

  2. rᵢ = rᵢ₋₂ - rᵢ₋₁ (or equivalently, rᵢ₋₂ = rᵢ₋₁ + rᵢ for i ≥ 1), where r₀ and r₋₁ are the number of runs on the two initial files.

  3. rₙ = rₙ₋₁ = 1, and therefore, rₙ, rₙ₋₁, . . . , r₁ forms a Fibonacci sequence.
*11.10 What additional condition must be added to those of Exercise 11.9 to make a polyphase sort possible:
  1. with initial runs of length one (i.e., l₀ = l₋₁ = 1)

  2. running for k phases, but with initial runs of length other than one allowed.
       Hint. Consider a few examples, like lₙ = 50, lₙ₋₁ = 31, or lₙ = 50, lₙ₋₁ = 32.
**11.11 Generalize Exercises 11.9 and 11.10 to polyphase sorts with more than three files.
**11.12 Show that:
  1. Any external sorting algorithm that uses only one tape as external storage must take Ω(n^2) time to sort n records.

  2. O(n log n) time suffices if there are two tapes to use as external storage.
11.13 Suppose we have an external file of directed arcs x → y that form a directed acyclic graph. Assume that there is not enough space in internal memory to hold the entire set of vertices or edges at one time.
  1. Write an external topological sort program that prints out a linear ordering of the vertices such that if x → y is a directed arc, then vertex x appears before vertex y in the linear ordering.

  2. What is the time and space complexity of your program as a function of the number of block accesses?

  3. What does your program do if the directed graph is cyclic?

  4. What is the minimum number of block accesses needed to topologically sort an externally stored dag?
11.14 Suppose we have a file of one million records, where each record takes 100 bytes. Blocks are 1000 bytes long, and a pointer to a block takes 4 bytes. Devise a hashed organization for this file. How many blocks are needed for the bucket table and the buckets?
11.15 Devise a B-tree organization for the file of Exercise 11.14.
11.16 Write programs to implement the operations RETRIEVE, INSERT, DELETE, and MODIFY on
  1. hashed files,

  2. indexed files,

  3. B-tree files.

11.17 Write a program to find the kth largest element in
  1. a sparse-indexed file

  2. a B-tree file
11.18 Assume that it takes a + bm milliseconds to read a block containing a node of an m-ary search tree. Assume that it takes c + d log_2 m milliseconds to process each node in internal memory. If there are n nodes in the tree, we need to read about log_m n nodes to locate a given record. Therefore, the total time taken to find a given record in the tree is

(log_m n)(a + bm + c + d log_2 m) = (log_2 n)((a + c + bm)/log_2 m + d)

milliseconds. Make reasonable estimates for the values of a, b, c, and d and plot this quantity as a function of m. For what value of m is the minimum attained?
*11.19 A B*-tree is a B-tree in which each interior node is at least 2/3 full (rather than just 1/2 full). Devise an insertion scheme for B*-trees that delays splitting interior nodes until two sibling nodes are full. The two full nodes can then be divided into three, each 2/3 full. What are the advantages and disadvantages of B*-trees compared with B-trees?
*11.20 When the key of a record is a string of characters, we can save space by storing only a prefix of the key as the key separator in each interior node of the B-tree. For example, "cat" and "dog" could be separated by the prefix "d" or "do" of "dog." Devise a B-tree insertion algorithm that uses prefix key separators that at all times are as short as possible.
*11.21 Suppose that the operations on a certain file are insertions and deletions a fraction p of the time, and the remaining 1 - p of the time are retrievals in which exactly one field is specified. There are k fields in records, and a retrieval specifies the ith field with probability qᵢ. Assume that a retrieval takes a milliseconds if there is no secondary index for the specified field, and b milliseconds if the field has a secondary index. Also assume that an insertion or deletion takes c + sd milliseconds, where s is the number of secondary indices. Determine, as a function of a, b, c, d, p, and the qᵢ's, which secondary indices should be created for the file in order that the average time per operation be minimized.
*11.22 Suppose that keys are of a type that can be linearly ordered, such as real numbers, and that we know the probability distribution with which keys of given values will appear in the file. We could use this knowledge to outperform binary search when looking for a key in a sparse index. One scheme, called interpolation search, uses this statistical information to predict where, in the range of index blocks Bᵢ, . . . , Bⱼ to which the search has been limited, a key x is most likely to lie. Give
  1. an algorithm to take advantage of statistical knowledge in this way, and

  2. a proof that O(log log n) block accesses suffice, on the average, to find a key.
11.23 Suppose we have an external file of records, each consisting of an edge of a graph G and a cost associated with that edge.
  1. Write a program to construct a minimum-cost spanning tree for G, assuming that there is enough memory to store all the vertices of G in core but not all the edges.

  2. What is the time complexity of your program as a function of the number of vertices and edges?

       Hint. One approach to this problem is to maintain a forest of currently connected components in core. Each edge is read and processed as follows: If the next edge has ends in two different components, add the edge and merge the components. If the edge creates a cycle in an existing component, add the edge and remove the highest cost edge from that cycle (which may be the current edge). This approach is similar to Kruskal's algorithm but does not require the edges to be sorted, an important consideration in this problem.

11.24 Suppose we have a file containing a sequence of positive and negative numbers a₁, a₂, . . . , aₙ. Write an O(n) program to find a contiguous subsequence aᵢ, aᵢ₊₁, . . . , aⱼ that has the largest sum aᵢ + aᵢ₊₁ + ⋅⋅⋅ + aⱼ of any such subsequence.

Bibliographic Notes

For additional material on external sorting see Knuth [1973]. Further material on external data structures and their use in database systems can be found there and in Ullman [1982] and Wiederhold [1982]. Polyphase sorting is discussed by Shell [1971]. The six-buffer merging scheme in Section 11.2 is from Friend [1956] and the four-buffer scheme from Knuth [1973].

       Secondary index selection, of which Exercise 11.21 is a simplification, is discussed by Lum and Ling [1970] and Schkolnick [1975]. B-trees originated with Bayer and McCreight [1972]. Comer [1979] surveys the many variations, and Gudes and Tsur [1980] discuss their performance in practice.

       Information about Exercise 11.12, one- and two-tape sorting, can be found in Floyd and Smith [1973]. Exercise 11.22 on interpolation search is discussed in detail by Yao and Yao [1976] and Perl, Itai, and Avni [1978].

       An elegant implementation of the approach suggested in Exercise 11.23 to the external minimum-cost spanning tree problem was devised by V. A. Vyssotsky around 1960 (unpublished). Exercise 11.24 is due to M. I. Shamos.

It is tempting to assume that if (1) and (2) take the same time, then selection could never catch up with reading; if the whole block were not yet read, we would select from the first records of the block, those that had the lower keys, anyway. However, the nature of reading from disks is that a long period elapses before the block is found and anything at all is read. Thus our only safe assumption is that nothing of the block being read in a stage is available for selection during that stage.

If these are not the first runs from each file, then this initialization can be done after the previous runs were read and the last 4b records from these runs are being merged.

This strategy is the simplest of a number of responses that can be made to the situation where a block has to be split. Some other choices, providing higher average occupancy of blocks at the cost of extra work with each insertion, are mentioned in the exercises.

We can use a variety of strategies to prevent leaf blocks from ever becoming completely empty. In particular, we describe below a scheme for preventing interior nodes from getting less than half full, and this technique can be applied to the leaves as well, with a value of m equal to the largest number of records that will fit in one block.
