In this chapter we consider techniques to sort large files. The files are assumed to be so large that the whole file cannot be contained in the internal memory of a computer, making an internal sort impossible. Before discussing methods available for external sorting it is necessary first to study the characteristics of the external storage devices which can be used to accomplish the sort. External storage devices may broadly be categorized as either sequential access (e.g., tapes) or direct access (e.g., drums and disks). Section 8.1 presents a brief study of the properties of these devices. In sections 8.2 and 8.3 we study sorting methods which make the best use of these external devices.

Unless otherwise stated we will make the following assumptions about our tape drives:

(i) Tapes can be written and read in the forward direction only.

As an example of direct access external storage, we consider disks. As in the case of tape storage, we have here two distinct components: (1) the disk module (or simply disk or disk pack) on which information is stored (this corresponds to a reel of tape in the case of tape storage) and (2) the disk drive (corresponding to the tape drive) which performs the function of reading and writing information onto disks. Like tapes, disks can be removed from or mounted onto a disk drive. A disk pack consists of several platters that are similar to phonograph records. The number of platters per pack varies and typically is about 6. Figure 8.4 shows a disk pack with 6 platters. Each platter has two surfaces on which information can be recorded. The outer surfaces of the top and bottom platters are not used. This gives the disk of figure 8.4 a total of 10 surfaces on which information may be recorded. A disk drive consists of a spindle on which a disk may be mounted and a set of read/write heads. There is one read/write head for each surface. During a read/write the heads are held stationary over the position of the platter where the read/write is to be performed, while the disk itself rotates at high speeds (speeds of 2000-3000 rpm are fairly common). Thus, this device will read/write in concentric circles on each surface. The area that can be read from or written onto by a single stationary head is referred to as a *track*. Tracks are thus concentric circles, and each time the disk completes a revolution an entire track passes a read/write head. There may be from 100 to 1000 tracks on each surface of a platter. The collection of tracks simultaneously under a read/write head on the surfaces of all the platters is called a *cylinder*. Tracks are divided into sectors. A *sector* is the smallest addressable segment of a track. Information is recorded along the tracks of a surface in blocks. 
In order to use a disk, one must specify the track or cylinder number, the sector number which is the start of the block and also the surface. The read/write head assembly is first positioned to the right cylinder. Before read/write can commence, one has to wait for the right sector to come beneath the read/write head. Once this has happened, data transmission can take place. Hence, there are three factors contributing to input/output time for disks:

(i) *Seek time:* time taken to position the read/write heads over the correct cylinder.

(ii) *Latency time:* time until the right sector of the track is under the read/write head.

(iii) *Transmission time:* time to transmit the block of data to/from the disk.
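
As a rough illustration, the three contributions can be combined into an expected access time; the function and its parameters below are illustrative assumptions, not from the text. The transmission time of a block is taken to be the fraction of a revolution the block occupies on its track.

```python
def disk_io_time(seek_ms, rpm, bytes_per_track, block_bytes):
    """Expected time (ms) to read or write one block:
    seek + average rotational latency + transmission (illustrative)."""
    rev_ms = 60_000.0 / rpm                  # time for one full revolution
    latency_ms = rev_ms / 2                  # on average, half a revolution passes first
    transmit_ms = rev_ms * block_bytes / bytes_per_track
    return seek_ms + latency_ms + transmit_ms
```

At 3000 rpm a revolution takes 20 ms, so a block filling a whole track costs about 10 ms of latency plus 20 ms of transmission on top of the seek.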

The most popular method for sorting on external storage devices is merge sort. This method consists of essentially two distinct phases. First, segments of the input file are sorted using a good internal sort method. These sorted segments, known as *runs*, are written out onto external storage as they are generated. Second, the runs generated in phase one are merged together following the merge tree pattern of Example 7.4, until only one run is left. Because the merge algorithm of section 7.5 requires only the leading records of the two runs being merged to be present in memory at one time, it is possible to merge large runs together. It is more difficult to adapt the other methods considered in Chapter 7 to external sorting. Let us look at an example to illustrate the basic external merge sort process and analyze the various contributions to the overall computing time. A file containing 4500 records, *A*_{1}, . . . ,*A*_{4500}, is to be sorted using a computer with an internal memory capable of sorting at most 750 records. The input file is maintained on disk and has a block length of 250 records. We have available another disk which may be used as a scratch pad. The input disk is not to be written on. One way to accomplish the sort using the general procedure outlined above is to:

1) Internally sort three blocks (750 records) at a time to obtain six runs *R*_{1}, ..., *R*_{6}, writing each run onto the scratch disk as it is generated.

2) Merge the six runs in pairs to obtain three runs of 1500 records each.

3) Merge two of the runs of 1500 records each to obtain one run of 3000 records.

4) Merge this run of 3000 records with the remaining run of 1500 records to obtain the sorted file of 4500 records.

*t*_{s} = maximum seek time

*t*_{l} = maximum latency time

*t*_{rw} = time to read or write one block of 250 records

*t*_{IO} = *t*_{s} + *t*_{l} + *t*_{rw}

*t*_{IS} = time to internally sort 750 records

*n t*_{m} = time to merge *n* records from input buffers to output buffer

Operation                                               Time
------------------------------------------------------  ------------------------------------
1) read 18 blocks of input, 18t_{IO}; internally
   sort, 6t_{IS}; write 18 blocks, 18t_{IO}             36 t_{IO} + 6 t_{IS}

2) merge runs 1-6 in pairs                              36 t_{IO} + 4500 t_{m}

3) merge two runs of 1500 records each, 12 blocks       24 t_{IO} + 3000 t_{m}

4) merge one run of 3000 records with one run
   of 1500 records                                      36 t_{IO} + 4500 t_{m}
------------------------------------------------------  ------------------------------------
Total Time                                              132 t_{IO} + 12000 t_{m} + 6 t_{IS}
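
The column totals can be checked mechanically; the short Python tally below simply re-adds the per-step coefficients of *t*_{IO}, *t*_{m} and *t*_{IS} from the table above.

```python
# Coefficients of (t_IO, t_m, t_IS) contributed by each step of the example.
steps = [
    (18 + 18, 0, 6),   # 1) read 18 blocks, internally sort 6 segments, write 18 blocks
    (36, 4500, 0),     # 2) merge runs 1-6 in pairs (one full pass over 4500 records)
    (24, 3000, 0),     # 3) merge two runs of 1500 records (12 blocks in, 12 out)
    (36, 4500, 0),     # 4) merge the run of 3000 with the run of 1500
]
total = tuple(sum(col) for col in zip(*steps))   # (132, 12000, 6)
```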

The 2-way merge algorithm of Section 7.5 is almost identical to the merge procedure just described (figure 8.6). In general, if we started with *m* runs, the merge tree corresponding to figure 8.6 would have ⌈log_{2}*m*⌉ + 1 levels, for a total of ⌈log_{2}*m*⌉ passes over the data file. The number of passes over the data can be reduced by using a higher order merge, i.e., a *k*-way merge for *k* ≥ 2. In this case we would simultaneously merge *k* runs together. Figure 8.7 illustrates a 4-way merge on 16 runs. The number of passes over the data is now 2, versus 4 passes in the case of a 2-way merge. In general, a *k*-way merge on *m* runs requires at most ⌈log_{k}*m*⌉ passes over the data. Thus, the input/output time may be reduced by using a higher order merge. The use of a higher order merge, however, has some other effects on the sort. To begin with, determining the next record to output from *k* runs naively requires *k* - 1 key comparisons; using a selection tree, this can be reduced to about ⌈log_{2}*k*⌉ comparisons per record.
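
The pass count ⌈log_{k}*m*⌉ can be computed exactly with integer arithmetic; the small helper below is my own, not from the text.

```python
def merge_passes(m, k):
    """Passes over the data made by a k-way merge starting from m runs."""
    passes = 0
    while m > 1:
        m = -(-m // k)    # ceiling division: runs remaining after one pass
        passes += 1
    return passes
```

For the 16 runs of figure 8.7, `merge_passes(16, 4)` gives 2 while `merge_passes(16, 2)` gives 4, matching the figures.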

Note, however, that the internal processing time will be increased slightly because of the increased overhead associated with maintaining the tree. This overhead may be reduced somewhat if each node represents the loser of the tournament rather than the winner. After the record with smallest key is output, the selection tree of figure 8.9 is to be restructured. Since the record with the smallest key value is in run 4, this restructuring involves inserting the next record from this run into the tree. The next record has key value 15. Tournaments are played between sibling nodes along the path from node 11 to the root. Since these sibling nodes represent the losers of tournaments played earlier, we could simplify the restructuring process by placing in each nonleaf node a pointer to the record that loses the tournament rather than to the winner of the tournament. A tournament tree in which each nonleaf node retains a pointer to the loser is called a *tree of losers.* Figure 8.11 shows the tree of losers corresponding to the selection tree of figure 8.9. For convenience, each node contains the key value of a record rather than a pointer to the record represented. The leaf nodes represent the first record in each run. An additional node, node 0, has been added to represent the overall winner of the tournament. Following the output of the overall winner, the tree is restructured by playing tournaments along the path from node 11 to node 1. The records with which these tournaments are to be played are readily available from the parent nodes. We shall see more of loser trees when we study run generation in section 8.2.3.
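
A tree of losers is compact to implement. The sketch below uses my own notation (runs are Python lists; an exhausted run presents a key of +∞): internal nodes 1 through k - 1 hold losers, the overall winner is kept separately in the role of node 0, and after each output a tournament is replayed along the leaf-to-root path, as described above.

```python
from math import inf

def loser_tree_merge(runs):
    """k-way merge of sorted lists using a tree of losers."""
    if not runs:
        return []
    k = len(runs)
    pos = [0] * k                               # next unread position in each run

    def key(i):                                 # leading key of run i (+inf if exhausted)
        return runs[i][pos[i]] if pos[i] < len(runs[i]) else inf

    loser = [0] * k                             # loser[1..k-1]: losing run at each node

    def build(t):                               # returns the run winning the subtree at t
        if t >= k:
            return t - k                        # leaf node t represents run t - k
        a, b = build(2 * t), build(2 * t + 1)
        winner, loser[t] = (a, b) if key(a) <= key(b) else (b, a)
        return winner

    champ = build(1)                            # overall winner ("node 0")
    out = []
    while key(champ) != inf:
        out.append(key(champ))
        pos[champ] += 1                         # advance the winning run
        t = (k + champ) // 2                    # parent of the winner's leaf
        while t > 0:                            # replay tournaments up to the root
            if key(loser[t]) < key(champ):
                loser[t], champ = champ, loser[t]
            t //= 2
    return out
```

Only the losers are stored in the internal nodes, so each replay compares the new record against the loser on its path, exactly the simplification described above.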

In going to a higher order merge, we save on the amount of input/output being carried out, and there is no significant loss in internal processing speed. Even though the internal processing time is relatively insensitive to the order of the merge, the decrease in input/output time is not as great as the reduction to ⌈log_{k}*m*⌉ passes suggests. This is so because the number of input buffers needed to carry out a *k*-way merge increases with *k*: with a fixed amount of internal memory available for buffers, a larger *k* forces smaller buffers and hence smaller blocks, increasing the number of blocks read and with it the contribution of seek and latency times.

**Example 8.1:** Assume that a two way merge is being carried out using four input buffers, *IN*(*i*), 1 ≤ *i* ≤ 4, and two output buffers, *OU*(1) and *OU*(2). Each buffer is capable of holding two records. The first few records of run 1 have key values 1, 3, 5, 7, 8, 9. The first few records of run 2 have key values 2, 4, 6, 15, 20, 25. Buffers *IN*(1) and *IN*(3) are assigned to run 1. The remaining two input buffers are assigned to run 2. We start the merging by reading in one buffer load from each of the two runs. At this time the buffers have the configuration of figure 8.12(a). Now runs 1 and 2 are merged using records from *IN*(1) and *IN*(2). In parallel with this the next buffer load from run 1 is input. If we assume that buffer lengths have been chosen such that the times to input, output and generate an output buffer are all the same then when *OU*(1) is full we have the situation of figure 8.12(b). Next, we simultaneously output *OU*(1), input into *IN*(4) from run 2 and merge into *OU*(2). When *OU*(2) is full we are in the situation of figure 8.12(c). Continuing in this way we reach the configuration of figure 8.12(e). We now begin to output *OU*(2), input from run 1 into *IN*(3) and merge into *OU*(1). During the merge, all records from run 1 get exhausted before *OU*(1) gets full. The generation of merged output must now be delayed until the inputting of another buffer load from run 1 is completed!

(iii) To simplify the discussion we assume that input and output buffers are to be the same size.

IN(i) ... input buffers, 1 ≤ i ≤ 2k

OUT(i) ... output buffers, 0 ≤ i ≤ 1

FRONT(i) ... pointer to first buffer in queue for run i, 1 ≤ i ≤ k

END(i) ... end of queue for i-th run, 1 ≤ i ≤ k

LINK(i) ... link field for i-th input buffer in a queue or for buffer in stack, 1 ≤ i ≤ 2k

LAST(i) ... value of key of last record read from run i, 1 ≤ i ≤ k

OU ... buffer currently used for output

procedure BUFFERING

1   for i <-- 1 to k do   //input a block from each run//
2     input first block of run i into IN(i)
3   end
4   while input not complete do end   //wait//
5   for i <-- 1 to k do   //initialize queues and free buffers//
6     FRONT(i) <-- END(i) <-- i
7     LAST(i) <-- last key in buffer IN(i)
8     LINK(k + i) <-- k + i + 1   //stack free buffer//
9   end
10  LINK(2k) <-- 0; AV <-- k + 1; OU <-- 0
    //first queue exhausted is the one whose last key read is smallest//
11  find j such that LAST(j) = min {LAST(i): 1 ≤ i ≤ k}
12  l <-- AV; AV <-- LINK(AV)   //get next free buffer//
13  if LAST(j) ≠ +∞ then [begin to read next block for run j into buffer IN(l)]
14  repeat   //KWAYMERGE merges records from the k buffers FRONT(i) into
             output buffer OU until it is full. If an input buffer becomes
             empty before OU is filled, the next buffer in the queue for
             this run is used and the empty buffer is stacked, unless its
             last key = +∞//
15    call KWAYMERGE
16    while input/output not complete do   //wait loop//
17    end
      if LAST(j) ≠ +∞ then
18    [LINK(END(j)) <-- l; END(j) <-- l; LAST(j) <-- last key read   //queue new block//
19    find j such that LAST(j) = min {LAST(i): 1 ≤ i ≤ k}
20    l <-- AV; AV <-- LINK(AV)]   //get next free buffer//
21    last-key-merged <-- last key in OUT(OU)
22    if LAST(j) ≠ +∞ then [begin to write OUT(OU) and read next block of
      run j into IN(l)]
23    else [begin to write OUT(OU)]
24    OU <-- 1 - OU
25  until last-key-merged = +∞
26  while output incomplete do   //wait loop//
27  end
28 end BUFFERING

2) For large *k *the algorithm KWAYMERGE uses a selection tree as discussed in section 8.2.1.

**Example 8.2:** To illustrate the working of the above algorithm, let us trace through it while it performs a three-way merge on the following three runs:

(Images 404_a_a.gif and 404_a_b.gif, showing the three runs and the trace of the buffer configurations, are not available.)

**Theorem 8.1:** The following is true for algorithm BUFFERING:

(i) There is always a buffer available in which to begin reading the next block; and

(ii) during the *k*-way merge the next block in the queue has been read in by the time it is needed.
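
Before moving on to run generation, the two phases of the overall sort can be sketched end to end in Python. This is an illustrative sketch, not the text's algorithm: ordinary files stand in for the disks, Python's `heapq.merge` plays the role of KWAYMERGE with its buffering (all runs are merged in one k-way pass), and the one-integer-per-line file format is an assumption.

```python
import heapq
import itertools
import os
import tempfile

def external_sort(infile, outfile, memory_records=750):
    # Phase 1: sort memory-sized segments internally, writing each out as a run.
    runs = []
    with open(infile) as f:
        while True:
            chunk = list(itertools.islice(f, memory_records))
            if not chunk:
                break
            keys = sorted(int(x) for x in chunk)
            tf = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
            tf.write("".join(f"{v}\n" for v in keys))
            tf.close()
            runs.append(tf.name)
    # Phase 2: k-way merge of all runs; only the leading record of each run
    # needs to be in memory at any one time.
    files = [open(r) for r in runs]
    with open(outfile, "w") as out:
        for v in heapq.merge(*((int(line) for line in fh) for fh in files)):
            out.write(f"{v}\n")
    for fh in files:
        fh.close()
    for r in runs:
        os.remove(r)
```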

Using conventional internal sorting methods such as those of Chapter 7 it is possible to generate runs that are only as large as the number of records that can be held in internal memory at one time. Using a tree of losers it is possible to do better than this. In fact, the algorithm we shall present will on the average generate runs that are twice as long as obtainable by conventional methods. This algorithm was devised by Walters, Painter and Zalk. In addition to being capable of generating longer runs, this algorithm will allow for parallel input, output and internal processing. For almost all the internal sort methods discussed in Chapter 7, this parallelism is not possible. Heap sort is an exception to this. In describing the run generation algorithm, we shall not dwell too much upon the input/output buffering needed. It will be assumed that input/output buffers have been appropriately set up for maximum overlapping of input, output and internal processing. Wherever in the run generation algorithm there is an input/output instruction, it will be assumed that the operation takes place through the input/output buffers. We shall assume that there is enough space to construct a tree of losers for *k* records, *R*(*i*), 0 ≤ *i* < *k*. This will require a loser tree with *k* nodes numbered 0 to *k* - 1. Each node, *i*, in this tree will have one field *L*(*i*). *L*(*i*), 1 ≤ *i* < *k*, represents the loser of the tournament played at node *i*. Node 0 represents the overall winner of the tournament. This node will not be explicitly present in the algorithm. Each of the *k* record positions *R*(*i*) has a run number field *RN*(*i*), 0 ≤ *i* < *k*, associated with it. This field will enable us to determine whether or not *R*(*i*) can be output as part of the run currently being generated. Whenever the tournament winner is output, a new record (if there is one) is input and the tournament replayed as discussed in section 8.2.1.
Algorithm RUNS is simply an implementation of the loser tree strategy discussed earlier. The variables used in this algorithm have the following significance:

R(i), 0 ≤ i < k ... the k records in the tournament tree

KEY(i), 0 ≤ i < k ... key value of record R(i)

L(i), 0 < i < k ... loser of the tournament played at node i

RN(i), 0 ≤ i < k ... the run number to which R(i) belongs

RC ... run number of current run

Q ... overall tournament winner

RQ ... run number for R(Q)

RMAX ... number of runs that will be generated

LAST__KEY ... key value of last record output

procedure RUNS
//generate runs using a tree of losers//

1   for i <-- 1 to k - 1 do   //set up fictitious run 0 to initialize tree//
2     RN(i) <-- 0; L(i) <-- i; KEY(i) <-- 0
3   end
4   Q <-- RQ <-- RC <-- RMAX <-- RN(0) <-- 0; LAST__KEY <-- +∞
5   loop   //output runs//
6     if RQ ≠ RC then [//end of run//
7       if RC ≠ 0 then output end of run marker
8       if RQ > RMAX then stop
9       else RC <-- RQ]
      //output record R(Q) if not fictitious//
10    if RQ ≠ 0 then [output R(Q); LAST__KEY <-- KEY(Q)]
      //input new record into tree//
11    if no more input then [RQ <-- RMAX + 1; RN(Q) <-- RQ]
12    else [input a new record into R(Q)
13      if KEY(Q) < LAST__KEY
14      then [//new record belongs to next run//
15        RQ <-- RQ + 1
16        RN(Q) <-- RQ; RMAX <-- RQ]
17      else [RN(Q) <-- RC]]
      //adjust losers//
18    T <-- (k + Q)/2   //T is parent of Q//
19    while T ≠ 0 do
20      if RN(L(T)) < RQ or (RN(L(T)) = RQ and KEY(L(T)) < KEY(Q))
21      then [TEMP <-- Q; Q <-- L(T); L(T) <-- TEMP   //T is the winner//
22        RQ <-- RN(Q)]
23      T <-- T/2   //move up tree//
24    end
25  forever
26 end RUNS
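
The effect of RUNS is easy to reproduce in Python. The sketch below is not a translation of the algorithm: it replaces the tree of losers with a plain O(k) scan (a `min` over the pool), but it assigns run numbers by the same rule as lines 11-17, so it produces the same runs.

```python
def replacement_selection(records, k):
    """Generate runs with the replacement-selection rule: a record whose key
    is smaller than the last key output cannot join the current run."""
    it = iter(records)
    pool = []                                   # (run_number, key) pairs, at most k
    for _ in range(k):                          # fill memory with up to k records
        try:
            pool.append((1, next(it)))
        except StopIteration:
            break
    runs, current = [], 0
    while pool:
        rn, key = min(pool)                     # winner: lowest run, then lowest key
        pool.remove((rn, key))
        if rn > current:                        # lowest live run number advanced:
            current = rn                        # start a new run
            runs.append([])
        runs[-1].append(key)
        try:
            nxt = next(it)                      # replace the record just output
            pool.append((current + (nxt < key), nxt))
        except StopIteration:
            pass
    return runs
```

On already sorted input the whole file becomes one run; on random input the runs average about 2k records, as claimed above.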

**Example 8.3:** A file containing 4,500 records, *R*_{1}, ..., *R*_{4500} is to be sorted using a computer with an internal memory large enough to hold only 800 records. The available external storage consists of 4 tapes, *T*1, *T*2, *T*3 and *T*4. Each block on the input tape contains 250 records. The input tape is not to be destroyed during the sort. Initially, this tape is on one of the 4 available tape drives and so has to be dismounted and replaced by a work tape after all input records have been read. Let us look at the various steps involved in sorting this file. In order to simplify the discussion, we shall assume that the initial run generation is carried out using a conventional internal sort such as quicksort. In this case 3 blocks of input can be read in at a time, sorted and output as a run. We shall also assume that no parallelism in input, output and CPU processing is possible. In this case we shall use only one output buffer and two input buffers. This requires only 750 record spaces.

t_{rew} = time to rewind tape over a length corresponding to 1 block

n t_{m} = time to merge n records from input buffers to output buffer using a 2-way merge

Example 8.3 performed a 2-way merge of the runs. As in the case of disk sorting, the computing time here too depends essentially on the number of passes being made over the data. Use of a higher order merge results in a decrease in the number of passes being made without significantly changing the internal merge time. Hence, we would like to use as high an order merge as possible. In the case of disks the order of merge was limited essentially by the amount of internal memory available for use as input/output buffers. A *k*-way merge required the use of 2*k* + 2 buffers. While this is still true of tapes, another, probably more severe, restriction on the merge order *k* is the number of tapes available. In order to avoid excessive tape seek times, it is necessary that runs being merged together be on different tapes. Thus, a *k*-way merge requires at least *k* tapes for use as input tapes during the merge.

procedure M1
//Sort a file of records from a given input tape using a k-way
merge, given that tapes T_{1}, ..., T_{k+1} are available for the sort//

1   Create runs from the input file, distributing them evenly over
    tapes T_{1}, ..., T_{k}
2   rewind T_{1}, ..., T_{k} and also the input tape
3   if there is only 1 run then return   //sorted file on T_{1}//
4   replace input tape by T_{k+1}
5   loop   //repeatedly merge onto T_{k+1} and redistribute back
           onto T_{1}, ..., T_{k}//
6     merge runs from T_{1}, ..., T_{k} onto T_{k+1}
7     rewind T_{1}, ..., T_{k+1}
8     if number of runs on T_{k+1} = 1 then return   //output on T_{k+1}//
9     evenly distribute from T_{k+1} onto T_{1}, ..., T_{k}
10    rewind T_{1}, ..., T_{k+1}
11  forever
12 end M1
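
A quick way to see the cost of step 9 (the helper is my own, not from the text): counting each k-way merge and each redistribution as one full pass over the file,

```python
def m1_passes(m, k):
    """Full passes over the data made by algorithm M1 on m initial runs:
    every k-way merge is one pass, every redistribution another."""
    passes = 0
    while m > 1:
        m = -(-m // k)      # runs left after merging k at a time (ceiling division)
        passes += 1         # the merge pass onto T_{k+1}
        if m > 1:
            passes += 1     # redistribution back onto T_1, ..., T_k
    return passes
```

For 16 runs and k = 2 this gives 7 passes, nearly double the 4 merge passes a 2-way balanced merge with enough tapes would need; avoiding the redistribution passes is the motivation for M2 and M3.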

The only difference between algorithms *M*2 and *M*1 is the redistributing strategy: instead of moving every run off the output tape, *M*2 leaves one *k*th of the runs in place, so each redistribution pass copies only (*k* - 1)/*k* of the file.

procedure M2
//same function as M1//

1   Create runs from the input tape, distributing them evenly over
    tapes T_{1}, ..., T_{k}
2   rewind T_{1}, ..., T_{k} and also the input tape
3   if there is only 1 run then return   //sorted file is on T_{1}//
4   replace input tape by T_{k+1}
5   i <-- k + 1   //i is index of output tape//
6   loop
7     merge runs from the k tapes T_{j}, 1 ≤ j ≤ k + 1 and j ≠ i, onto T_{i}
8     rewind tapes T_{1}, ..., T_{k+1}
9     if number of runs on T_{i} = 1 then return   //output on T_{i}//
10    evenly distribute (k - 1)/k of the runs on T_{i} onto tapes T_{j},
      1 ≤ j ≤ k + 1 and j ≠ i and j ≠ i mod (k + 1) + 1
11    rewind tapes T_{j}, 1 ≤ j ≤ k + 1 and j ≠ i
12    i <-- i mod (k + 1) + 1
13  forever
14 end M2

k    2k-way                  k-way
1    3/2 log_{2}m + 1/2      --
2    7/8 log_{2}m + 1/6      log_{2}m + 1
3    1.124 log_{3}m + 1/6    log_{3}m + 1
4    1.25 log_{4}m + 1/8     log_{4}m + 1

Algorithm *M*3 performs a *k*-way merge sort using 2*k* tapes.

procedure M3
//Sort a file of records from a given input tape using a k-way
merge on 2k tapes T_{1}, ..., T_{2k}//

1   Create runs from the input file, distributing them evenly over
    tapes T_{1}, ..., T_{k}
2   rewind T_{1}, ..., T_{k}; rewind the input tape; replace the input
    tape by tape T_{2k}; i <-- 0
3   while total number of runs on T_{ik+1}, ..., T_{ik+k} > 1 do
4     j <-- 1 - i
5     perform a k-way merge from T_{ik+1}, ..., T_{ik+k}, evenly
      distributing output runs onto T_{jk+1}, ..., T_{jk+k}
6     rewind T_{1}, ..., T_{2k}
7     i <-- j   //switch input and output tapes//
8   end
    //sorted file is on T_{ik+1}//
9  end M3

One should note that *M*1, *M*2 and *M*3 use the buffering algorithm of section 8.2.2 during the *k*-way merge. This merge itself, for large *k*, would be carried out using a selection tree as discussed earlier. In all cases the proper choice of buffer lengths and the merge order (restricted by the number of tape drives available) would result in an almost complete overlap of internal processing time with input/output time. At the end of each level of merge, processing will have to wait until the tapes rewind. Once again, this wait can be minimized if the run distribution strategy of exercise 4 is used.

Balanced *k*-way merging is characterized by a balanced or even distribution of the runs to be merged onto *k* input tapes. One consequence of this is that 2*k* tapes are needed to avoid wasteful passes over the data during which runs are just redistributed onto tapes in preparation for the next merge pass. It is possible to avoid altogether these wasteful redistribution passes, when using fewer than 2*k* tapes and a *k*-way merge, by distributing the "right number" of runs onto different tapes. We shall see one way to do this for a *k*-way merge utilizing *k* + 1 tapes. To begin, let us see how *m* = 21 runs may be merged using a 2-way merge with 3 tapes *T*1, *T*2 and *T*3. Lengths of runs obtained during the merge will be represented in terms of the length of initial runs created by the internal sort. Thus, the internal sort creates 21 runs of length 1 (the length of initial runs is the unit of measure). The runs on a tape will be denoted by *s^{n}*, where *s* is the run length (in units of initial runs) and *n* is the number of runs of that length on the tape.

                                          Fraction of
Phase   T1       T2       T3              Total Records Read
1       1^{13}   1^{8}    --              1        initial distribution
2       1^{5}    --       2^{8}           16/21    merge to T3
3       --       3^{5}    2^{3}           15/21    merge to T2
4       5^{3}    3^{2}    --              15/21    merge to T1
5       5^{1}    --       8^{2}           16/21    merge to T3
6       --       13^{1}   8^{1}           13/21    merge to T2
7       21^{1}   --       --              1        merge to T1

Phase    T1     T2     T3
n        1      0      0
n - 1    0      1      1
n - 2    1      0      2
n - 3    3      2      0
n - 4    0      5      3
n - 5    5      0      8
n - 6    13     8      0
n - 7    0      21     13
n - 8    21     0      34
n - 9    55     34     0
n - 10   0      89     55
n - 11   89     0      144
n - 12   233    144    0
n - 13   0      377    233
n - 14   377    0      610

Phase    T1     T2     T3     T4
n        1      0      0      0
n - 1    0      1      1      1
n - 2    1      0      2      2
n - 3    3      2      0      4
n - 4    7      6      4      0
n - 5    0      13     11     7
n - 6    13     0      24     20
n - 7    37     24     0      44
n - 8    81     68     44     0
n - 9    0      149    125    81
n - 10   149    0      274    230
n - 11   423    274    0      504
n - 12   927    778    504    0
n - 13   0      1705   1431   927
n - 14   1705   0      3136   2632
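
Both backward tables follow one recurrence: reading upward, the fullest tape's runs were produced by merging one run from every other tape. The helper below is my own formulation and regenerates the tables for any number of tapes; counts are kept sorted largest-first with the empty (output) tape last, whereas in the tables the tape identities rotate from phase to phase.

```python
def polyphase_levels(k, levels):
    """Perfect run distributions for a (k-1)-way polyphase merge on k tapes,
    generated backward from the sorted state (1, 0, ..., 0)."""
    state = [1] + [0] * (k - 1)
    rows = [tuple(state)]
    for _ in range(levels - 1):
        a = state[0]                # runs on the fullest tape
        # undoing one merge phase: each of those a runs consumed one run
        # from every other tape, which also kept state[i] leftover runs
        state = [a + state[i] for i in range(1, k)] + [0]
        rows.append(tuple(state))
    return rows
```

With 3 tapes the counts are consecutive Fibonacci numbers (13 + 8 = 21 matches the 21-run example); with 4 tapes, 7 levels give the 24 + 20 + 13 = 57 runs of Example 8.4.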

**Example 8.4:** The initial internal sort of a file of records creates 57 runs. Four tapes are available to carry out the merging of these runs. The table below shows the status of the tapes using a 3-way polyphase merge.

                                                   Fraction of
Phase   T1       T2       T3       T4              Total Records Read
1       1^{13}   --       1^{24}   1^{20}          1        initial distribution
2       --       3^{13}   1^{11}   1^{7}           39/57    merge onto T2
3       5^{7}    3^{6}    1^{4}    --              35/57    merge onto T1
4       5^{3}    3^{2}    --       9^{4}           36/57    merge onto T4
5       5^{1}    --       17^{2}   9^{2}           34/57    merge onto T3
6       --       31^{1}   17^{1}   9^{1}           31/57    merge onto T2
7       57^{1}   --       --       --              1        merge onto T1

**Theorem 8.2:** Any one-tape algorithm that sorts *n* records must take time *O*(*n*^{2}).

**Theorem 8.3:** *n* records can be sorted on 2 tapes in *O*(*n* log *n*) time if it is possible to perform an in-place rewrite of a record without destroying adjacent records, i.e., if record *R*_{2} in the sequence *R*_{1}*R*_{2}*R*_{3} can be rewritten by an equal-size record *R'*_{2} to obtain *R*_{1}*R'*_{2}*R*_{3}.

See the readings for chapter 7.

The algorithm for theorem 8.3 may be found in:

**1.** Write an algorithm to construct a tree of losers for *k* records with key values *KEY*(*i*), 0 ≤ *i* < *k*. Let the tree nodes be *T*_{0}, ..., *T*_{k-1}, with each *T*_{i}, 1 ≤ *i* < *k*, a pointer to the loser of a tournament and *T*_{0} a pointer to the overall winner. Show that this construction can be carried out in time O(k).

**2.** Write an algorithm, using a tree of losers, to carry out a k-way merge of k runs. Show that if there are n records in the k runs together, then the computing time is O(n log_{2}k).

**3.** a) *n* records are to be sorted on a computer with a memory capacity of *S* records (*S* << *n*). Assume that the entire *S* record capacity may be used for input/output buffers. The input is on disk and consists of *m* runs. Assume that each time a disk access is made the seek time is *t*_{s} and the latency time is *t*_{l}.

**4. **a) Modify algorithm *M*3 using the run distribution strategy described in the analysis of algorithm *M*2.

**5. **Obtain a table corresponding to Table 8.16 for the case of a 5-way polyphase merge on 6 tapes. Use this to obtain the correct initial distribution for 497 runs so that the sorted file will be on *T*1. How many passes are made over the data in achieving the sort? How many passes would have been made by a 5-way balanced merge sort on 6 tapes (algorithm *M*2)? How many passes would have been made by a 3-way balanced merge sort on 6 tapes (algorithm *M*3)?

**6. **In this exercise we shall investigate another run distribution strategy for sorting on tapes. Let us look at a 3-way merge on 4 tapes of 157 runs. These runs are initially distributed according to: 70 runs on *T*1, 56 on *T*2 and 31 on *T*3. The merging takes place according to the table below.

a) Show that the total number of passes made to sort 157 runs is 6-62/157.

**7.** a) Generate the first 10 rows of the initial distribution table for a 5-way Cascade merge using 6 tapes (see exercise 6).

**8.** List the runs output by algorithm RUNS using the following input file and *k* = 4.

100, 50, 18, 60, 2, 70, 30, 16, 20, 19, 99, 55, 20

**9.** For the example of Table 8.14 compute the total rewind time. Compare this with the rewind time needed by algorithm *M*2.