*Dr. Dobb's Journal* February 1998

Many programmers are familiar with the theoretical result that "no sorting algorithm can be faster than *O(n log n)* on average." But this result assumes that you only have limited information; it assumes you only know how to compare two items. However, what if your keys are the integers from 0 to 100? In that case, you can simply place each item directly into the correct location -- no comparisons are necessary. There are many other situations where you can guess the final position of each item. This month, Karl shows how this basic idea can be used to fashion an *O(n)* sorting algorithm.

-- Tim Kientzle

Most discussions of sorting algorithms are based on comparing elements. Although it's faster to classify elements, algorithms based on classification suffer because they require considerable auxiliary memory. The Flashsort1 algorithm I present here addresses this problem. It sorts in *O(n)* time and requires less than 0.1*n* auxiliary memory to sort *n* elements. I achieve this by using classification to do the long-range ordering with in-place permutation, then use a simple comparison method for the final short-range ordering of each class. I assume here that the elements are approximately evenly distributed real numbers, although the algorithm can be adapted to other types of data.

The algorithm (see Listing One) consists of three logical blocks: classification, permutation, and straight insertion. Classification determines the size of each class of elements. Permutation does long-range reordering to collect elements of each class together, and straight insertion does the final short-range ordering.

The whole idea of a classification sort is simple: Assuming the elements *A(i)* are about evenly distributed, you can compute the approximate final position directly from the element value, with no comparisons. If the maximal element is *Amax* and the minimal element is *Amin*, you can compute:

K(A(i))=1+INT((m-1)(A(i)-Amin)/(Amax-Amin))

The result will be a value from 1 to *m*, which I'll call the class of *A(i)*. Note that, on average, you have approximately *n/m* elements in each class. (In fact, the class *K*=*m* only has elements equal to *Amax*, and the remaining classes are slightly larger.)

Since the main part of the algorithm relies on it, I start by computing the actual number of elements in each class. This requires simply scanning the data and counting the total number of elements.

Since I know the size of each class, I can predict where each class will appear in the final array. The vector *L *is used to track this information. Initially, each *L(K) *indicates the upper end of the section that will contain the elements in class *K*. In particular, *L*(1) is simply the number of elements in class 1, and the final *L(m) *is equal to *n*. As I rearrange things, each *L(K) *is updated.

The first part of the classification is *O(n), *since it requires scanning the input to find the size of each class. Then there's an *O(m) *part, which adds up the class sizes to determine the initial *L(K) *values.

The next step is to move elements into the correct class. Obviously, whenever I move one element into its new location, I have to move the old element from that location. My code uses a single temporary variable to hold the evicted element. I can then compute its new location and continue until I'm forced to stop.

The first problem is determining where to put each element. Rather than try to place an element *A(i)* exactly into its final location, I simply compute its class index *K* and place it into the location indicated by *L(K)*. I then decrement *L(K)*. As I move elements into their appropriate classes, these class pointers grow downward.

This process ends when the first class has been filled up. At that time, there are generally still elements that need to be moved because permutations usually consist of several cycles. Therefore, the second problem arises: How do you find those elements that are cycle leaders? Fortunately, it's easy to find such elements, since they're located in the wrong class. I keep a variable *j* in my implementation and use *j* to search through the array to find the next item that satisfies the condition *j<L(K(A(j)))*. The corresponding item *A(j)* serves as the new cycle leader.

Each item is moved exactly once. I can simply count the total number of moves and stop when I've moved every element. (In fact, I can stop before that; if there's only one element unmoved, it must already be in the right place.)

This process is all *O(n)*. Each item is moved once, and *j* scans through the entire array at most once.

Once the permutation step has moved everything into approximately the correct place, you have a partially sorted array to work with. Following Robert Sedgewick (see "Implementing Quicksort Programs," *Communications of the ACM* 21, 1978), I use a straight-insertion sort over the entire array. Assuming the classes are about the same size, this requires *O((n/m)*^{2}*)* time for each class.

By adjusting the number of classes, you can fine tune the overall running time of the algorithm. If you decrease the number of classes, the classification time decreases, but the time for the straight-insertion sort increases. To estimate the optimal number of classes, I start by estimating the total time of the algorithm:

- The permutation step is
*O(n)*, as is part of the classification. This contributes*a*1*n*to the total time. - The rest of the classification is
*O(m)*, contributing*a*2*m*to the total. - The straight-insertion sort contributes
*a*3*m*(*n/m*)^{2}.

Here, *a*1, *a*2, and *a*3 are some numbers that I'll determine later.

I can use simple calculus to determine the optimum value. From differentiating *t*=*a*1*n*+*a*2*m*+*a*3*m*(*n/m*)^{2}, I get *dt/dm*=*a*2-*a*3(*n/m*)^{2}=0. Solving for *m*, I have *m/n*=(*a*3/*a*2). In particular, I don't need to estimate *a*1.

By assuming shifts were twice as expensive as comparisons, Nicolas Wirth ("Algorithm and Datenstrukturen," B.G. Teubner, 1983) arrived at *a*3=3/8 (using some arbitrary unit of time) for the time constant of straight-insertion sorting. Following his lead, I estimated *a*2=2, which gives an optimal *m* of 0.43*n*.

Figure 1 shows experimental values for the time of the various procedures as a function of *m/n*. As expected, the permutation part takes the same amount of time, regardless of *m*. Between about *m*=0.2*n* and *m*=0.5*n*, there is a broad minimum of the overall time. For larger classes, the straight insertion becomes excessive; and for smaller classes, it takes longer to perform the initial classification. In particular, if *m*>*n*, then many classes are empty, and straight insertion gains you no additional time.

Figure 2 shows the experimental frequency distribution of the run time for the different procedures at fixed *n* and fixed *m/n*. The time needed for classification is, of course, constant. The permutation time varies because of the varying time required to locate the next cycle leader. The straight-insertion sort varies depending on the size of each class. The distribution for the total run time reflects the combination of these two distributions.

The total time for Flashsort1 does not depend on the original order; partially ordered or reverse ordered elements are sorted just as fast as randomly ordered ones. However, if the values are not evenly distributed, the run time can increase unpredictably. One way to address this may be to use another classification and permutation step to sort large classes before the final straight insertion.

Figure 3 shows some experimental results on the relative run times. The separate curves indicate the running time of Flashsort1, Heapsort, and Quickersort using sequences of uniform random numbers as test numbers. As expected, the run time of Flashsort1 increases linearly with *n*. For *m*=0.1*n*, Flashsort1 is always faster than Heapsort, and is faster than Quickersort if *n*>80. For *n*=10000, Flashsort1 is approximately twice as fast as Quickersort. For *m*=0.2*n*, Flashsort1 is about 15 percent faster than for *m*=0.1*n*; for *m*=0.05*n*, it is about 30 percent slower, but if *n*>200, it is still significantly faster than Quickersort.

Flashsort1 sorts by permutation in-situ in *O(n)* time with only a short auxiliary pointer vector. In "Mathematical Analysis of Algorithms," Donald Knuth remarked "...that research on computational complexity is an interesting way to sharpen our tools for the more routine problems we face from day to day." As he points out, time effective in-situ permutation is inherently connected with the problem of finding the cycle leaders, and in-situ permutation could easily be performed in *O(n)* time if we were allowed to manipulate *n* extra "tag" bits specifying how much of the permutation has been carried out at any time. Without such tag bits, he concludes "it seems reasonable to conjecture that every algorithm for in-situ permutation will require at least *n log n* steps on the average." In this article, I showed how *n* elements can be put into *m* property classes with *m* being only a small fraction of *n*. Instead of *n* tag bits, I need only an auxiliary vector of length *m* to sort the elements into *m* classes. As a final step, small numbers of partially distinguishable elements are sorted locally within their classes by a conventional sort algorithm.

I'd like to express my thanks to Christopher McManus of the Veterans Affairs Medical Center in Washington, D.C., for stimulating discussions and helpful comments on the manuscript.

Duccin, F. "Tri par addressage direct," *R.A.I.R.O. Informatique/Computer Science* 13, (1979), 225-259.

Feurzeig, W. "Algorithm 23: Math Sort," *Communications of the ACM* 3, (1960), 601.

Gamzon, E., C.F. Picard. "Algorithme de tri par addressage direct," *C. R. Acad. Sc*. Paris 269, (1969), 38-41.

Knuth, D.E. "Mathematical Analysis of Algorithms," *Information Processing* 71, North Holland Publishing, 1972.

Macleod, I.D.G. "An Algorithm for In Situ Permutation," *The Australian Computer Journal* 2, (1970), 16-19.

Scowen, R.S. "Algorithm 271: Quickersort," *Communications of the ACM* 8, (1965), 669-670.

Williams, J.W.J. "Algorithm 232: Heapsort," *Communications of the ACM* 7, (1964), 347-348.

**DDJ**

SUBROUTINE FLASH1 (A. N. L. M)C SORTS ARRY A WITH N ELEMENTS BY USE OF INDEX VECTOR L C OF DIMENSION M WITH M ABOUT 0.1 N. C COPYRIGHT (C) K. - D. NEUBERT 1997 DIMENSION A(1),L(1) C ============================ CLASS FORMATION ===== ANMIN=A(1) NMAX=1 DO I=1,N IF( A(I).LT.ANMIN) ANMIN=A(I) IF( A(I).GT.A(NMAX)) NMAX=I END DO IF (ANMIN.EQ.A(NMAX)) RETURN C1=(M - 1) / (A(NMAX) - ANMIN)) DO K=1,M L(K)=0 END DO DO I=1,N K=1 + INT(C1 * (A(I) - ANMIN)) L(K)=L(K) + 1 END DO DO K=2,M L(K)=L(K) + L(K - 1) END DO HOLD=A(NMAX) A(NMAX)=A(1) A(1)=HOLD C =============================== PERMUTATION ===== NMOVE=0 J=1 K=M DO WHILE (NMOVE.LT.N - 1) DO WHILE (J.GT.L(K)) J=J + 1 K=1 + INT(C1 * (A(J) - ANMIN)) END DO FLASH=A(J) DO WHILE (.NOT.(J.EQ.L(K) + 1)) K=1 + INT(C1 * (FLASH - ANMIN)) HOLD=A(L(K)) A(L(K))=FLASH FLASH=HOLD L(K)=L(K) - 1 NMOVE=NMOVE + 1 END DO END DO C ========================= STRAIGHT INSERTION ===== DO I=N-2,1,-1 IF (A(I + 1).LT.A(I)) THEN HOLD=A(I) J=I DO WHILE (A(J + 1).LT.HOLD) A(J)=A(J + 1) J=J + 1 END DO A(J)=HOLD ENDIF END DO C =========================== RETURN,END FLASH1 ===== RETURN END