
8.1. Sorting

Don't recompute sort keys inside a sort.

Doing expensive computations inside the block of a sort is inefficient. By default, the Perl interpreter now uses merge-sorting to implement sort[*], which means that every sort will call the sort block O(N log N) times. For example, suppose you needed to set up a collection of script files for binary-chop searching. In that case, you might need to sort a set of scripts by their SHA-512 digests. Doing that the obvious way is needlessly slow, because each script is likely to be re-SHA'd several times:

[*] Prior to 5.8, the quicksort algorithm was used. Quicksort's average performance is slightly better than mergesort for small lists, but its worst-case performance can be significantly worse. With the move to mergesort, Perl's default sorting behaviour is now also "stable". That is, it preserves the ordering of list elements that are equivalent.
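As a minimal illustration of that stability (the example data here is invented), elements that compare equivalent under the sort criterion retain their original relative order:

    # "othello" and "macbeth" compare equal on length, so a stable sort
    # keeps "othello" ahead of "macbeth", matching their input order...
    my @plays  = qw( othello macbeth hamlet lear );
    my @sorted = sort { length($a) <=> length($b) } @plays;

    # Stable result: lear, hamlet, othello, macbeth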

    use Digest::SHA qw( sha512 );

    # Sort by SHA-512 digest of scripts
    @sorted_scripts
        = sort { sha512($a) cmp sha512($b) } @scripts;

Digests

SHA-512 is one of a family of cryptographic hash functions. It's something like Perl's built-in crypt function, but much more robust.

A hash function takes an input text of any size (or, sometimes, any size up to some ridiculously huge limit) and returns a sequence of bits, known as a digest, that identifies the original text. Such functions are also referred to as one-way hash algorithms, because a given input text always produces the same digest, but a digest cannot be reverse-engineered to recover the original text.

Cryptographic hash functions are also designed so that a prohibitive amount of computation would be required to discover two different input texts that produce the same digest.

Together, these properties ensure that two people can reliably compare the contents of two text files, without revealing those contents to each other or to any third party, simply by comparing the digests of the two files.
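For example, here is a minimal sketch of that kind of comparison, using the object-oriented interface of the standard Digest::SHA module (the file names are purely hypothetical):

    use Digest::SHA;

    # Compute a printable SHA-512 digest of each party's file...
    my $digest_A = Digest::SHA->new(512)->addfile('alice_copy.txt')->hexdigest();
    my $digest_B = Digest::SHA->new(512)->addfile('bob_copy.txt')->hexdigest();

    # Barring a (computationally prohibitive) deliberate collision,
    # the digests match if, and only if, the contents match...
    print "Contents are identical\n" if $digest_A eq $digest_B;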

See the documentation of the standard Digest module for an overview of the numerous cryptographic hash functions available in Perl.


A better solution is to cache the SHA values you've computed and avoid recalculating them. There are various standard techniques for doing that. The most straightforward is known as the Orcish Manœuvre[*]:

[*] Joseph Hall, who invented the technique, dubbed it "Orcish" because it "ORs a cache". Aren't you sorry you asked?


    

    # Sort by SHA-512 digest of scripts
    # (optimized with an on-the-fly key cache)
    @sorted_scripts = do {
        my %sha512_of;
        sort { ($sha512_of{$a} ||= sha512($a)) cmp ($sha512_of{$b} ||= sha512($b)) }
             @scripts;
    };

The sort block uses a hash (%sha512_of) to record each digest it computes. So, when comparing two scripts, it first checks to see whether it already knows their SHA-512 values ($sha512_of{$a} cmp $sha512_of{$b}). If either look-up fails to find a cached value for the digest of its key, it returns undef instead, in which case the right-hand side of the corresponding ||= is evaluated. The sha512() call there computes the appropriate digest, which is then assigned to the corresponding entry of the look-up table, for future reference.

Note the use of a do block to limit the scope of the %sha512_of cache to the particular call to sort. If two or more sorts were likely to be applied to some of the same scripts, it would be more efficient to preserve the cache between sorts by moving the declaration of %sha512_of to some suitable outer scope, in which case the do block would no longer be required:


    

    # Declare cache
    my %sha512_of;

    # and later...

    # Sort by SHA-512 digest of scripts
    # (optimized with an on-the-fly key cache)
    @sorted_scripts
        = sort { ($sha512_of{$a} ||= sha512($a)) cmp ($sha512_of{$b} ||= sha512($b)) }
               @scripts;

Alternatively, rather than building the cache on the fly, you could precompute all the digests and store them in a look-up table via a sliced assignment:


    

    # Sort by SHA-512 digest of scripts
    # (optimized with a precomputed key cache)
    my %sha512_of;
    @sha512_of{@scripts} = map { sha512($_) } @scripts;

    @sorted_scripts = sort { $sha512_of{$a} cmp $sha512_of{$b} } @scripts;

The resulting code is much cleaner and just as efficient, unless there is a chance that the sort block might throw an exception and prematurely abort the sorting, in which case you'll have done some of that precomputation needlessly.

Another approach would be to prebuild a list of digest/script pairs, sort on the digests, and then keep only the original scripts:


    

    # Sort by SHA-512 digest of scripts
    # (optimized with the Schwartzian Transform)
    @sorted_scripts = map  { $_->[0] }                # 3. Extract only scripts
                      sort { $a->[1] cmp $b->[1] }    # 2. Sort on digests
                      map  { [$_, sha512($_)] }       # 1. Precompute digests, store with scripts
                      @scripts;

This pipelined solution is known as the Schwartzian Transform. Note the special layout, with the three steps lined up under each other. This format is used because it emphasizes the characteristic map-sort-map structure of the transform, making it much easier to identify when the technique is being used.
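The same three-stage structure applies to any expensive key computation. For instance, here is a minimal sketch (with a hypothetical @files variable) that sorts file names by file size, where each -s would otherwise cost a separate stat call inside the sort block:

    # Sort file names by file size
    # (a sketch of the same map-sort-map structure)
    my @sorted_files = map  { $_->[0] }                # 3. Extract only names
                       sort { $a->[1] <=> $b->[1] }    # 2. Sort on sizes
                       map  { [$_, -s $_] }            # 1. Precompute sizes, store with names
                       @files;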

Probably the cleanest and most maintainable solution (albeit slightly slower than direct caching or precomputation) is to memoize the sha512() subroutine itself:


    use Digest::SHA qw( sha512 );

    
    # Make the SHA-512 digest function self-caching
    use Memoize;
    memoize('sha512');

    # Sort by auto-cached SHA-512 digest of scripts
    @sorted_scripts = sort { sha512($a) cmp sha512($b) } @scripts;

Memoizing a subroutine causes it to remember every value it ever returns and to immediately return that same value (without recomputing it) the next time the subroutine is called with the same arguments.
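In other words, memoization wraps the subroutine in a cache keyed on its arguments. Here is a minimal hand-rolled sketch of the idea, for a single-argument, scalar-returning subroutine (the real Memoize module is considerably more sophisticated, handling list context, argument normalization, and so on):

    my %cache;
    my $memoized_sha512 = sub {
        my ($text) = @_;

        # Compute the digest only on the first call for any given text...
        if (!exists $cache{$text}) {
            $cache{$text} = sha512($text);
        }

        return $cache{$text};
    };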

The Memoize module is standard in Perl 5.8 and later, and available from the CPAN for earlier versions of Perl. It's a very clean way to introduce caching into your code without littering that code with the ugly mechanics of an explicit cache. See Chapter 19 for a more detailed discussion of caching and memoization.

All of the previous solutions have distinct performance characteristics and trade-offs, which are likely to vary depending on the size of the list to be sorted, the complexity of the function you're using to compare keys, and even the platform you're running on. In some cases, it might even be faster to just leave the (re-)computations in the sort block. So, if you decide to optimize your sort using one of these techniques, it's vitally important to benchmark the performance of whichever approach you decide to use. See Chapter 19 for more discussion of benchmarking. The "Automating Sorts" guideline later in this chapter suggests another easy way to build optimized sorts.
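For example, a minimal benchmarking sketch using the standard Benchmark module might compare the naive recomputing sort against the Schwartzian Transform (the @scripts data, and hence the relative results, will vary from situation to situation):

    use Benchmark qw( cmpthese );
    use Digest::SHA qw( sha512 );

    # Run each alternative for at least two CPU seconds (-2)
    # and report their relative speeds...
    cmpthese( -2, {
        naive       => sub {
            my @sorted = sort { sha512($a) cmp sha512($b) } @scripts;
        },
        schwartzian => sub {
            my @sorted = map  { $_->[0] }
                         sort { $a->[1] cmp $b->[1] }
                         map  { [$_, sha512($_)] }
                         @scripts;
        },
    });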
