Evaluating Data-Compression Algorithms

Finding the right tool for the right job

G. Jason Mathews

Jason is a computer engineer at the National Space Science Data Center Interoperable Systems Office. He can be contacted at mathews@nssdc.gsfc.nasa.gov.


The large amount of data collected by NASA and other scientific organizations makes data compression a necessary part of archival and data-management systems. For instance, data compression is being evaluated for incorporation into the Common Data Format (CDF), a data-management package developed and maintained by the National Space Science Data Center of the NASA/Goddard Space Flight Center (GSFC) for storing, manipulating, and accessing multidimensional data sets. CDF is capable of storing any type of data, including scalar data items, vectors, and multidimensional arrays of data.

Regardless of dimensionality, all data types break down into a sequence of bytes that can be fed into a compression function. The diversity of data types and the need for high-speed access call for a general-purpose, fast, simple data-compression algorithm. As a member of the team designing and implementing CDF, I compared several data-compression approaches in search of a suitable, general-purpose algorithm. In this article, I'll present the results of that evaluation. While specifically targeted at CDF, the results are applicable to a variety of platforms.

An ideal data-compression algorithm would compress all data at the highest compression rate in the least amount of time. No such algorithm exists, however, since the measurement criteria are both data and application dependent. For example, one developer might want a high compression ratio regardless of time, while another might need fast compression with some sacrifice of compression ratio. There are trade-offs among the following: time to compress and decompress the data, compression rates, memory requirements (hash tables, frequency counts, temporary buffers, and so on), and algorithmic complexity.

CDF is a programming interface for applications whose goal is to provide fast access to the data; therefore, the compression algorithm should be fast. There isn't much need for compression if most of the data cannot be compressed, so we needed a general-purpose compression algorithm that compresses most data with a good compression ratio. The compression ratio is defined as the percentage reduction: (1 - compressed data size/uncompressed data size) x 100. One hundred percent compression is not possible unless the compressed size is zero, which occurs only if the uncompressed data size is zero as well; therefore, the upper limit is 100 percent, and actual ratios can only approach it. If the compressed data size is larger than the uncompressed data size, the data has expanded and the ratio is negative; without a lower limit, there is no bound on how far the ratio can fall. Setting the lower limit to zero means that files which fail to compress are reported at a ratio of 0, rather than being stored in expanded form. A good compression ratio is defined as a significant data reduction over all of the test data as a whole for a particular compression algorithm. Furthermore, the CDF interface is a library and is linked into all CDF applications, so it's also important that the compression algorithm not dramatically increase the application size.
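
The ratio calculation itself is straightforward. The following fragment is a minimal sketch (the function name is mine, not part of the CDF library) that computes the percentage reduction and applies the zero lower limit just described:

/* Percentage reduction, floored at zero so that data which expands
 * is reported as "not compressed" rather than as a negative ratio.
 * Illustrative only; not part of the CDF library. */
#include <stdio.h>

double compression_ratio(long uncompressed, long compressed)
{
    double ratio;

    if (uncompressed <= 0)
        return 0.0;                       /* nothing to compress */

    ratio = (1.0 - (double) compressed / (double) uncompressed) * 100.0;
    return (ratio < 0.0) ? 0.0 : ratio;   /* zero lower limit */
}

int main(void)
{
    printf("%.1f%%\n", compression_ratio(144000L, 48000L));   /* 66.7 */
    printf("%.1f%%\n", compression_ratio(100L, 120L));        /* 0.0: data expanded */
    return 0;
}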

Finally, portability is important, since both the CDF software and data sets are portable to DEC Alpha, DECstation, HP 9000, PC, RS/6000, Macintosh, NeXT, SGI, Sun, and VAX platforms. The hallmark of the CDF concept is its data-set independence, so a compression implementation with minimum architecture- and machine-level dependencies is preferable.

Data Characteristics

CDF can handle data conceptually organized as scalars or as multidimensional arrays of up to ten dimensions. It supports the data types available with C and Fortran compilers on most systems, which include 1-, 2-, and 4-byte signed and unsigned integers, 4-byte single-precision floating point, 8-byte double-precision floating point, and 1-byte signed and unsigned character types. For example, one data item may be defined as a single scalar 1-byte integer and another as a 512x512x16 array of double-precision floating-point numbers. Compressing a single scalar byte is not very productive, while compressing a large array or stream of bytes yields vastly different results. Likewise, a two- or three-byte value cannot be reduced to fewer whole bytes, but a 4-byte quantity could possibly be reduced to three bytes, so four bytes is chosen as the minimum file size. Since most data stored in CDFs typically have fewer than three dimensions (for example, scalar data types, three-element vectors, and two-dimensional images), the sizes of the sample data files are arbitrarily chosen to range from 4 bytes to 144 KB. Although larger files achieve higher compression ratios, smaller data files better represent the typical size of data stored as components within a CDF.

To evaluate a general-purpose compression algorithm, I needed a representative collection of sample data. I turned to the graphics, text, executables, and sound-data samples presented in Dr. Dobb's Journal; see "DDJ Data Compression Contest" (DDJ, February 1991) and "DDJ Data Compression Contest Results," by Mark Nelson (DDJ, November 1991). While it is unlikely that MS-DOS executable programs will be stored in CDFs, these files have their own structure, and since the goal was to find a general-purpose compressor for all types of data, I included all of the files used by the DDJ tests. In addition, a subset of scientific data from the International Solar-Terrestrial Physics (ISTP) Key Parameter Data CD-ROM was included, since it represents the typical values and types of data that CDF actually stores (data for ISTP spacecraft orbit, attitude, magnetic field and particle measurements, and images).

The compression programs I evaluated included implementations of variants of the LZ77, LZ78, LZW, Huffman, run-length encoding, and arithmetic-coding algorithms. Table 1 lists the programs included in the MS-DOS compression tests.

CHURN, the compression utility program that accompanies Mark Nelson's Data Compression Book, Second Edition (M&T Books, 1995), is capable of running compression and decompression programs on all data files in a specified disk volume under MS-DOS. CHURN also measures the elapsed compression and decompression times, and verifies that the contents of the decompressed files match the original data files. CHURN is called with three arguments: a drive letter and pathname to search recursively for files to compress, a compression command, and a decompression command. The compression command tells CHURN how to compress the input file to a file called TEST.CMP. CHURN executes the compression command by passing the command line to DOS using the system() function call. The program inserts the filename into the compression command by calling sprintf() with the filename as an argument, so if the compression command contains a %s anywhere, the name of the input file is substituted there. The third argument is the command CHURN needs to decompress TEST.CMP to TEST.OUT; for example, CHURN D:\DATA\ "LZSS-C %%s test.cmp" "LZSS-D test.cmp test.out". The double % symbols defeat variable substitution under some command-line interpreters, such as 4DOS. A more complicated example that tests PKZIP might be CHURN D:\DATA\ "PKZIP.BAT %%s" "PKUNZIP TEST.CMP", where PKZIP.BAT has two lines:

COPY %1 TEST.OUT
PKZIP -M TEST.CMP TEST.OUT
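
In essence, the substitution-and-execute step amounts to a sprintf() followed by a system() call. The fragment below is a simplified sketch of that idea, not CHURN's actual source; the function name, filename, and buffer size are mine:

/* Simplified sketch of how a driver such as CHURN can run an external
 * compressor on one input file: substitute the filename into the command
 * template with sprintf(), then hand the result to DOS via system().
 * Illustrative only; names and buffer size are assumptions. */
#include <stdio.h>
#include <stdlib.h>

int run_compressor(const char *command_template, const char *input_file)
{
    char command[256];

    /* "LZSS-C %s test.cmp" plus "D:\DATA\SAMPLE.DAT" becomes
       "LZSS-C D:\DATA\SAMPLE.DAT test.cmp" */
    sprintf(command, command_template, input_file);
    return system(command);   /* nonzero return usually indicates failure */
}
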
CHURN also creates a file called CHURN.LOG containing a summary of the compression results, which can be used for further analysis by other programs. For this article, CHURN.LOG was reformatted to generate the graph in Figure 1, where the numbers correspond to the items in Table 1. CHURN and the corresponding compression programs were tested on a 486/33 PC running MS-DOS 6.20.

The DOS COPY command (item 14 in Table 1) establishes the baseline time to copy a file byte by byte with no compression. No compression program should be able to read the entire input file, analyze the data, and write the compressed results in less time than a straightforward copy, yet some programs approach this time while still compressing significantly. Consequently, the overhead of compression need not noticeably hurt performance if one of the faster algorithms is used.

The DCT program appears to have the highest compression ratio, but it is a lossy algorithm included in the test suite only for comparison, since lossy algorithms do not meet the requirement for full reconstruction of the data. The COMPRESS, GZIP, LZRW1, and LZRW3A programs, however, all achieve high compression ratios with fast execution times.

I also conducted tests on the six algorithms listed in Table 2, using the Computerized Reduction Using Selected Heuristics (CRUSH) data-compression program. CRUSH is a portable, multimethod utility that incorporates different algorithms behind a common interface for VMS and UNIX systems. CRUSH takes a different approach from CHURN: It links the compression and decompression algorithms into one program rather than calling external programs. CRUSH is flexible because you can either compress files with a specific algorithm or select automatic mode, which tries all algorithms and keeps the best result. Automatic mode ensures that the data get the maximum compression available from the given algorithms, but at a high cost in time. The algorithms included in CRUSH were developed by Ian H. Witten, Radford M. Neal, and John G. Cleary (Department of Computer Science, University of Calgary, Canada); David M. Abrahamson (Trinity College, Dublin, Ireland); Ross N. Williams (Renaissance Software, Adelaide, Australia); Robert F. Rice (JPL); and Pen-Shu Yeh and Warner Miller (GSFC). The source code for CRUSH (which is in the public domain) is available via anonymous ftp from dftnic.gsfc.nasa.gov (128.183.115.71) in the software/unix/crushv3 directory.
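
Automatic mode boils down to a simple selection loop: compress with every available method and keep the smallest output. The sketch below illustrates the idea; the method table and function signature are assumptions for illustration, not CRUSH's actual interface:

/* Sketch of an "automatic mode": try every available method and keep the
 * one that produces the smallest output. The types and names here are
 * illustrative assumptions, not CRUSH's real API. */
#include <stddef.h>

typedef size_t (*compress_fn)(const unsigned char *in, size_t in_len,
                              unsigned char *out, size_t out_max);

struct method {
    const char  *name;      /* e.g., "LZC", "LZRW1", "WNC" */
    compress_fn  compress;  /* returns compressed size, or 0 on failure */
};

/* Returns the index of the best method, or -1 if none succeeded.
 * Every method is run on the data, which is why automatic mode
 * costs so much time. */
int pick_best_method(const struct method *methods, int n_methods,
                     const unsigned char *in, size_t in_len,
                     unsigned char *out, size_t out_max)
{
    int    best = -1;
    size_t best_size = (size_t) -1;
    int    i;

    for (i = 0; i < n_methods; i++) {
        size_t size = methods[i].compress(in, in_len, out, out_max);
        if (size > 0 && size < best_size) {
            best_size = size;
            best = i;
        }
    }
    return best;
}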

I ran the CRUSH tests on a Sun 4 under SunOS 4.1. This machine was not dedicated, so the elapsed times may have been slightly affected by other activity on the system; to minimize this, the tests were run during off-hours. CRUSH compressed each data file using all six algorithms. The best algorithm varied from file to file; the overall results are shown in Figure 2. LZC has the highest compression ratio and one of the lowest times. The next-best class of algorithms with a high compression/time ratio comprises LZRW3A and LZRW1. WNC and ADAP offer comparable compression ratios, and ADAP is even slightly better than LZC, but both take much longer. Overall, the RICE algorithm does not work as well as a general-purpose compressor, although it works best on several individual files.

CRUSH has a single interface, with each compression algorithm having equivalent functionality, so code complexity can be estimated from the size of the corresponding object-code files. This is a rough approximation, but, along with the other measurements, it gives some idea of how much is involved in implementing each algorithm. Since the compression programs tested on the IBM PC each have different functionality, error handling, and other bells and whistles, their object and executable sizes are not reliable measures of complexity. For this reason, the complexity estimates include only the algorithms available through the common CRUSH interface.

Table 3 ranks compression ratio, speed, and complexity from 1 to 4 (1 = worst in its category; 4 = best in its category). The compression-ratio and speed (total elapsed time) rankings are extracted from the results of running CRUSH with the test data and scaled appropriately.

The overall score is a weighted sum (10 x compression rank + 10 x time rank + 5 x complexity rank) with a maximum of 100. The complexity weight is lower than the other two because it is the least reliable measurement. The weights can be adjusted according to the measurements' priority.
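
The scoring is plain arithmetic on the ranks. For instance, this short fragment reproduces the Table 3 figures for LZC and LZRW1 (the function name is mine, purely for illustration):

/* Overall score = 10*ratio rank + 10*time rank + 5*complexity rank.
 * With all ranks at their maximum of 4, the score is 40 + 40 + 20 = 100. */
#include <stdio.h>

static int overall_score(int ratio_rank, int time_rank, int complexity_rank)
{
    return 10 * ratio_rank + 10 * time_rank + 5 * complexity_rank;
}

int main(void)
{
    printf("LZC   %d\n", overall_score(4, 3, 2));   /* 40 + 30 + 10 = 80 */
    printf("LZRW1 %d\n", overall_score(2, 4, 4));   /* 20 + 40 + 20 = 80 */
    return 0;
}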

Another study, conducted by Greg Flint of Purdue University, compares 47 compression programs over various large collections of data. The raw test results are available as arctst03.zip via anonymous ftp from any SimTel archive site, such as oak.oakland.edu (141.210.10.117), in the SimTel/msdos/info directory. Upon preliminary examination, those results were consistent with the ones presented in this article.

Results

According to my tests, the best class of programs is based on the Lempel-Ziv adaptive dictionary approach. Arithmetic compression has a high compression ratio but is the slowest. GZIP and LZC (UNIX compress) achieve the highest compression ratios with the lowest elapsed times; they are the best external, stand-alone compression programs, and they are portable to MS-DOS, UNIX, VMS, and Macintosh platforms. The two algorithms best suited for incorporation as small, simple functions into a much larger data-management package are LZRW1 and LZRW3A.

Unfortunately, the LZC implementation, which is based on the LZW algorithm, is covered by Unisys patents (U.S. 4,464,650 and 4,558,302). While Unisys has stated it will license the algorithm to modem manufacturers, the company has not expressly said how the patent applies to other products; this is one reason for the CompuServe GIF/Unisys controversy. The LZRW algorithms are also covered by one or more patents. (Ross Williams, the author of these two algorithms, believes that the patents are discouraging implementation of the compression techniques.) The GZIP algorithm (a variation of LZ77), however, is free software and is free of patents. GZIP is fast and compresses well, but its code is neither small nor simple to incorporate into a large data-management package.

CDF is distributed worldwide as public-domain software and as part of several commercial data-analysis and visualization packages, such as RSI's Interactive Data Language (IDL) and IBM's Visualization Data Explorer; thus, there may be legal complications that are beyond the scope of this article. Aside from these issues, these four LZ-based algorithms (GZIP, LZC, LZRW1, and LZRW3A) are recommended as good, general-purpose data-compression algorithms.

Future Plans

The future plans for CDF are to incorporate LZC, and possibly another data-compression algorithm, into a test version of the CDF data-management package, where compression and decompression will be transparent to CDF applications. Various techniques will be tried to determine what to compress so as to allow efficient access and minimize the impact on performance.

References

Goucher, G.W. and G.J. Mathews. A Comprehensive Look at CDF, NSSDC/WDCA-R&S 94-07. August 1994. (http://nssdc.gsfc.nasa.gov/cdf/cdf_home.html).

Nelson, Mark R. "DDJ Data Compression Contest Results." Dr. Dobb's Journal (November 1991).

------. The Data Compression Book, Second Edition. San Mateo, CA: M&T Books, 1995.

Ross, E. "A Simple Data Compression Technique." C/C++ Users Journal (October 1992).

Teague, M. "ISTP Key Parameter Available on CD-ROM." Solar-Terrestrial Energy Program (STEP) International, Vol. 4, No. 7. July 1994.

Williams, R.N. "An Extremely Fast Ziv-Lempel Data Compression Algorithm." Proceedings Data Compression Conference '91.

Ziv, J. and A. Lempel. "A Universal Algorithm for Sequential Data Compression." IEEE Transactions on Information Theory, Vol. 23, No. 3, May 1977.

Table 1: Data-compression programs tested with CHURN. *Source/executables available on SimTel anonymous ftp sites (for example, oak.oakland.edu). **Source available on disks provided with Mark Nelson's Data Compression Book.

     Item     Program              Description

       1      ahuff**              Adaptive Huffman coding
       2      ar**                 Haruhiko Okumura's archiver
       3      arith**              Arithmetic coding
       4      arith1**             Order-1 arithmetic coding
       5      arj*                 Robert Jung's ARJ, v2.41a
       6      arth1e**             Order-1 arithmetic coding
       7      cmp**                Linear-conversion scheme for sound
       8      compress*            UNIX compress uses LZW variant
       9      cmp40-14             Compress w/COMP40 option + 14 bit
      10      comp40               Compress with COMP40 option
      11      comprs12             Compress 12 bit
      12      comprs14             Compress 14 bit
      13      comp_opt             Modified/optimized compress
      14      copy                 MS-DOS COPY command
      15      dct**                Discrete cosine transformation
      16      dogzip*              gzip-like LZ77 variant
      17      doz*                 UNIX-like LZW
      18      gzip*                Gnu ZIP uses LZ77 variant
      19      huff*                Huffman coding
      20      lha*                 Haruyasu Yoshizaki's LHA v2.13
      21      lzari*               LZSS with adaptive arithmetic coding
      22      lzhuf*               LZSS with adaptive Huffman coding
      23      lzrw1                Ross Williams's LZ variant
      24      lzrw3a               Ross Williams's LZ variant
      25      lzss*                Storer-Szymanski modified LZ77
      26      lzss**               LZ77 with 12-bit sliding window
      27      lzw12**              12-bit LZW compression
      28      lzwcom*              Kent Williams's LZW variant
      29      pkarc*               PKARC v3.6
      30      pkzip (default)*     PKZIP v2.04g LZ77 variant
      31      pkzip -ex*           PKZIP using maximal compression
      32      rdc                  Ross Data Compression uses RLE
      33      snd**                Silence compression coding
      34      zoo*                 Rahul Dhesi's Zoo archiver, v2.1
 

Table 2: Data-compression algorithms tested with CRUSH.

     Method Name     Algorithm Description

     ADAP            Adaptive dependency WNC method.
     LZC             Lempel/Ziv/Welch UNIX compress.
     LZRW1           Fast Lempel-Ziv method.
     LZRW3A          Fast Lempel-Ziv method.
     RICE            Rice machine.
     WNC             Witten/Neal/Cleary arithmetic code.

Table 3: Ranking of six algorithms tested with CRUSH.

         Compression                 Complexity       Overall Score
Method      Ratio         Speed    (Obj. Code Size)     (Max=100)

LZC           4             3             2                 80
LZRW1         2             4             4                 80
LZRW3A        3             3             3                 75
ADAP          4             1             3                 65
WNC           2             1             2                 40 
RICE          1             2             2                 40

Figure 1: CHURN compression results for 30 programs.

Figure 2: CRUSH compression results.
