ONE-WAY HASH FUNCTIONS

Using cryptographic algorithms for hashing

Bruce Schneier

Bruce has an MS in Computer Science and has worked in computer and data security for a number of public and private concerns. He can be reached at 730 Fair Oaks Ave., Oak Park, IL 60302.


Pattern matching is usually a problem more suited for elementary programming classes than Dr. Dobb's Journal. But what if the strings are 100 Kbytes large? What if you have to search through a million of them looking for matches? And what if you have to try and match 1000 of these strings each day? With a simple string comparison algorithm, you will have to compare one billion strings, each requiring from 1 to 100,000 individual byte comparisons. And don't forget the hellacious storage and retrieval requirements. There has to be a better way.

Probabilistic algorithms are useful for jobs like this. They work -- most of the time -- and they have been applied to finding large random prime numbers, solutions to particularly convoluted graph theory problems, and pattern matching. The algorithm I am going to present only fails in one out of 2{128} (that's about 3 x 10{38}) tries. There isn't a piece of commercial software on the market that fails less often. You're much more likely to get a disk error reading the program, or a power surge that fries your whole system, than you are to have this algorithm fail. By the time this algorithm fails, our species will most likely have long killed itself off, or at least mutated into some different species that has better things to do with its time than sit around comparing a million 100-Kbyte strings.

We're going to take the computer analog of fingerprints; something small that uniquely identifies something much larger. Hash functions, which are functions that take a large input string and produce a smaller output string, perform the task nicely.

Hash Functions

One of the simplest hash functions is to take the XOR of every 128-bit block in the file, as in Example 1. This works well with random data; each 128-bit hash value is equally likely. However, with more rigidly formatted data, such as ASCII, the results are less than ideal. In most normal text files, the high-order bit of each byte is always 0. Therefore, 8 bits in the hash will always be 0. This means that the effective number of possible hashes is far less than 2{128} (or at least that each possible hash value is not equally likely), and different files are more likely to produce identical hashes.

Example 1: A simple hash function that uses the XOR of every 128-bit block in the file

  main (int argc, char *argv[]) {
      unsigned long hash[4] = {0, 0, 0, 0}, data [4];
      FILE *fp;
      int i;
      if ((fp == fopen (argv [1], "rb")) ! = NULL) {
         while ((fread (data, 4, 4, fp) ! = NULL)
            for (i=0; i<4; i++)

                hash[i] ^= data[i];
         fclose (fp);
         for (i=0; i<4; i++)

              printf ("%081x", hash [i]);
         printf ("\n");
      }
  }

An easy way to get around this is to roll the hash value 1 bit after each XOR. This way each bit is more likely to be throughly randomized by the time the entire file has been hashed, regardless of how the data looks beforehand. The program in Example 2 rolls each longword 1 bit to the right.

Example 2: Rolling each longword 1 bit to the right

      main (int argc, char *argv[]) {
         unsigned long hash [4] = {0, 0, 0, 0}, data [4];

         FILE *fp;
         int i;
         if ((fp == fopen (argv[1], "rb")) != NULL) {
          while ((fread (data, 4, 4, fp) != NULL)
              for (i=O; i<4; i++) {
              hash[i] ^= data[i];
              hash[i] = hash[i]>>1 ^ hash[i]<<31;
               }
          fclose (fp);
          for (i=0; i<4; i++)
               printf ("%081x", hash[i]);
          printf ("\n");

If the hash function is only going to deal with ASCII data, a further improvement might be to ignore spaces. This way, "HELLO WORLD" hashes to the same value as "HELLOWORLD." Of course, it also means that "The house holds one at seven bells" hashes to the same value as "The household son eats even bells," but nothing is perfect. Maybe the program should count the first space, but ignore any additional ones. Additional modifications that ignore punctuation and case might also be useful, or they might be detrimental. It all depends on the type of data involved.

These hash functions work great as long as there isn't a malicious user trying to gum up the works. Creating a file that hashes to a particular value is very easy with any of these functions. It's only slightly harder with more complicated functions. Usually, all that is required is to calculate the hash of any particular file, and then append a 128-bit string at the end of it to force the new file to hash to whatever value you'd like.

In certain applications this problem is very serious, but it can be solved by using a "one-way hash function." By "one-way," I mean that given an input file, it is very easy to calculate the hash value. However, given a particular hash value, it is, for all practical purposes, impossible to produce an input file that hashes to that value. The function has a trap door: It's easy to go one way, but hard to go back the other unless you know the "secret."

The MD5 Algorithm

In theory, any cryptographic algorithm can be used as a one-way hash function. For example, the Digital Encryption Standard (DES) algorithm (see the "C Programming" column, DDJ, September 1990), endorsed by the National Security Agency and used by countless government agencies and business interests, works quite well. Encrypt the entire file using the DES block-chaining mode, and the two final 64-bit encrypted blocks make up the hash. (ANSI X 9.9 suggests using only the last 32 bits of the last encrypted block, but that seems inadequate.) It is random, not reversible, and every bit of the hash is dependent on every bit of the input file. But in software, DES is slow; the algorithm is not at all optimized for this task. And even worse, if the DES key is compromised, then everyone knows the secret to the trap door and the one-way property of the hash function is lost.

An alternate algorithm, shown in Listing One (page 150), is called MD5. (Listing Two, page 151, is the header file for MD5.) Cryptographer Ron Rivest of MIT (he's the "R" in the RSA public-key algorithm) invented MD5 specifically as a one-way hash function. His personal needs for the algorithm revolve more around secure digital signatures than pattern matching, but we will come back to that later.

MD5 produces a 128-bit hash (or "Message Digest," hence the name) of an input file. The algorithm was designed for speed, simplicity, and compactness on 32-bit architectures. It is based on a simple set of primitive operations (that is, LOAD, ADD, XOR) on 32-bit operands. The algorithm also works on 8- and 16-bit microprocessors, albeit significantly slower.

The heart of the algorithm is MD5Update, which processes the input file in 512-bit chunks. Transform performs the basic MD5 algorithm. MD5Init kicks off the whole process. MD5Final is primarily concerned with the partial block at the end of the input file (only the rare input file can be divided exactly into 512-bit blocks). The final block is padded with a single 1 and a string of 0s, with the last word reserved for a binary representation of the number of bits in the original input file. After the algorithm churns away, the resultant hash is stored in mdContext.

It is straightforward to implement MD5 to solve the pattern matching problem at the beginning of this article. First, create and store a hash of each of the million 100-Kbyte input strings. Each one is only 16 bytes long, so you only need 16 Mbytes of storage space instead of 100 gigabytes. Every time you have a new test string, compare its hashes with those stored. Not only are the storage requirements four orders of magnitude smaller, you only have to make one 128-bit comparison per string pair. You'll get a false match every 10{27} years or so, so you might want to add code that explicitly checks strings that hash to the same value. I wouldn't even bother.

Putting One-Way Hash Functions to Work

One-way hash functions can be used to detect viruses, which make their living by surreptitiously modifying programs. Numerous security programs take simple hashes called Cyclic Redundancy Checks (CRCs). These programs hash clean files, and then compare the stored hash with a new hash of the file. If a virus infected the file, its hash would be different. The problem with most of these virus-detection programs is that a clever virus can make sure any virus-infected file hashes to the same value as the clean file. It would be harder to do this with CRC fingerprinters than with the simple hash programs listed earlier, but it would still be possible. MD5 makes it practically impossible. Even if the virus had complete knowledge of the MD5 algorithm, it could not modify itself so that the infected program would hash to the same value. All it could do would be to randomly append bits to the file and try to duplicate the hash by brute force. But even if we allow the virus to try a million possible hashes per second, it would take 10{25} years for it to successfully infect the file. I'm willing to live with the possibility of one successful virus infection on my system between now and when the sun goes nova.

As I mentioned, one-way hash functions have uses in digital signatures. A digital signature scheme is where one party can "sign" a document in such a way that no one except the signer could sign the document, and anyone can verify that the signer actually did sign the document. Most digital signature schemes involve public-key cryptography; they are computational nightmares and often painfully slow. Speed increases drastically if the digital signature algorithm operates on a hash of the document rather than the document itself. Because the chances of two documents having the same hash are only 1 in 2{128}, anyone can safely equate a signature of the hash with a signature of the document. If a conventional hash function were used, it would be a trivial matter to invent multiple documents that hashed to the same value, so that anyone signing a particular document would be duped into signing a multitude of documents. The protocol just could not work without one-way hash functions.

One-way hash functions can also be used by an archival system to verify the existence of documents without storing their contents. The central database could just store the hashes of files. It doesn't have to see the files at all; users submit their hashes to the database, and the database time stamps the submissions and stores them. If there is any disagreement in the future about who created a document and when, the database could resolve it by finding the hash in its files. Enhancements to this scheme, such as having the database use a digital signature protocol and sign a concatenation of the hash and a date/time code, would make this protocol even more useful. This has vast implications with regard to privacy: Someone could copyright a document