Data Structures Using C: 9.3.3 Random Hashing

9.3.3 Random Hashing

A great deal of literature in computer science is devoted to the selection of hashing functions and collision resolution policies. To give some idea of what can be achieved, this section states results for an idealized scheme that assumes the hashing function to be perfectly random. Given any key value, the scheme assigns an address to that key value by picking randomly from among all the addresses of the table. This random selection is made so that each of the m addresses in a table of size m is equally likely to be selected. The table is built by inserting, in turn, each given key value. A key value is stored by determining its hash address. If that address is empty, the key value is stored there. If that address is occupied, the collision resolution policy works as follows. Another random hash function again selects an address at random. If that address is empty, the key value is stored there. If it is occupied, this procedure is repeated until the key value is finally entered. Thus an insertion of a key value may require the generation of a sequence of hash addresses ending with the actual final storage address.

After the table is built, searches are performed as follows. Trace the sequence of hash addresses that were generated to store the search key (if it is in the table) or that would have been generated (had the key been stored). The basic assumption made here is that the hash functions generate precisely the same sequence of hash addresses for the search as for building the table. This may seem like a strange way for random hash functions to work, but it is our assumption nevertheless. Table 9.3 shows the theoretical results for such a scheme, when n keys are stored in the table of size m.

The successful and unsuccessful search columns are given approximately by the functions (1/a )1n 1/(1-a) and 1/(1-a) where a = n/m. The remarkable fact is that the average search time does not depend on n, the number of records stored in the table, but only on the usage ratio, n/m. Thus storage can be traded for speed in hash table searches. For example, if n is 10,000 and m is 11,112, a is about 0.90, and an average of only 2.56 probes will be required for a successful search. Contrast this with the 5,000 comparisons or probes needed for the linear search in the random case, or the 12.29 comparisons or probes needed for a binary search. Actual hashing functions and collision resolution policies can come close to achieving this kind of average behavior. The worst-case times for hash table searching can, however, be quite large.

Hash tables are not a searching panacea. A major disadvantage is that entries will not be in order by key. The only way to access the records in sorted order is to sort them. However, if this isn't important, hashing is often a good approach.

Table 9.3 Average Number of Probes for Searches

                 Successful  Unsuccessful

Usage Ratio n/m   Search       Search

-----------------------------------------

      .25          1.15         1.33

      .50          1.39         2.00

      .75          1.85         4.00

      .80          2.01         5.00

      .90          2.56        10.00

      .95          3.15        20.00