Data Structures Using C: 9.3.7 Hashing Overview

9.3.7 Hashing Overview

Let us review the advantages and disadvantages of hashing. First, note that some estimate is needed in advance for the maximum number of entries to be stored in a hash table, since its size must be declared. This is important; too high an estimate wastes space, and too low an estimate means a large usage ratio and poorer performance. Second, the hash function must be carefully chosen so that it can be computed quickly and generates relatively few collisions.

The main characteristic of hashing is its excellent average search, insertion, and deletion time. However, when deletions are frequent, care is required in deciding which collision resolution policy to use. The major disadvantages of hashing are the possibility of large worst-case times and the lack of support for traversal through the records in sorted order. Also, the only information obtained from an unsuccessful search is that the record isn't in the table. In the data structure considered next, more information may be obtained.

Choosing the appropriate hashing technique involves consideration of time and storage as well as the relative frequency with which successful searches, unsuccessful searches, and deletions will occur. However, some generalizations can be made.

Using open addressing to resolve collisions can be quite efficient for moderate usage ratios. Although the average times for successful searches depend on whether simple linear probing or more complex probing such as double hashing is used, the differences are small. These differences are more significant for unsuccessful searches, so more complex probing is desirable when they will occur frequently. Ordered hash tables can significantly reduce the search times in this case. For larger usage ratios the chaining methods have superior search times.

With open addressing, when deletions are frequent, the table can actually be storing relatively few entries, with many table positions simply marking deleted entries. Thus the table performance is governed by a usage ratio that does not reflect the actual, much smaller, number of positions used to store records. Separate chaining uses storage more efficiently in this case, and deletion is easy. Separate chaining also has better average time behavior than the other methods.

When records are small, the links required by chaining take significant extra storage compared to open addressing, whereas the opposite is true when records are large. When record size is moderate, the extra storage for links using separate chaining must be weighed against the reduced usage ratio that could be obtained if this storage were used in open addressing.