A hash table differs from the array that provides storage for it because store and retrieve operations are specified by a *key* value rather than by an index and the number of legal keys is larger than the number of cells in the array. The operations of interest are STORE(*Key*) and RETRIEVE(*Key*). It is possible to simply store the key and retrieve a Boolean function that indicates the presence or lack of presence of the key in the table. This is the simple way to discuss hash tables, but it is more likely that the key is indeed just a key in a record containing a *Value* field or a pointer (or index) that addresses some bulk of information. In a word, hash tables are apt to be *exogenous*--directories to data stored external to them.

The function *HF*_{1} of section 2.5 simply converts letters to integers and sums them. It exhibits two of the properties needed to form a hash table:

*HF _{1} *transforms keys into integers

*HF*_{1} also exhibits an unhappy consequence of compression by hash functions:

HF_{1}(BROTH) =HF_{1}(OLIVE)

Compression causes *collisions*.

The address functions of section 2.3 transform a domain value that is an *n*-tuple of the form

[i_{1},i_{2}, . . .,i]_{n}

into a range value that is a single index in a one-dimensional table. The correspondence is one-to-one, and so there is no compression; but compression can be introduced much as it was for *HF*_{1}. For example, a function *HF*_{2} can be defined by

HF_{2}([i_{1},i_{2}, . . .,i_{n}]) = (i_{1}+ +i_{n}) MOD 25

If a key of any kind can be located in an *n*-dimensional table at

t= [i_{1},i_{2}, . . .,i_{n}]

then *HF*_{2}(*t*) provides the same service as for *any n*-dimensional table of keys as *HF*_{1} does for its specific data space.

Both *HF*_{1} and *HF*_{2} fail to define their range as a hash table structure by themselves because compression allows *collisions*: two elements of the domain can have identical range values. In any table structure, RETRIEVE must be the inverse of STORE, and so collisions must be resolved. With this in mind, the desired HASH TABLE structure consists of:

a domain

a range

a hash function h that exhibits compression

a resolution of the collisions caused by h

operations CREATE, STORE, and RETRIEVE

The domain is normally a given in a general sense and may be enormous. For example, it might be identifiers of up to 30 characters in length. The subset of the domain encountered in a specific application tends to be smaller but not well defined, and that makes a compressive system very attractive. CREATE requires no special attention. The range is a design decision that takes into account much of this section.

Suppose that a hash function is available that distributes the domain keys over the indices 0,1, . . . ,24 in a table *HT* with *uniform probability*. The first entry will not collide with any other. The probability that the second entry will not collide with a previous entry is 24/25. The probability that the third, fourth, . . . entries will not collide when entered is 23/25, 22/25, . . . (directly proportional to the number of free spaces). The probability that there will be no collisions for seven entries in a table of 25 cells is:

Hence the probability that there *will* be a collision is near 60 percent even though the table is less than one-third full. The cause is not really the small size of the table. (Figure out how likely a birthday collision is in a room containing 25 people.)

The ratio of *n/m* for *n* items in a table of size *m* is called the *load factor*, . In this example, = 7/25 = 0.28. Clearly, the probability of a collision increases with .

The probability that a table location will be accessed is affected by two things:

the selection of elements from the potential set (the domain)

the transformation of the selected elements by the hash function into range indices

h(z)= INT(m(zMOD1))

To use the function *h* as a hash function first requires that the keys be turned into an integer in some efficient way. Keys are stored as binary strings in memory, and a binary string can be decoded as an integer (or sequence of integers) or converted with a utility function (such as *ORD* in Pascal). For the sake of illustration, consider the following simple approach applied to the ingredients of the recipe in section 2.5:

**1.** Ingredients are considered to be 12 characters in length, left-adjusted and padded with blank spaces.

**2.** Letters have as their index a two-digit value: *A* = 01, *B* = 02, . . ., *Z* = 26. Blanks have value 00. These values are found by a table lookup, which is much faster than calculation of them.

**3.** An ingredient is six four-digit integers. For example, BELLPEPPER is 0205, 1212, 1605, 1616, 0518, 0000.

**4**. The six integers for an ingredient are summed to form *z*. In functional terms, *q*(BELLPEPPER) = 5156 = *z*.

The calculation of *q* is much faster than it might appear to be. Let *s*[1 . . 12] be the string of characters of an ingredient. Then a calculation of *z* requires only one multiplication:

zs[1] +s[3] +s[5] +s[7] +s[9] +s[11]

zzX 100 +s[2] +s[4] +s[6] +s[8] +s[10] +s[12]

It is possible for *z* to be as large as 6 X 2626 = 15,756, but taking the sum *MOD* 10,000 restricts it to four digits (and more than that are unlikely anyway).

Now *h*(*z*) for BELLPEPPER, with *m* = 17 is:

h(z)=INT(m(zMOD1))

=INT(17(5156MOD1))

=INT(17(3186.5827MOD1))

=INT(17(0.5827))

=INT(9.9059)

= 9

For contrast, a second stage of multiplication by the prime base *m* can be introduced:

h_{2}(z) =INT(m(m(zMOD1)MOD1))

=INT(17(9.9059MOD1))

= 15

Both *h* and *h _{2}* can be easily determined with a hand calculator or a small program as a check on how well they work on a particular data set. Multiplication by 100 can be replaced with multiplication by a power of 2 and hence--in assembly languages--by shifts. There are a large number of possible variations of this scheme.

The results, for the 13 ingredients (including BROTH) and prime bases 17, 31, and 47, are shown in Table B.1 .

m= 17m= 31m= 47

z h h_{2 }h h_{2 }h h_{2}

-------------------------------------------

BEEF 0711 7 8 13 2 19 39

BELLPEPPER 5156 9 15 18 1 27 18

BLACKPEPPER 5342 9 2 16 20 25 11

BROTH 2538 9 11 17 20 26 37

CARROT 3639 0 7 0 24 1 8

CUMIN 3030 10 15 19 28 30 9

DILLWEED 2330 0 5 0 18 0 41

MUSHROOM 6557 7 10 13 27 21 3

OLIVE 2934 5 4 9 20 14 29

ONION 2443 14 9 26 17 40 12

POTATO 5631 2 9 4 18 6 46

SALT 3121 15 0 27 12 41 25

TOMATOPASTE 9352 14 8 26 13 40 4

*Problem **PB.1 is appropriate at this point.*

The unavoidable conclusion is that any practical use of hash tables produces collisions. One way to implement collision resolution is a technique called *linear probing.*

Suppose a procedure *Hash* calculates an index *k* in a hash table *HT*[*0 . . m *- 1] for *Key*. If the table is empty, STORE(*Key*) would simply be *HT*[*k*]* ** Key.* If it is not, *HT*[*k*] may already contain a key, and a collision occurs. The first task of a STORE operation is thus to determine if *HT*[*k*] is empty. For that purpose *HT* might be initialized with a value that cannot be a key or associated with a flag array. For the recipe example and most identifiers, *HT* may be initialized to null, blank, or empty strings.

functionProbe(Start,s) {O(m)

iStart

repeat

ifHT[i] =s

then returniendif

i(i + l)MODm

untili=Start

return-i

end{Probe

procedureStoreHT(Key) {O(m)

kHash(Key)

dexProbe(k,Blank)

ifdex<0

then{deal with a full table

elseHT[dex]Key

endif

end{StoreHT

functionRetrieveHT(Key) {O(m)

kHash(Key)

dexProbe(k,Key)

ifdex<0

then{report not there

else returndex

endif

end{RetrieveHT

m= 17

HT[7] BEEF 0 CARROT

HT[9] BELLPEPPER 1 DILLWEED

HT[9]collision2 POTATO

HT[10] BLACKPEPPER

HT[9]collision

HT[10]collision5 OLIVE

HT[11] BROTH

HT[0] CARROT 7 BEEF

HT[10]collision8 MUSHROOM

HT[11]collision9 BELLPEPPER

HT[12] CUMIN 10 BLACKPEPPER

HT[0]collision11 BROTH

HT[1] DILLWEED 12 CUMIN

HT[7]collision

HT[8] MUSHROOM 14 ONION

HT[5] OLIVE 15 SALT

HT[14] ONION 16 TOMATOPASTE

HT[2] POTATO

HT[15] SALT

HT[14]collision

HT[15]collision

HT[16] TOMATOPASTE

*Problem **PB.2 is appropriate at this point.*

Two things about the sequence of *HT* entries are worth noting:

0Hash(z)< m

0HashAgain(z)< m

Hash(z)HashAgain(z)

2051 + 2121 + 6051 + 6160 + 5180 + 0000 = 21,563

Or simply use 1 - in place of , or use *h _{2}*.

With *q, Hash = h, HashAgain = h2, m* = 31, and alphabetical entry order:

HT[13] BEEFm= 31

HT[18] BELLPEPPER 0 CARROT

HT[16] BLACKPEPPER

HT[17] BROTH

HT[0] CARROT

HT[19] CUMIN 4 POTAT0

HT[0]collision5 DILLWEED

HT[18]collision

HT[5] DILLWEED ( = 18)

HT[13]collision8 TOMATOPASTE

HT[9] MUSHROOM ( = 27) 9 MUSHROOM

HT[9]collision

HT[29] OLIVE ( = 20)

HT[26] ONION

HT[4] POTAT0 13 BEEF

HT[27] SALT

HT[26]collision

HT[8] TOMATOPASTE ( = 13) 16 BLACKPEPPER

17 BROTH

18 BELLPEPPER

19 CUMIN

26 ONION

27 SALT

29 OLIVE

*Problem **PB.3 is appropriate at this point.*

An alternate implementation of hash tables that permits deletion is based on the linked lists described in Chapter 3, and it is to be found in section L.

Problems not immediate, but requiring no major extensions of the text material

**PB.1 **Construct a table of hashed indices like the one of section B. 1 for the calendar months.

**PB.2 **Construct the hash table for the calendar months entered in calendar order, into a hash table of prime base *m* = 17.

**PB.3 **Construct the double-hashing table for the calendar months in calendar order with *Hash = h*, *HashAgain* = *h*_{2}, and *m* = 31.

Programs for trying it yourself

**PGB.1 **Write a program that accepts identifiers of up to 12 characters in length, converts them to an integer with *q*, and stores them in a hash table *HT* with the use of *Hash = h,* *HashAgain = h*_{2}, and *m* = 47. The program should report the number of collisions, the average number of tests by *Probe,* and the loading factor.