Resizable Data Structures

Dr. Dobb's Journal January 1998

By John Boyer

John is a software development manager at UWI Unisoft Wares Inc. (http://www .uwi.com/) and is also a part-time theoretical computer science graduate student at the University of Victoria. He can be contacted at jboyer@ uwi.com.

Most textbooks discuss basic data structures such as heaps and hash tables by starting with a fixed-size array. Although they are fast, fixed-size arrays are rarely sufficient in practice -- modern software has to handle whatever data is thrown at it, and presized data structures aren't flexible enough to meet such requirements. Conversely, linked structures are flexible, but are more complex, and memory fragmentation can degrade performance. This month, John Boyer explains an easy, fast way to resize arrays and the data structures that use them.

-- Tim Kientzle

Edward Sitarski described how Hashed Array Trees (HATs), coupled with some sophisticated programming techniques, can be used for growing and shrinking arrays in his September 1996 "Algorithm Alley" (available electronically; see "Resource Center, page 3). In this article, I'll show how to do the same thing using techniques that are so simple you can immediately incorporate them into your programming repertoire. Furthermore, you can apply these techniques to existing programs without changing data structures or accidentally introducing new bugs.

I developed these techniques while adding a resizable hash table to the Universal Forms Description Language (http:// www.uwi.com/). The source code for a resizable hash table that uses resizable arrays for its collision lists is available electronically. You can also get code for a resizing binary heap from my January 1997 "Algorithm Alley" on Fibonacci heaps (also available electronically).

Building Some Suspense

As a hash table becomes overcrowded, the benefit of hashing diminishes. But resizing the table requires reinserting all of the elements, since the final operation on a hash key is a modulus by the table size. Without some tricks, your program will suffer every time you have to enlarge or reduce the table.

I recently wrote a lexical analyzer as part of a new parser generator. This analyzer had to be able to read quoted strings of any length. Since the syntax of many languages does not allow the size of a quoted string literal to be specified, it is necessary to grow the string dynamically until its end is found.

As part of my January 1997 Fibonacci heap article, I built a binary heap for comparison with the F-heap. Since F-heaps can grow arbitrarily large, I built the binary heap to reallocate its array when it was full. Binary heaps are assumed to have an O(log n) insert, but if you look closely, this is based on the assumption that space for the new element can be procured in O(log n) time or less. Most textbooks use a fixed array for storage, so there is no overhead in procuring space for new elements. However, it occurred to me that my expansion code would cause O(n) behavior per insertion.

The Extensible Array Problem

These problems can all be solved with an array that can grow dynamically. The obvious approach is to add k elements each time you need more space. Unfortunately, the C function realloc() may copy the entire array to the new storage. If you use C++ new/delete operators, then you must manually perform a reallocation, which guarantees that your array contents must be moved on expansion. So, as your array grows up to some arbitrary size n, you will be moving your array elements a total of k+2k+3k+4k+...+n-k times. The result is O(n²) overall, or O(n) per element.

Consider the aforementioned examples. The lexical analyzer would take O(n²) time to read an n character string, which is unacceptable for parsers (which must achieve O(n) performance to be viable). The binary heap insert was slowed from O(log n) down to O(n), and the hash table insert goes from O(1) down to a pitiful O(n). You can reduce the hidden constant by increasing the expansion size k, but the overall result is still O(n) per array element.

The Solution

The solution is to double the size of the array whenever it must be expanded. Listing One is lex_growbuffer(), which is typical of resizable arrays. The buffer size is doubled rather than being increased by a constant value. Also, this example reallocates the array down to the actual string size once the string terminator is found.

In C++, you have no choice but to copy the array to change its size. As the array size doubles 2,4,8,...,n, you must copy arrays of sizes 1,2,4,...n/2. This is a total of n-1 elements being copied. You expand to size n when you insert the element (n/2+1). Thus, the amortized cost of copies during reallocation never exceeds two time units, which is constant amortized time per element.

At the worst moments, the array will be twice as big as necessary, but this trade-off is more than accounted for by the increased speed of reallocation, and the utter simplicity of using this technique in both new and existing code.

What About Shrinking?

In some applications, it may be necessary to have a delete operation for the resizable array. The binary heap is a good example. After any random sequence of insert and delete operations, it would be helpful if the size of the heap's array were bounded to be a linear factor of the number of elements in the heap. This would ensure that a local "spike" of insert operations couldn't cause the heap to continue occupying unneeded memory.

One approach is to shrink the array by half precisely when it becomes half full. In general, this is the right idea, but it has a weakness. Suppose the random sequence happens to be 1024 inserts followed by an alternating sequence of insert and delete operations. The first insert in the alternating sequence will cause the heap array to grow to 2048 elements in size, since this is the smallest power of two that is greater than 1025. The first delete will reduce the table size to 1024 elements, which would cause an array resize to 1024. The next insert would grow the array again, and the next delete would shrink the array again. Since each operation is resizing the array, each operation will have O(n) performance.

Although most random sequences of insert and delete operations will not cause this problem, I want to guarantee that the worst case will be amortized constant time per operation. To do this, I shrink the array to half size when it is one-quarter full. The worst case sequence would then only alternate the array size between 2^k and 2^(k+1) by using inserts and deletes to alternate the number of elements between 2^(k-1) and 2^k+1. When you reach 2^k+1 elements, you move 2^k elements to a new array before inserting the newest element. You then need a sequence of at least 2^(k-1)+1 deletes to cause an array shrink, which will move the 2^(k-1) remaining elements to a smaller array. There must be 2^(k-1)+1 more inserts to get back up to size 2^k+1 and cause an array growth. Thus, each sequence of 2^(k-1)+1 deletes followed by 2^(k-1)+1 inserts will cause shrinking and growth. This resizing results in 2^(k-1) moves after the deletion phase and 2^k moves after the insert. This is clearly still constant time per operation since you do fewer than 2^k+2^k-1 element moves for each 2^k+2 operations.

Thus, even with deletes, you can maintain constant time per operation and the data structure will never use more than four times the required memory -- even if there are large spikes of insert operations in the sequence.

The Hash Table

I was introduced to resizable hash tables in Introduction to Algorithms, by T. Cormen, C. Leiserson, and R. Rivest (MIT Press, 1990). It was only while building the binary heap included in my Fibonacci heap article that I realized the applicability of the doubling technique to all extensible array problems.

The key to resizable hash tables is simply creating a new array as described earlier, copying the existing elements to it using the hash function for insertion, then destroying the old array. Rebuilding the entire table may seem like brute force, but as far as amortized time complexity goes, there is no difference because copying an element is constant time and hash-table inserting is "expected" constant time. Using a hash function might take longer than a memory copy, but this is simply a larger constant.

realloc() is not appropriate for resizing a hash table because two differently sized tables will store many elements in completely different places. Thus, you are forced to manually move objects from the old array into the new array.

Java and the STL

The Java hash-table implementation uses array doubling to increase the hash table size, but it doesn't shrink the hash table when it becomes smaller. The C++ Standard Template Library (STL) vector class, which is used as the basis for the priority queue (binary heap), also uses array doubling to increase the size of the vector, but a request to reserve an amount of memory smaller than the current capacity of the container is ignored. This should be changed because data structures (such as heaps used in long-running programs like servers) must be able to reduce their memory allocation after a spike of data.

The current STL hash-table implementation does not automatically increase or decrease the number of buckets in the hash table (each bucket grows to accept collisions, which increases the load on the hash table). There is a resize() function that will rebuild the hash table with the number of buckets given as a parameter. So, if you are using the STL hash table, then use the resize() function only to double or halve the number of buckets when the hash table is full or one-quarter full, respectively. This will ensure an expected O(1) collision list size.

Good Hashing Functions

When using this technique of varying the number of buckets in the hash table to keep the load down, it is important to have a good hashing function that achieves high element dispersion regardless of table size. The efficiency of a hash function is measured by dividing the number of buckets by the number of elements in the hash table. A perfect hash has 100 percent efficiency because each element has its own bucket.

Listing Two is a simple hash function that performs calculations using the binary codes of the characters in a key. The record's position in the table is calculated based only on the key value itself, not how many elements are in the table. This is why hash-table inserts, deletes, and searches are rated constant time.

HashJB alternates addition and multiplication. Addition is much faster, but multiplication causes the hash value to grow quickly (so that it can reach virtually any position in the table regardless of how big it gets). Also, notice that I perform the modulus Key%N using Key&(N-1). Because the array size doubles, N is always a power of two, so the modulus can be replaced with the much faster subtract and bitwise-and.

Truncating the resulting key value is a problem, since it does throw away information in the upper bits of the hash value. I later added two lines that mix the upper bits into the lower bits, which reduced the number of collisions in my tests.

I also investigated other hash functions. When I tried Peter J. Weinberger's hash function (see "Hashing Rehashed," by Andrew Binstock, DDJ, April 1996, available electronically), my test program slowed noticeably. Further investigation revealed that hash tables generated with HashPJW() contained many more collisions than those generated with HashJB(). This was slowing down the UFDL form-load time because, after parsing the form and building the hash table, the compute engine makes extensive use of the hash table. Why was HashPJW() hashing so poorly? Like the first draft of HashJB(), it doesn't seem to account for the information lost during the modulus truncation of the hash value. When I applied the code to mix the higher bits into the lower bits, the efficiency of HashPJW() increased dramatically -- nearly as high as HashJB() on my sample forms.

More recently, Bob Jenkins presented a hashing function ("Algorithm Alley," DDJ, September 1997, also available electronically) that performs a complex mixing operation using 12 bytes at a time from the input key. By design, the bytes of the key have some affect on all regions of the hash value. Thus, the hash function suffers little or no information loss when the hash value is truncated to the table size. As Table 1 shows, the efficiency of Jenkins' hash function was far greater than the original HashPJW(), and even slightly better than HashJB() and the new HashPJW(). The run time using HashJB() is equivalent to HashJenkins() because the former uses slightly less CPU time to compute hash values.

Conclusion

In over half a dozen projects, I've found that the array-doubling/halving technique was responsible for an order of magnitude performance increase when compared to reallocation by a constant size. Furthermore, the idea was typically applied to large existing projects in minutes, rather than the extensive redesign and recoding efforts that might be required by new data structures. The technique does use extra memory; sometimes this can be eliminated (as with the lexical analyzer that downsizes strings). But even when the extra memory usage cannot be eliminated (as with a hash table), the trade-off is usually more than accounted for by the need to increase execution speed and the utter simplicity of using this technique.

DDJ

Listing One

char *lex_growbuffer(char *Buffer, long *pBufferSize){
char    *TempP;
long    NewSize;
    if (*pBufferSize==0) NewSize = 16;
    else NewSize = *pBufferSize * 2;
    TempP = (char *) realloc(Buffer, NewSize+1); 
    if (TempP != NULL) *pBufferSize = NewSize;
    return TempP;
}
  ... /* Snippet to add a character to the lex buffer */

   if (BufferLen == BufferSize)
        Buffer = lex_growbuffer(Buffer, &BufferSize);
    if (Buffer == NULL) return NULL;
    Buffer[BufferLen++] = NewChar;
    Buffer[BufferLen] = '\0';
    return Buffer;
  ... 
/* After string is read, reduce to exact size */
    Buffer = (char *) realloc(Buffer, BufferLen+1);

Back to Article

Listing Two

typedef unsigned long ulong;typedef unsigned char uchar;
ulong HashJB(uchar *KeyStr, ulong N)
{
ulong Hash=0, I=0;
uchar Ch;
    for (;Ch=*KeyStr++; I^=1)
        if (I)
            Hash *= Ch;
        else    Hash += Ch;
    Hash += ((Hash&0xFFFF0000)>>16);
    Hash += ((Hash&0x0000FF00)>>8);
    return Hash & (N-1); 
}

Back to Article

Back to Table of Contents