Data Structures and Algorithms: CHAPTER 4: Basic Operations on Sets

Notice that we have chosen to pass only the higher rung number to EXCHANGE, while the other two operations take a name as argument.

We might, for example, choose an array LADDER, where LADDER[i] is the name of the person on rung i. If we also keep a count of the number of players, adding a player to the first unoccupied rung can be done in some small constant number of steps.

EXCHANGE is also easy, as we simply swap two elements of the array. However, CHALLENGE(name) requires that we examine the entire array for the name, which takes O(n) time, if n is the number of players on the ladder.

On the other hand, we might consider a hash table to represent the mapping from names to rungs. On the assumption that we can keep the number of buckets roughly proportional to the number of players, ADD takes O(1) time, on the average. Challenging takes O(1) time on the average to look up the given name, but O(n) time to find the name on the next lower-numbered rung, since the entire hash table may have to be searched. Exchanging requires O(n) time to find the players on rungs i and i - 1.

Suppose, however, that we combine the two structures. The cells of the hash table will contain pairs consisting of a name and a rung, while the array will have in LADDER[i] a pointer to the cell for the player on rung i as suggested in Fig. 4.27.

We can add a name by inserting into the hash table in O(1) time on the average, and also placing a pointer to the newly created cell into the array LADDER at a position marked by the cursor nextrung in Fig. 4.27. To challenge, we look up the name in the hash table, taking O(1) time on the average, get the rung i for the given player, and follow the pointer in LADDER[i-1] to the cell of the hash table for the player to be challenged. Consulting LADDER[i - 1] takes constant time in the worst case, and the lookup in the hash table takes O(1) time on the average, so CHALLENGE is O(1) in the average case.

EXCHANGE(i) takes O(1) time to find the cells for the players on rungs i and i-l, swap the rung numbers in those cells, and swap the pointers to the two cells in LADDER. Thus EXCHANGE requires constant time in even the worst case.

Exercises

4.1	If A = {1, 2, 3} and B = {3, 4, 5}, what are the results of UNION(A,B, C) INTERSECTION(A, B, C) DIFFERENCE(A, B, C) MEMBER(1, A) INSERT(1, A) DELETE(1, A) MIN(A)?
*4.2	Write a procedure in terms of the basic set operations to print all the elements of a (finite) set. You may assume that a procedure to print an object of the element type is available. You must not destroy the set you are printing. What data structures would be most suitable for implementing sets in this case?
4.3	The bit-vector implementation of sets can be used whenever the "universal set" can be translated into the integers 1 through N. Describe how this translation would be done if the universal set were the integers 0, 1, . . . , 99 the integers n through m for any n ?/FONT> m the integers n, n +2, n +4, . . . , n + 2k for any n and k the characters 'a', 'b', . . . , 'z' arrays of two characters, each chosen from 'a' through 'z'.
4.4	Write MAKENULL, UNION, INTERSECTION, MEMBER, MIN, INSERT, and DELETE procedures for sets represented by lists using the abstract operations for the sorted list ADT. Note that Fig. 4.5 is a procedure for INTERSECTION using a specific implementation of the list ADT.
4.5	Repeat Exercise 4.4 for the following set implementations: open hash table (use abstract list operations within buckets). closed hash table with linear resolution of collisions. unsorted list (use abstract list operations). fixed length array and pointer to the last position used.
4.6	For each of the operations and each of the implementations in Exercises 4.4 and 4.5, give the order of magnitude for the running time of the operations on sets of size n.
4.7	Suppose we are hashing integers with a 7-bucket hash table using the hash function h(i) = i mod 7. Show the resulting open hash table if the perfect cubes 1, 8, 27, 64, 125, 216, 343 are inserted. Repeat part (a) using a closed hash table with linear resolution of collisions.
4.8	Suppose we are using a closed hash table with 5 buckets and the hashing function h(i) = i mod 5. Show the hash table that results with linear resolution of collisions if the sequence 23, 48, 35, 4, 10 is inserted into an initially empty table.
4.9	Implement the mapping ADT operations using open and closed hash tables.
4.10	To improve the speed of operations, we may wish to replace an open hash table with B₁ buckets holding many more than B₁ elements by another hash table with B₂ buckets. Write a procedure to construct the new hash table from the old, using the list ADT operations to process each bucket.
4.11	In Section 4.8 we discussed "random" hash functions, where h_i(x), the bucket to be tried after i collisions, is (h (x) + d_i) mod B for some sequence d₁, d₂, . . . ,d_B-1. We also suggested that one way to compute a suitable sequence d ₁, d₂, . . . , d_B-1 was to pick a constant k, an arbitrary d₁ > 0, and let where i > 1, B is a power of 2, and ?/FONT> stands for the bitwise modulo 2 sum. If B = 16, find those values of k for which the sequence d₁, d₂, . . . , d₁₅ includes all the integers 1, 2, . . . , 15.
4.12	Show the partially ordered tree that results if the integers 5, 6, 4, 9, 3, 1, 7 are inserted into an empty tree. What is the result of three successive DELETEMIN operations on the tree from (a)?
4.13	Suppose we represent the set of courses by a linked list a hash table a binary search tree Modify the declarations in Fig. 4.26 for each of these structures.
4.14	Modify the data structure of Fig. 4.26 to give each enrollment record a direct pointer to its student and course owners. Rewrite the procedure printstudents of Fig. 4.26 to take advantage of this structure.
4.15	Assuming 20,000 students, 1,000 courses, and each student in an average of three courses, compare the data structure of Fig. 4.26 with its modification suggested in Exercise 4.14 as to the amount of space required the average time to execute printstudents the average time to execute the analogous procedure that prints the courses taken by a given student.
4.16	Assuming the data structure of Fig. 4.26, given course record c and student record s, write procedures to insert and to delete the fact that s is taking c.
4.17	What, if anything, is the difference between the data structure of Exercise 4.14 and the structure in which sets C_s and S_c are represented by lists of pointers to course and student records, respectively?
4.18	Employees of a certain company are represented in the company data base by their name (assumed unique), employee number, and social security number. Suggest a data structure that lets us, given one representation of an employee, find the other two representations of the same individual. How fast, on the average, can you make each such operation?

Bibliographic Notes

Knuth [1973] is a good source for additional information on hashing. Hashing was developed in the mid-to-late 1950's, and Peterson [1957] is a fundamental early paper on the subject. Morris [1968] and Maurer and Lewis [1975] are good surveys of hashing.

The multilist is a central data structure of the network-based database systems proposed in DBTG [1971]. Ullman [1982] provides additional information on database applications of structures such as these.

The heap implementation of partially ordered trees is based on an idea in Williams [1964]. Priority queues are discussed further in Knuth [1973].

Reingold [1972] discusses the computational complexity of basic set operations. Set-based data flow analysis techniques are treated in detail in Cocke and Allen [1976] and Aho and Ullman [1977].

†Sometimes the term multiset is used for a "set with repetitions." That is, the multiset {1, 4, 1} has 1 (twice) and 4 (once) as members. A multiset is not a list any more than a set is, and thus this multiset could also have been written {4, 1, 1} or {1, 1, 4}.

†Changes to the data structure can be made that will speed up open hashing and allow closed hashing to handle larger sets. We shall describe these techniques after giving the basic methods.

†If the type of members of the dictionary does not suggest an appropriate value for empty, let each bucket have an extra one-bit field to tell whether or not it is empty.

†Note that A2 and A20 do not necessarily hash to the same bucket, but A23 and A41, for example, must do so.

†If strings can be extremely long, this addition will have to be performed modulo some constant c. For example, c could be one more than the largest integer obtainable from a single block.

†If this were a real database system, the array would be kept in secondary storage. However, this data structure would waste a great deal of space.

†Note that there are probably many more enrollment records than course or student records, so shrinking enrollment records shrinks the total space requirement almost as much.

‡It would in practice be useful to place fields like grade or status (credit or audit) in enrollment records, but our original problem statement does not require this.

†We must have some way of identifying the type of records; we shall discuss a way to do this momentarily.

†We assume this and is a conditional and.

Table of Contents Go to Chapter 5

Basic Operations on Sets

4.1 Introduction to Sets

Set Notation

Abstract Data Types Based on Sets

4.2 An ADT with Union, Intersection, and Difference

4.3 A Bit-Vector Implementation of Sets

4.4 A Linked-List Implementation of Sets

4.5 The Dictionary

4.6 Simple Dictionary Implementations

4.7 The Hash Table Data Structure

Open Hashing

Closed Hashing

4.8 Estimating the Efficiency of Hash Functions

Analysis of Closed Hashing

"Random" Strategies for Collision Resolution

Restructuring Hash Tables

4.9 Implementation of the Mapping ADT

4.10 Priority Queues

4.11 Implementations of Priority Queues

Partially Ordered Tree Implementation of Priority Queues

An Array Implementation of Partially Ordered Trees

4.12 Some Complex Set Structures

Many-Many Relationships and the Multilist Structure

Multilist Structures

Dual Data Structures for Efficiency

Exercises

Bibliographic Notes