Data Structures Using C: 10.7.3 Indexed Sequential Access

10.7.3 Indexed Sequential Access

While the hash table solution may yield fast average access times, there is no guarantee for the worst case, nor any convenient way to access records in sorted order by key value. To achieve the ability to access records in random order or in sorted order, as necessary, we must incorporate an old idea familiar to all. Suppose a visitor from outer space wanted to find the location of a particular street, say Bond Street in London. The visitor might go to a world atlas and determine first where London's country, England, is situated. The intergalactic traveler might then find a map of the country, and locate London. Finally, a city map of London gives the creature the exact location of the street. On a more mundane level, suppose a student wants to locate a particular book in the university library. The student would go to the file cabinet of authors and find the drawer that contains authors' last names with the appropriate beginning initial. She would then search for the card with the correct author and title, which gives a library call number. Next she would consult a library directory to learn the floor location of those call numbers. Finally, she would go to that location and search for the book with that particular call number. The critical concept in such searches is the availability of one or more directories, to be used to determine quickly the location of the particular item desired.

The magnetic disk supports the use of such a directory to focus the search. It can be used to find a particular cylinder and surface relatively quickly, eliminating the need for traversing sequentially through intervening records. Until recently, when B-trees (see next section) were introduced, the most popular method of creating and maintaining such a directory used the natural sequential storage of records, so that they appear in sorted order by key value. This is known as the indexed sequential access method (ISAM); it may be implemented in various ways, typically using three or four levels of directories. A first directory, known as the master index, may be searched to determine the cylinder on which a particular search key appears. If a disk pack has two hundred cylinders (tracks), then the master index has two hundred entries, one for each cylinder. The entry for the ith track gives the highest key value appearing on that track. A search of the master index then readily determines the cylinder on which a search key may reside. That cylinder is accessed; it contains a cylinder index directory. If a disk pack has ten surfaces, the cylinder index has ten entries, one for each surface. The entry for the ith surface gives the highest key value appearing on that surface. A search of the cylinder index determines the surface on which a search key may reside. Finally, the track of the cylinder on that surface may be searched sequentially for the desired key, resulting in a successful or unsuccessful search. With such an implementation, unused storage is initially left in blocks so that the insertion of new records may be more easily accommodated. If a block is full, records can be shifted to the right to other blocks, so that the new record is inserted in proper order.

Overflow areas must also be available and maintained on disk, in case the track storage is exhausted due to insertions. Tracks and cylinders are set aside for this purpose. Insertions may require the master or cylinder index to be updated. Similarly, deletions of records, leaving vacated storage to be reclaimed, may require updating of directories and shifting of records. Deletions may also be treated simply by marking a record as deleted, but this may eventually lead to the need for reclamation of this storage, degrading the efficiency of the implementation. Because of overflow from insertions, performance may degrade, necessitating reorganization, allocation of more disk storage, and reconstitution of the file.

The overflow areas may be maintained as lists of records in sorted order, linked by pointers. When the proper cylinder index is searched, the pointer to be followed may point to an overflow area, in which case the search amounts to a traversal of a list of records. Otherwise, the record is said to reside in the prime area, and the search amounts to a sequential traversal of records on a track. The prime area consists of reserved tracks on cylinders designated as prime cylinders.

For a given number of records to be stored, there is no guarantee that overflow will not degrade the system, unless a high price is paid by dedicating large amounts of track storage to reduce this possibility.

Each cylinder index actually has additional entries giving the highest key value in its overflow area for each surface and a pointer to the first record of the overflow list for the surface. Conceptually, we may visualize the organization of such implementations as shown in Figure 10.10.