Next Chapter Return to Table of Contents Previous Chapter

CHAPTER 6: FILE ORGANIZATIONS FOR OPTICAL DISKS

Daniel Alexander Ford and Stavros Christodoulakis

Department of Computer Science, University of Waterloo, Waterloo, Ontario, Canada

Abstract

Optical disk technology is a new and promising secondary storage technology. Optical disks have immense capacities and very fast retrieval performance; they are also rugged and have very long storage lifetimes. These characteristics are making them a serious threat to the traditional dominance of magnetic disks. It is important to understand and study this young and significant technology and to design retrieval structures that best utilize its characteristics.

This chapter first presents a tutorial on optical disk technology and a discussion of important technical issues of file system design for optical disks, many of which are quite subtle. It then proceeds to discuss six file systems that have been developed for optical disks, including the Write-Once B-Tree of Easton (1986), the Time Split B-Tree of Lomet and Salzberg (1989), the Compact Disk File System of Garfinkel (1986), the Optical File Cabinet of Gait (1988), Buffered Hashing of Christodoulakis and Ford (1989a), and BIM trees of Christodoulakis and Ford (1989b).

6.1 INTRODUCTION

In this section we discuss file structures for optical disks. We first present an overview of optical disk technology, explaining where it came from and how it works. We then discuss some technical issues affecting the implementation of file structures on some common forms of optical disk technology. Later, we discuss in detail a variety of different structures that have been developed.

6.2 OVERVIEW OF OPTICAL DISK TECHNOLOGY

The foundation for all current optical disk technology was formed in the late 1960s and early 1970s by early video disk research. In mid-1971, N. V. Philips began conducting experiments in recording video signals on a flat glass plate using a spiral track of optically detectable depressions. This system was refined until it could store 30 minutes of color video and sound, and was called Video Long Play (VLP); it was first demonstrated on September 5, 1972. At about this time, other companies also began research efforts, including M.C.A., 3-M, Thomson-CSF, KODAK and Harris, among others.

The development of small inexpensive semiconductor laser diodes in 1975 stimulated development further, and the use of plastic disk platters which were cheaper and more easily replicated than glass platters was pioneered. Eventually, the technological base for the development of the optical disks in use today was emerged from the research efforts in the fields of optics, disk material technology, tracking and focus control servo system and lasers.

Features and benefits

Optical disks have many features and benefits that make them a useful storage medium. In general, their immense capacity, fast random access, and long storage life are major advantages. Other advantages are their portability and durability. Disks usually come encased in a protective cassette that can be easily carried. They are not subject to wear or head crashes as there is no physical contact between the disk and the access mechanism. The recording surface itself is not exposed, but rather, is protected by the plastic substrate that forms the disk platter. Optical disks, including erasable optical disks which employ a magneto-optic process for data registration, are also completely unaffected by magnetic fields.

The integrity of data stored on optical disks is also impressive. The process by which data is recorded on a Write-Once Read Many (WORM) disk surface causes an irreversible physical change. There is no chance of accidental erasure and overwrites are usually prevented by the drive or by driver software. Erasable optical disks cannot prevent accidental or malicious data destruction, but are still more durable then magnetic media. CD-ROM (Compact Disk Read Only Memory) disks are physically pressed out of plastic and so cannot be written to, or erased. Further, since there is no physical contact between an optical disk platter and the access mechanism, the disk is not subject to wear with use (the head of a Winchester type magnetic disk actually rides on the lubricated disk surface before it reaches flying speed). Small scratches and dust on the disk surface do not affect stored data as they are far enough away from the recording surface to be out of the optical system's focal plane. Data is also not subject to destruction from external magnetic fields and does not require periodic rewriting as is the case for magnetic media. The expected lifetime of an optical disk is not really known for certain (they have not been around long enough), but accelerated aging tests place it at least ten years and possibly as high as thirty. In contrast, the lifetime of magnetic tape is between two years to ten years, depending how it is handled and stored.

Optical disks also have the advantage of being removable. WORM and erasable disks are encased in a sturdy protective cassette when not in the drive and can be easily and safely transported, as can CD-ROM disks, which are designed to endure unrestrained consumer use.

There are few disadvantages inherent to optical disk technology. In fact, some characteristics that might be considered undesirable are actually advantages in certain applications. The unerasability of WORM type optical disks, for instance, make them the storage medium of choice for archival type applications. Some records such as transaction logs, banking records, and school transcripts are never legitimately altered, so some systems that maintain these types of records use WORM disks to ensure this.

Technical description

Optical disks are typically available in two different access configurations, either mounted in a single disk drive or as a set of disks in a "jukebox." In the single drive arrangement, disks are mounted manually be inserting the disk (and its protective cassette for WORM and erasable) into a slot on the front of the drive. For WORM disks, the platter is actually released and the cassette withdrawn; to remove the disk, the process is reversed. In the jukebox arrangement, the disks are stored in groups of several tens or even hundreds, representing many gigabytes of storage. They are selected and mounted mechanically in a manner similar to that used in audio disk jukeboxes. The time needed to switch a disk is about 5 to 20 seconds. For both types of configurations, the drive, disks, associated access mechanisms, and optics are the same.

When mounted, the disk is clamped to a spindle that protrudes through a large hole in its center. The access mechanism is usually a sled that slides along a guide path beneath the rotating disk platter. The sled contains the laser and associated optics. Note that only one side of the disk surface can be accessed when the disk is mounted. To access the other side, the disk must be physically removed, turned over, and remounted. CD-ROM disks are single sided and cannot be accessed if improperly mounted.

Electrical interfacing to the disk drives is usually done via standard interfaces such as the Small Computer Serial Interface (SCSI), the Enhanced Small Device Interface (ESDI), or more commonly for small personal computers through a direct bus interface.

There are three common sizes and storage capacities of optical disks available. The 305.5 millimeter (12 inch) platter is the most widely used for WORM disks and has a capacity of roughly 1 gigabyte per disk side. WORM disks are also available in 203 millimeter (8 inch, 750 megabytes) and 130 millimeter (5.25 inch, 200 megabytes) sizes. Erasable disks are available in 130 millimeter (5.25 inch, 600 megabytes) and, depending on the manufacturer, 86, 87, 88, or 89 millimeter (3.5 inch, 80 megabytes) sizes. The CD-ROM disk platters are standardized at 130 millimeters. The exact capacity of a disk will depend on the recording format employed (discussed later); some formats will increase the values stated above.

The way in which a disk platter is constructed varies with the type of optical disk. WORM disks usually use a construction know as the "air sandwich." This fabrication technique joins two transparent 1.5 millimeter thick platter halves together leaving a small "clean room" air gap sandwiched between them where the recording surfaces reside. The two halves both support and protect the recording surfaces while allowing light from the laser to reach them. The final disk platter is about 3 to 4 millimeters thick. Erasable disks are fabricated in a similar manner.

CD-ROM disks are single sided and are essentially a smaller version of one of the platter halves used for WORM disks. Data is permanently registered on a CD-ROM disk when it is fabricated by a pressing process. After pressing, the disk is given a thin coating of aluminum to make it reflective and is then sealed.

Materials and registration techniques for WORM disks

A single-sided optical disk platter, which is one side of an "air sandwich," consists of a thin film of Tellurium alloy (10-50 nanometers) that forms the active recording surface, and a supporting plastic substrate, usually poly(vinyl chloride) or poly(methyl methacrylate).

The disk is formed using an injection molding process. This is done under clean room conditions as the high storage densities of optical disks makes them particularly sensitive to contaminants like dust. It is during the molding process that the disk is pregrooved and sometimes also preformatted with address and synchronization information. These steps are important as they define the positions of tracks and identify individual sectors on what would otherwise be a featureless surface.

Data is recorded on the disk by a series of optically detectable changes to its surface. Small changes in reflectivity called "pits" are thermally induced in the active layer through the application of energy from a laser. The unaltered spaces between pits are called "lands." It takes 50-100 nanoseconds of 10 milliwatts of incident power focused on a 1 micron diameter spot to form the pit. Spacing between adjacent tracks of pits and lands is 1.6 microns.

There are several methods by which the applied energy can form pits in the active layer. Of these, the techniques called Ablation and Vesicular are favored.

With the Ablation technique, surface tension causes the thin film to pull away from the spot heated by the laser, leaving a hole in the surface. Sometimes, a layer of aluminum is deposited beneath the tellurium film to act as a mirror to be exposed when the hole is formed. With the Vesicular technique, a bubble is formed in the recording surface when vapor resulting from the heating process is trapped under the film. The bubble forms a bump that tends to disperse the light hitting it (making the pit darker).

Materials and registration techniques for erasable disks

Erasable optical disks are physically much like WORM disks. The main difference between the two is in the coating used on the recording surface. There are two types of erasable optical disk: Phase-Change and Magneto-Optic.

Erasable optical disks that employ phase-change technology rely on coatings for the recording surface consisting of thin films of tellurium or selenium. These coatings have the ability to exist in two different optically detectable states, amorphous and crystalline, and will switch between the two when heated to two different temperatures. If a spot on the recording surface is heated to a low temperature with a laser with 8 milliwatts of power, the spot will crystallize. If it is heated to a higher temperature with a laser with 18 milliwatts of power, the spot will melt and when it cools will revitrify to the amorphous state.

To register data on the recording surface, the power of the laser scanning the disk is simply modulated between 8 and 18 milliwatts. If a crystallized spot is scanned with 8 milliwatts of power, it will remain crystallized, if it is scanned with 18 milliwatts of power it will switch to the amorphous state. Similarly, if an amorphous spot is scanned with l8 milliwatts of power, it will remain in the amorphous state, and if it is scanned with 8 milliwatts of power it will switch to the crystallized state.

Reading data from the disk is simply a matter of scanning the surface with a low power laser (1 milliwatt) and detecting the changes in reflectivity that exist on the recording surface.

Erasable optical disks that employ magneto-optic technology store data on the disk magnetically, but read and write it optically. The coating used on the recording surface is a rare-earth transition-metal alloy such as terbium iron cobalt, terbium iron, or gadolinium terbium iron. It has the property of allowing the polarity of its magnetization to be changed when it is heated to a certain temperature (150 C).

Recording data is a two-stage process requiring two passes of the disk head over a sector. The first pass serves to erase the contents of the sector. This is done by first placing a magnetic field with north-pole down in the vicinity of the spot upon which the laser focuses. As the sector is scanned, this spot will quickly heat to 150 C and then immediately cool. As it does, domains of north-pole down are recorded throughout the sector; a north-pole down domain represents a 0 bit. Once the sector has been erased by the first pass, data can be written by the second.

When the sector is scanned a second time, the applied magnetic field is reversed. By modulating the power of the laser, selected portions of the sector can be heated to the required temperature and have their magnetizations reversed to north-pole up; a north-pole up domain represents a 1 bit. The remaining unheated portions of the sector retain their north-pole down magnetization.

Recorded data is read from the disk in a single pass. Reading relies on a physical effect known as the Kerr magneto-optic effect, which was discovered by Kerr (1877) and which causes light passing through a magentic field to become elliptically polarized. This effect allows the magnetization of a spot or domain of the disk surface to be detected optically. To read a disk sector, it is scanned with the laser in a lower power mode (1 milliwatt) than used for writing (8 milliwatts). If the magnetization of a domain being scanned is north-pole-up, the polarization of the light reflected from the surface will be rotated clockwise, if north-pole-down, counter-clockwise. The sequence of polarity changes is detected and interpreted to produce a bit stream.

The main stumbling block in the development of erasable magneto-optic disk technology was the chemical instability of the coating caused by its repeated heating and cooling during the write-erase cycle. This instability limited the number of cycles that could occur and was caused by the high temperature (600 C) required to change the magnetization of a domain. The development of newer coatings that require lower temperatures (150 C) solved this problem. Current erasable magneto-optic disks now allow some ten million or more write-erase cycles.

There are performance differences between the two types of erasable technologies. The Kerr effect only causes about a 1 percent change in the polarization of the reflected light. This requires more optics and associated hardware in the disk head of a magneto-optic disk drive to detect such a small change. The reflectivity difference between the two states of phase-change type erasable optical disks is relatively large and much easier to detect so the disk head can be much simpler. The net result is that the seek performance of a magneto-optic disk drive will generally be poorer than that of a phase-change drive because of its more massive disk head.

Optics

The optical assemblies found in all types of optical disk drives are similar in nature and resemble those of a medium power microscope. The task of the assembly is to focus a beam of coherent light from a semiconductor laser on to a 1 micron size spot and provide a return path for reflected light to reach a photodetector. This is a difficult feat to accomplish as the requirements for economical mass production of disks and drives imply a certain degree of flexibility in their precision. As such, it cannot be guaranteed that the disk platter will be perfectly flat or round, or that the hole for the drive's spindle will be perfectly centered. These imperfections can cause the outer edge of the disk to move up and down as much as 1 millimeter and the disk itself side-to-side as much as 60 micrometers (37 tracks) as it rotates several times a second. These motions cause the position of a track to vary with time in three dimensions and require the optical assemblies to be much more than a simple arrangement of lenses.

To follow the moving track, the objective lens is encased in a voice-coil like arrangement that allows it to be moved up and down to adjust its focus and from side to side to allow it to follow the wandering track. This radial flexibility gives most optical disk drives the ability to quickly access more than one track on the disk from a single position of the access mechanism, simply by adjusting the position of the objective lens. This viewing capability is usually limited to a window of some 10 to 20 tracks on either side of the current position of the access mechanism.

Lasers and light paths

The light source in an optical disk drive is a Gallium Arsenide (GaAlAs) laser diode with a wavelength of about 800 nanometers (.8 microns). The first experimental drives employed gas lasers (HeNe) which have a shorter wavelength and allowed higher storage densities and data transfer rates, but they were later abandoned in favor of the longer wavelength semiconductor laser diodes, which are cheaper and small enough to be mounted in the optical assembly. Laser diodes also have the advantage of being modulated electrically, eliminating the need for the expensive external acoustooptic or electrooptic modulator required by the gas lasers.

The recording density is limited by the wavelength of the laser because the wavelength determines the size of the smallest feature that can be resolved on the disk surface. The shorter the wavelength, the smaller the feature and hence, the higher the possible recording density.

The light path in the optical assembly consists of a collimating lens, a polarizing beam splitter, a quarter-wave plate, and an objective lens. The collimating lens takes the highly divergent light from the diode and forms a straight directed beam. This beam passes unchanged through the beam splitter and on to the quarter-wave plate that rotates its polarization by 90 degrees (a quarter of a wave). The altered beam then enters the objective lens that focuses it on the disk surface to the required 1 micron size spot.

On the return path, light reflected from the disk surface passes back through the objective lens and on again through the quarter-wave plate where its polarization is given another 90 degree twist. Now 180 degrees out of phase, the returning beam is not passed by the beam splitter but is instead reflected perpendicularly toward the photodetector.

Recording formats

While optical disks as a family share similar optical assemblies, disk fabrication materials and techniques, they can differ considerably in their recording formats. There are four different formats now in current use. The most common are the Constant Angular Velocity (CAV) and the Constant Linear Velocity (CLV) formats. The other two are modified versions of the above called, appropriately enough, Modified Constant Angular Velocity (MCAV) and Modified Constant Linear Velocity (MCLV; also called Quantized Linear Velocity--QLV).

In the CAV format, the sequences of pits and lands are usually arranged into equal capacity concentric tracks and the disk drive rotates the disk at a constant rate (angular velocity). This causes the length of both the pits and the lands to become elongated as their position moves away from the center of the disk because the outer surface of the disk platter passes beneath the optical assembly at a faster linear rate than does the inner surface. This elongation causes the storage density on the surface of a CAV format WORM disk to be less at the outer edge of the disk than at the center.

In the CLV format, the sequence of pits and lands forms a single spiral track and the rate at which the disk platter rotates is adjusted by the drive to match the position of the optical assembly. The disk rotates faster when accessing the inner surface and slower when accessing the outer surface. This adjustment ensures that the disk surface passes beneath the assembly and its optics at a constant rate (linear velocity). Adjusting the rotation rate prevents the pits and lands from becoming elongated and results in a constant storage density across the surface of the disk.

The two formats, CAV and CLV, have complimentary advantages and disadvantages. As stated above, CLV disks have a greater storage capacity than CAV disks, but they also tend to have slower seek times. This is a consequence of the extra time required to accelerate the disk to the rotation rate that matches the new position of the access mechanism. So while a CAV disk may store less data than a CLV disk, accesses to that data are slightly faster.

The Modified CAV and Modified CLV formats each combine features of the CAV and CLV formats in an attempt to obtain greater storage capacities and seek times. A disk employing the MCAV format rotates at a constant rate and also has a nearly constant storage density across its surface. Its concentric tracks are divided into equal track capacity bands. Each band has one more sector in each of its tracks than the next innermost band that it surrounds. A clocking scheme adjusts to the varying linear rates at which pits and lands pass beneath the optical assembly. The MCLV scheme is similar except that the disk platter rotates at a different rate for each band.

All the formats are available for WORM disks. Erasable disks are available in the CAV and MCAV formats. CD-ROM disks are standardized and use only the CLV format.

6.3 FILE SYSTEMS

Virtually all of the research into file systems for optical disks has concentrated on WORM and CD-ROM optical disks. These are the oldest forms of optical disk technology and the types with the most differences from magnetic disks. Commercially available erasable optical disks are a relatively recent phenomenon and are close enough in capabilities to magnetic disks that conventional file systems can usually be adapted. Thus, we concentrate our discussion on file systems developed for WORM and CD-ROM optical disks. But, before we proceed we first present a short discussion on some technical issues affecting the implementation of file structures on optical disks.

6.3.1 Technical Issues for File Systems

Optical disk technology is similar enough to magnetic disk technology that the same types of file structures that are used on magnetic disks can usually be used in some form on optical disks. There are, however, some differences that can make some choices better than others.

For example, conventional pointer linked file structures (e.g., B-trees) are a poor choice for WORM optical disks. The reason for this is that each modification of the file usually requires some of the pointers linking the file structure together to change; rebalancing a B-tree after an insert or delete is a good example. Since storage space cannot be reclaimed on a WORM optical disk, changing the value of a pointer requires the new value to be stored in a new disk sector, consuming space. Thus, in the course of normal file maintenance operations, extra storage space is consumed.

This type of problem is present when linked lists or trees are used (as is true for B-trees). If an element or node is modified, then it must be stored on the disk in a new position and all pointers to the position of the old version of the node must be updated to point to the position of the new version. Changing those pointers in turn changes their positions on the disk requiring any pointers to their old positions to be changed as well. This means that if an element of a list is modified, all elements between it and the head of the list must be duplicated. The same is true for trees: if a node is modified, all nodes on the path up to, and including, the root, must be duplicated.

This is not the only problem with using pointer linked structures on WORM optical disks; the direction that the pointers point is also of considerable consequence. A forward pointer is one that points from an older disk sector (written before) to a younger disk sector (written after), and a backward pointer points from a younger sector to an older sector. It is a characteristic of WORM optical disks that it is not possible to detect a bad disk sector before the sector has been written; thus, if a forward pointer is stored on the disk, there is a chance that the sector is points to may subsequently turn out to be unusable, making the forward pointer invalid. Sector substitution schemes that might deal with this problem can be envisioned, but a better solution is to simply avoid the problem where possible by using backward pointers. Backward pointers do not have this problem as they always point to valid sectors.

Preallocation of disk space on a WORM disk can also lead to problems. Most disk drives are unable to differentiate between reading a blank sector and reading a defective sector (i.e., they both are unreadable). Thus, if space has been reserved on a disk, it will be impossible to detect the difference between the beginning of preallocated (and blank) space on the disk and a previously written sector that is now unreadable (possibly because of a media defect, dirt, or scratches). This inability will render unreliable any organization that depends on preallocated space on a WORM disk.

A further aspect of WORM optical disk technology to consider when designing file structures is the granularity of data registration. On a WORM disk the smallest unit of data storage is the disk sector which has a typical capacity of 1 kilobyte. It is not possible to write a portion of a disk sector at one time and then another portion of the same sector later. A sector may be written once, and only once, and can never be updated. This restriction comes from the error correction code added to the data when the disk sector is written. If a sector is modified after its initial write, its contents would become inconsistent with the error correction code and the drive would either correct the "error" without comment or report a bad sector. With the inability to update the contents of a disk sector, there is a danger of wasting the storage capacity of the disk through sector fragmentation.

Not all of the characteristics of optical disks lead to problems for the implementations of file structures. On CD-ROM disks, for example, since the data never changes some optimizations are possible. Also, the spiral track found on CLV format disks lends itself nicely to hash file organizations by allowing hash buckets of arbitrary size to be created, allowing bucket overflow to be eliminated.

6.3.2 Write-Once B-Tree

The Write-Once B-Tree (WOBT) of Easton (1986) is a variation of a B-tree organization developed for WORM disks. The difference between the two structures is the manner in which the contents of the tree's nodes are maintained.

In a conventional B-tree, when a node is modified, its previous value or state is discarded in favor of the new value. This practice recycles storage space, but is not desirable in all situations. For some applications the ability to access any previous state of the tree is a distinct advantage; the WOBT allows this.

The WOBT manages the contents of the tree nodes in a manner that preserves all versions of a node throughout the life of the tree. This is accomplished by not overwriting nodes when they change, but instead appending new time-stamped entries to the node. The most current state of the tree is represented by the latest version of each entry (including pointers) in a node.

When a node is split in a WOBT, only those entries that are valid at the current time are copied into the new version of the node. The old version is retained intact. Deletions are handled by inserting a deletion marking record that explicitly identifies the record to be deleted and its deletion time.

The diagram in Figure 6.1 illustrates a WOBT with three nodes containing the records C, D, F, G, and H. The diagram shows the root node labeled 1 and two children, nodes 2 and 3. The root has an extra entry used to link different versions of the root node together. Being the first root of the tree, node 1 has a NULL (0) pointer for this entry. The rest of the entries in the root point to the children of the root. The first data entry indicates that C is the highest record in node 2, and F the highest in node 3.

Figure 6.1: Write-once B-tree

When a record A is added to the tree, it is simply appended to node 2 in the space available for it, and an entry is propagated up to the parent of node 2, the root. This is illustrated in Figure 6.2.

Figure 6.2: Write-once B-tree after insertion of "A"

The root now contains two entries which point to node 2. When we access the tree with respect to the current time, we use only the most recent values of the record/pointer pairs, in this case A/2 and F/3, which are found by a sequential search of the node. If we access the tree with respect to a time before the insertion of A, we find the record/pointer pairs C/2 and F/3. The record A in node 2 would not be found in that search because its later time-stamp would disqualify it.

When we insert a further record "J" into the tree, we must split both node 3 and the root node 1. The result is illustrated in Figure 6.3. When we split node 3 we end up with two new nodes 4 and 5. When node 1 is "split" in this case, we can make room in the new node by not including the data/pointer pair C/2 which is now obsolete. This results in one new node, node 6. The extra entry in the new root node now points back to the old root, node 1, so that accesses with respect to a previous time are still possible. Node 3 is no longer in the current version of the tree, but can still be accessed as part of the old version of the tree (prior to the insertion of "J") through the old root.

Figure 6.3: Write-once B-tree after insertion of "J"

Features to note about the WOBT are that search times will be slower than for a conventional B-tree as extra time will be required for a sequential search of each node to determine the current values of the data/pointer pairs. The WOBT also has problems with sector fragmentation since all modifications of the tree are stored as they are made directly on the WORM disk. It is also not particularly efficient with frequent updates to a single record, since each update must be stored on the disk in at least one disk sector.

However, despite these drawbacks, the WOBT is a robust (using backward pointers only) and elegant solution to the problem of efficiently implementing a multiway tree on an indelible storage device.

6.3.3 Time-Split B-Tree

The Time-Split B-tree (TSBT) of Lomet and Salzberg (1989) is an enhancement of the Write-once B-tree that eliminates some of its problems while adding to its utility. The basic structure and operation of the TSBT are identical to that of the WOBT. The difference between the two is that the TSBT distributes the contents of the tree over both a magentic disk and a WORM optical disk, and employs a slightly different approach to node splitting that reduces redundancy.

Combining magnetic and optical storage technologies allows their individual strengths to complement each other. This is an idea also employed in the buffered hashing organization of Christodoulakis and Ford (1989a). Write-once optical disks have a very low cost per stored bit and, as well, ensure that data cannot be accidently or maliciously deleted. Magnetic disks offer faster access and allow modifications to stored data without consuming storage space. A TSBT employs a magnetic disk to store the current changeable contents of the tree and the write-once disk to store the unchangeable historical contents.

The migration of the historical contents of the TSBT to the write-once disk is a consequence of node splitting. The TSBT divides node splits into two types: Time Splits and Key Splits. A time split occurs when a node is full of many historical (i.e., not in the current version of the tree) entries. A time split makes a new version of the node that omits the historical versions of records. In the TSBT, the old version of the node is removed from the magnetic disk and stored on the WORM optical disks.

A key split is the normal type of split associated with conventional B-trees and occurs when a node is full and most of the records are current (in a conventional B-tree, the records are always current). The result is two new nodes stored on the magnetic disk, each with roughly half of the records (historical and current) of the original node.

The advantages of the Time-split B-tree over the Write-once B-tree are many.The magnetic disk allows faster access to the current data than is possible if it were stored on an optical disk alone. It also allows transactions to make nonpermanent entries before they commit; these can be deleted if the transaction aborts. Records can also be updated without consuming disk space. Lower sector fragmentation is also a result, since buffering on the magnetic disk tends to reduce fragmentation by buffering many small changes into one larger change.

The splitting policy of a TSBT can be skewed to favor different objectives. If the total amount of storage space consumed is a concern, then key space splits should be favored as time splits tend to increase redundancy in the database. If the size of the current version of the B-tree (i.e., the part of the tree being stored on the magnetic disk) is a concern, then time splits should be favored as they free occupied space on the magnetic disk.

When implementing the TSBT, care must be taken to recognize that the size of the tree is limited not by the capacity of the optical disk, but by the capacity of the magnetic disk, as it stores all of the current contents of the tree. If the magnetic disk is full, the amount of remaining storage space on the optical disk will be irrelevant.

6.3.4 Compact Disk File System

The Compact Disk File System (CDFS) of Garfinkel (1986) is a system independent hierarchical file system for WORM optical disks. The goals of the CDFS are to be completely transportable across a variety of modern operating systems, to make efficient use of storage space, and to have a relatively high retrieval performance.

Unlike the write-once and time-split B-trees, the CDFS does not provide a structure for the organization of records, but rather a structure for the organization of groups of complete files. The application for which it is primarily intended is to organize files that experience few modifications, such as those belonging to a source code archive. The smallest unit of registration in the CDFS organization is the file.

The basic unit of organization in the CDFS is called a "transaction." A transaction results from the process of writing a complete group of files on the optical disk. All the files in a transaction group are placed on the disk immediately adjacent to the position of the previous transaction. Each individual file is stored contiguously. At the end of a transaction, an updated directory list for the entire file system is stored along with an "End of Transaction" (EOT) record. The EOT record contains a link to the EOT record of the previous transaction allowing access to historical versions of the organization (a dummy EOT record is stored at the start of an empty disk). The last transaction on the disk is the starting point for all accesses and the directory list it contains represents the current version of the file hierarchy.

The CDFS contains three types of "files": regular files, directories, and links. Each file is given a unique sequence number for the file system and a version number. If a file is updated by writing a new copy of it at some later time, the new copy retains the sequence number, but receives a new version number.

Each stored file entry consists of two parts, a file header and a file body. The header, which is invisible to a user, stores a large amount of explicit information about the file. This is an attempt by the CDFS to span the entire space of file charecteristics that any given operating system might record or require. For example, the file header contains the name of the owner of the file; on some systems (e.g., UNIX) this information must be derived by consulting a system database. This explicit information allows the contents of a single disk employing a CDFS to appear to be a native file system on more than one operating system (with appropriate drivers for each system).

A directory is a special type of file that contains entries identifying other files known as members. These entries include pointers to the disk positions of the members and other information such as file sizes and modification times. This extra information is redundant since it is also stored in the file header, but it serves to improve the performance of directory list operations by eliminating the seeks required to access each member.

A link entry is simply a pointer to a file or a directory and allows a file to be a member of more than one directory.

The directory list stored at the end of the files in the transaction is an optimization to reduce seeks. It is a list of the positions of all current directories and subdirectories in the hierarchical file system. Using the directory list improves performance by reducing the seeks needed to traverse the file directory tree.

The diagram in Figure 6.4 illustrates how an instance of a CDFS is stored in a series of disk sectors. The example is for two transactions for a CDFS consisting of three files. The second transaction is used to store a second expanded version of the second file in the file system. The arrows in the diagram represent the pointers which link the various constituents of the CDFS together. For example, (backward) pointers exist between the EOT records, between each EOT record and its directory list, between the directory list and the directories (in this case just one, the root), and between the root and the three files.

Figure 6.4: State of compact disk file system after two transactions

The CDFS is an efficient means of organizing an archive of a hierarchical file system. Its main drawback is that it does not allow efficient updates to files. Any change to a single file requires the entire file to be rewritten along with a new copy of the directory list. Extra information could be added to the file header to allow files to be stored noncontiguously. This would allow portions of the file to be changed while other parts remained intact.

The robustness of the CDFS inherent in the degree of redundancy found in the organization, coupled with the access it allows to all previous versions of a file, makes it ideal for use in storing file archives.

Being relatively system independent, the CDFS is also an excellent organization for data interchange. It would be possible, for example, to copy a complete UNIX file system to an optical disk employing the CDFS and then transport it to an appropriate VMS installation and access the archived UNIX file system as if it were a native VMS file system.

6.3.5 The Optical File Cabinet

The Optical File Cabinet (OFC) of Gait (1988) is another file system for WORM optical disks. Its goals are quite different from those of the CDFS described previously. Its main objective is to use a WORM disk to simulate an erasable file system such as that found on a magnetic disk, and appear to an operating system just as if it was any other magnetic disk file system. It does this by creating a logical disk block space which can be accessed and modified at random on a block-by-block basis through a conventional file system interface.

The mapping between the logical and physical blocks is provided by a structure called the File System Tree (FST) which resides on the WORM optical disk, (see Figure 6.5). To translate between logical and physical blocks, the logical block number is used to find a path through the FST to a leaf; the leaves of the tree are the physical disk blocks of the WORM optical disk.

Figure 6.5: File system tree (FST) for optical file cabinet

Both the interior nodes and the leaves of the tree are buffered in main memory. The buffers are periodically flushed (e.g., every 5 minutes) and written on the optical disk in a process called checkpointing. Each flush results in a new version of the FST residing on the WORM disk. The roots of each of the different versions of the FST are pointed to by members of a list also residing on the optical disk called the time-stamp list. The current version of the FST is found by traversing the time-stamp list to find its last element.

The time-stamp list is implemented by a "continuation list." A continuation list consists of a series of disk blocks linked together via forward pointers. The pointer in the last element of a continuation list contains a pointer to a preallocated but empty disk block. The end of the list is detected when an attempt to read the next disk block in the list fails. In the OFC, it is assumed that the reason for the failure is because the preallocated (and empty) sector was read.

The use of forward pointers by the Optical File Cabinet file system seems to be a major flaw. As discussed previously, forward pointers are unreliable on WORM optical disks. It would also appear that if any of the disk sectors in which the time-stamp list is stored were ever damaged (e.g., by a permanent scratch or temporary dirt), the entire organization would be crippled. The system would interpret the bad sector read as the end of the time-stamp list and access historical contents of the file system rather than the current contents. At the next checkpoint time it would attempt to write to the damaged sector and find itself in a dead end. Even if none of the current sectors of the time-stamp list are damaged, the list could still encounter a preallocated sector that is damaged (e.g., media defect). An error such as this would require the entire time-stamp list to be copied and a scheme for identifying and locating alternate time stamp lists on the disk.

The utility of the Optical File Cabinet file system is also difficult to establish. As a replacement for magnetic disks it is an expensive substitute, as the write-once disks eventually fill up and must be replaced.

The most appropriate application seems to be in fault-tolerant systems that require their entire file systems to be permanently checkpointed at frequent intervals.

For systems with less demanding requirements the Write-once B-tree or the Time-split B-tree implemented for specific files might be more appropriate. The same is true for the archiving of copies of specific files (e.g., source code), where an organization such as the Compact Disk File System would be a better choice.

6.3.6 Buffered Hashing (Bhash)

Buffered Hashing (Bhash) of Christodoulakis and Ford (1989a) is a Hash file organization for WORM optical disks which employs a rewritable buffer to obtain performance improvements and reductions in storage space consumption. The buffer helps to reduce sector fragmentation, but its main purpose is to group many small changes to the file (record insertions) into one larger insertion. This is important because preallocating space for a hash bucket, as is done on magnetic disks, is not really a viable option for a WORM disk. This is because it is difficult to distinguish between a preallocated empty disk sector and an occupied but damaged sector; both are read as bad sectors.

Since successive insertions are unlikely to be to the same hash bucket, it is likely that the contents of a hash bucket will be stored in different (unpredictable) spots spread around the disk. By having a larger insertion, the degree to which the contents of a bucket are spread around the disk as a function of the number of record insertions is reduced. This, in turn, will reduce the number of seeks required to retrieve the contents of a hash bucket.

The rewritable buffer employed can be either main memory or a magnetic disk. As records are added to the file, they are first stored in the buffer and linked into a list of other records that belong to their hash bucket. When the buffer is full, the largest group of records belonging to a single hash bucket are removed (flushed) from the buffer and stored as a contiguous unit on the WORM optical disk. This group is linked to a list on the optical disk of other such groups that were flushed previously and that belong to the same bucket. The space in the buffer freed by the flush is then reused to store more records belonging to any hash bucket.

The length of the list (number of different groups) on the WORM disk will determine the number of seeks required to complete the retrieval of a bucket. If adding a new group to the list would make the length exceed some retrieval performance limit on the number of seeks required to access a bucket, the groups of the list are merged together to form a single larger contiguous group. This group is stored on the WORM disk as well, and is used in subsequent retrievals of the bucket. The new bucket group also becomes the first member of a new group list, to which subsequent bucket flushes will be appended. When the length of that list exceeds the limit, it too will be merged.

Deleted records are either removed from the buffer if they have not yet been flushed at the time of the deletion, or are marked as deleted by storing a deletion record in the bucket.

Because pointers to all previous groups on the disk are available and can be stored on a magnetic disk (such as the one being used for the buffer), access to all previous versions of the database is possible.

The parameters of the organization can be adjusted to obtain either fast retrieval performance at the expense of high storage space consumption or low disk space consumption at the expense of slower retrieval performance. Retrieval performance is affected primarily by the length limits placed on the group lists. If high performance is desired, the allowed length of the lists will be very short, resulting in many merges and increased storage consumption. If low storage consumption is required, the allowed length of the lists will be quite long, resulting in very infrequent merges but more seeks.

The size of the buffer and the number of buckets also plays a role in determining the performance of the organization. If the buffer is large, it will obviously require less flushing and hence fewer merges will occur and less space will be consumed. If the number of buckets is very large, then we can expect that the size of a group flushed from the buffer will be small, so flushes will occur more often, this will consume more space since it will increase the number of merges and in turn the number of redundantly stored records.

The Bhash file organization is a good method for implementing a hash file organization on a WORM optical disk. It can be tuned by an implementor to make efficient use of storage space or to meet strict retrieval performance demands. It also can function as a roll-back database, giving access to all previous versions of each record ever stored in the hash file.

6.3.7 Balanced Implicit Multiway Trees (BIM Trees)

The static nature of data stored on a CD-ROM optical disk allows conventional B-tree structures to be fine tuned to produce a completely balanced multiway tree that does not require storage space for pointers. Such a tree is called a Balanced Implicit Multiway Tree or a BIM tree, as presented in Christodoulakis and Ford (1989b). Because all the data to be stored on a CD-ROM disk is available at the time the disk is created, it is possible to preconstruct a perfectly balanced multitree for the data. And given that the size of all nodes of the tree and their layout is known, no pointers need to be stored since it is easy to compute the position on the disk for a given node in the tree.

When constructing a BIM tree, it is possible to choose node sizes and layouts that will improve the expected retrieval performance for accesses from the tree. If the node size is chosen such that a parent node and all of its children can fill within the viewing window of the objective lens of the disk drive, it will be possible to eliminate the seek required to traverse the link between the parent and one of its children. For example, with the root buffered in main memory and each of the second level nodes stored with all of their children within the viewing window of the objective lens, only a single seek, to the second level, would be required to retrieve any record within a three-level BIM tree.

6.3.8 Hashing for CD-ROM and CLV Format Optical Disks

The spiral track that is characteristic of Constant Linear Velocity (CLV) format optical disks such as CD-ROM is ideal for implementing hashing file organizations and can guarantee single seek access as discussed in Christodoulakis and Ford (1989b).

The biggest complication in a hashing organization, and the biggest source of performance degradation, is the resolution of hash bucket overflows. Buckets overflow because they are usually associated with a physical division on the storage device that has a finite capacity, either a track or a cylinder. The spiral track, which can be read continuously for the entire capacity of CLV format disks, allows hash buckets to be as large or as small as is necessary to store their contents. With a spiral track there is no need to impose an arbitrary physical limit (up to the capacity of the disk) on a bucket. This is particularly true for CD-ROM disks on which the data never changes.

On such disks, hash buckets can be laid out along the spiral track one after another. To determine the position of each bucket, a small bucket position translation table recording the beginning position of each bucket can be used. With the translation table in main memory, the contents of a bucket can be accessed with a single seek. This last feature is particularly attractive for optical disks, as they have relatively slow seek times.

REFERENCES

CHRISTODOULAKIS, S., and D. A. FORD. 1989a, June. Retrieval Performance Versus Disc Space Utilization on WORM Optical Discs. Paper presented at the annual meeting of the Special Interest Group for the Management of Data of the Association of Computing Machinery (ACM SIGMOD'89), Portland, Oregon, 306-14.

CHRISTODOULAKIS, S., and D. A. FORD. 1989b, June. File Organizations and Access Methods for CLV Optical Disks. Paper presented at the annual meeting of the Special Interest Group for Information Retrieval of the Association of Computing Machinery (ACM SIGIR'89), Cambridge, Massachusetts.

EASTON, M. C. 1986. "Key-Sequence Data Sets on Indelible Storage." IBM Journal of Research and Development, 30(3), 230-41.

GAIT, J. 1988. "The Optical File Cabinet: A Random-Access File System for Write-Once Optical Disks." Computer, 21 (6), 11-22.

GARFINKEL, S. L. 1986. "A File System For Write-Once Media." Technical Report MIT Media Lab, September.

KERR, J. 1877. "On the Rotation of the Plane of Polarization by Reflection from the Pole of a Magnet. Philosophical Magazine, 3, 321-43.

LOMET, D., and B. SALZBERG 1989, June. Access Method for Multiversion Data. Paper presented at the annual meeting of the Special Interest Group for the Management of Data of the Association of Computing Machinery (ACM SIGMOD'89), Portland, Oregon, 315-24.

Go to Chapter 7     Back to Table of Contents