The Data Structures Connection

The study of data structures is a natural link between learning a programming language and writing application programs. Data may be arranged and managed at many levels: within memory; at the programming-language level; in terms of the logical structure of an algorithm; and at the conceptual level, where a mental image of arrangements and relationships is developed. Programming tactics can differ at each of these levels, and the levels interact. For example, suppose that algorithms *Alpha* and *Beta* solve the same problem. *Alpha* is more wasteful of some resource than is *Beta*, and *Beta* is easy to implement in language A but awkward or costly in programming time to implement in language B. If programming is to be done in language A, *Beta* is the obvious choice, but if it is to be done in language B, a balance must be struck. The specific details of the application determine whether the awkwardness of *Beta* or the waste of the resource by *Alpha* is more significant.

A common example of the interaction among these levels is that the algorithms used for manipulating data in one structural arrangement are different from those used to manage data arranged in another structure. The overall plan of attack for a problem depends on a choice of data structures, which in turn may depend on features of the language to be used.


The meaning of each of the pseudocode structures used in this book is explained in the next section.

**Note**: Part II of this book, "Expansions and Applications," includes working programs realized in Pascal. They implement some of the procedures presented in pseudocode in the main part of the text.

If it weren't for the powerful effects of data structures and of other ideas, commands given to computers could be comfortably limited to six basic operations:

**1.** Copy data from one memory cell to another.

**2.** Perform arithmetic and logical operations.

**3.** Input data: place data from the outside world into memory.

**4.** Output data: retrieve data from memory and send it to the outside world.

**5.** Branch unconditionally to some command.

**6.** Branch on the truth of a condition to some command.

The essential constructs of SUE are:

```
variable ← expression
```

Assignment. The keystrokes that form the assignment operator are a matter of syntax.

```
Write(v1, v2, . . . , vn)
```

Output the values of a list of variables or messages. (The distinctions between Write and Writeln in Pascal, and between the effects of commas and semicolons in BASIC, are features of specific languages.)

```
Read(v1, v2, . . . , vn)
```

Input the values of a list of variables. This form, and Write, will also serve for files when *v*_{1} is a file identifier.

*Read* and *Write* are intended to be more general than they are in most working languages. Variables may be of *any* structural type in SUE. The format of the output of an array, record, or other variable type is provided during the translation of SUE into a working language:

```
if condition
  then S1
  else S2
endif
```

S_{1} and S_{2} may be any sequence of statements. The else phrase may be missing. In a nested structure, an else phrase is taken to be part of the most recent if. For example:

```
if A < B
  then if A < C
         then Write(C)
         else Write(A)
       endif
endif
```

Another kind of selection statement is **case**, which may appear in two forms:

```
case
  cond1 : S1
  cond2 : S2
   .
   .
   .
  condn : Sn
  else  : Se
endcase
```

The conditions cond_{i} must be TRUE or FALSE. S_{i} can be any sequence of statements, and will be executed iff (if and only if) cond_{i} is TRUE. It is the programmer's responsibility to make sure that only one condition can be TRUE; there should be no overlap. The else may be omitted, in which case if no cond_{i} is TRUE, then no execution will take place in this statement.

```
case of V
  Value1 : S1
  Value2 : S2
   .
   .
   .
  Valuen : Sn
  else   : Se
endcase
```

Here the variable V must take on discrete values. It is sometimes convenient to use values such as Small, Medium, and Large instead of, for example, 1, 2, and 3, although many languages do not support this feature. A language that restricts case values to be ordinal (ordered discrete) may require a table lookup or other map for implementation.
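In a language that lacks the value form of **case**, the table-lookup implementation mentioned above is direct. A minimal Python sketch, in which the size values and the action strings are invented for illustration:

```python
def describe(size):
    # Map each discrete value to its statement sequence S_i,
    # here represented by a function returning a string.
    dispatch = {
        "Small":  lambda: "order the half portion",
        "Medium": lambda: "order the regular portion",
        "Large":  lambda: "order the double portion",
    }
    # The else branch: executed when no value matches.
    return dispatch.get(size, lambda: "no matching case")()

print(describe("Small"))   # -> order the half portion
print(describe("Jumbo"))   # -> no matching case
```

The dictionary plays the role of the map: exactly one entry can match, so the no-overlap requirement is enforced by the structure itself.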

Loop structures may be classified by the location of the test for exit from the loop: at the top, at the bottom, or in the middle. They all must determine the beginning and the end of the loop; that is, what is included and what is not, as well as the exit-test position. The three essential loop forms in SUE are:

```
repeat
  S1
  if cond then exit endif
  S2
forever
[next statement]
```

Here S_{1} and S_{2} may be any sequence of statements. The test for exit is in the middle, not at the top or at the bottom. The condition must be TRUE or FALSE. The exit passes control to [next statement].
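Languages without a mid-test loop form usually express repeat . . . exit . . . forever as an infinite loop with a break. A Python sketch of the skeleton, applied to a hypothetical task (summing values until a negative sentinel appears; the data are assumed to contain one):

```python
def read_until_negative(values):
    """Mid-test loop: S1 runs before the test, S2 only when the test fails."""
    total = 0
    stream = iter(values)
    while True:            # repeat
        x = next(stream)   #   S1: obtain the next value
        if x < 0:          #   if cond then exit endif
            break
        total += x         #   S2: use the value
    return total           # control passes to [next statement]

print(read_until_negative([3, 4, 5, -1, 9]))  # -> 12
```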

**Note**: Not all structured languages (notably standard Pascal) provide for **exit**. An exit can be made from any loop by declaring a label and using a *GOTO* (an unconditional branch).

An example of a *GOTO* in Pascal is:

Pascal:

```
REPEAT
  Read(x);
  IF x < 0 THEN GOTO 1;
  Sum := Sum + x
UNTIL Sum > 100;
1 : Write(Sum)
```

SUE:

```
repeat
  Read(x)
  if x < 0 then exit endif
  Sum ← Sum + x
until Sum > 100
Write(Sum)
```

```
repeat
  S
until cond
```

The condition cond must be TRUE or FALSE. S is any sequence of statements. The test for exit occurs explicitly at the bottom, thus S will be executed at least once. S may include an exit.

```
while cond do
  S
endwhile
```

The condition cond must be TRUE or FALSE. S is any sequence of statements. The test for exit occurs explicitly at the top, thus S may not be executed at all. S may include an exit.

A common mode of looping is the use of a control variable as a counter to keep track of the passes through a loop. This may be done in SUE with:

```
for v = Init to Final by Step do
  S
next v
```

S is any sequence of statements. The next v may be replaced by endfor if convenient. The by Step part is omitted when the Step value is 1. The use of downto and a negative step value will run the count from high to low.

The **for** loop is equivalent to the following, with one disclaimer stated below:

```
v ← Init
while v <= Final do
  S
  v ← v + Step
endwhile
```

Init, Final, and Step may be expressions, and the control values derived from them on entry to the loop do not change during loop execution. This is true even if they are variables that have assignments made to them within the loop. The expression Step may be negative, in which case the test for exit becomes while v >= Final do.
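The disclaimer can be made concrete: the control values are frozen on entry, so later assignments to the variables that produced them have no effect on the loop. A Python sketch of the equivalence (Python's own range behaves the same way; the helper name for_loop is invented):

```python
def for_loop(init, final, step, body):
    """Run body(v) for v = init to final by step; bounds are frozen on entry."""
    v = init
    # A positive step tests v <= final; a negative step tests v >= final.
    while (v <= final) if step > 0 else (v >= final):
        body(v)
        v = v + step

seen = []
for_loop(1, 10, 3, seen.append)
print(seen)   # -> [1, 4, 7, 10]
```

A call such as `for_loop(5, 1, -2, ...)` runs the count from high to low, as downto does in SUE.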

Computing languages differ in the way they pass information between programs or between parts of one program. SUE simplifies the description of information-sharing as much as possible, while remaining consistent with the modularization of programs in structured languages.

A *procedure* in SUE is a segment of code that is independent in the sense that it can be invoked conveniently, it does a specified task, and it then returns control to the point immediately following its invocation. A *program* is itself a procedure that may invoke functions or other procedures, called *subprograms.* A list of parameters that provide information to or return information from a procedure may be included in the invocation and the definition. Variables not listed are considered to be global (defined and available outside of the procedure). As an example:

```
procedure Main
  .
  .
  .
  Sum ← 0
  AddAbs(x)
  Read(y)
  AddAbs(y)
  Write(Sum)
  .
  .
  .
end {Main
```

```
procedure AddAbs(v)
  if v < 0
    then Sum ← Sum - v
    else Sum ← Sum + v
  endif
end {AddAbs
```

The formal parameter *v* can be related to actual parameters *x* and *y* in two major ways:

**1.** Treat *x* (or *y*) as an expression. Its value is stored locally in *v.* No other information about the source is available to *AddAbs.* In this case, *v* is called a *value parameter.*

**2.** Treat *x* (or *y*) as an address. The address is stored in *v,* which becomes a *pointer.* Whenever an assignment is made in *AddAbs* to *v,* the assignment is actually to the (actual) parameter *x.* In this case, *v* is called a *variable parameter.*
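The two modes can be imitated in Python, which passes object references by value: a plain argument behaves like a value parameter for immutable data, and a one-cell list can stand in for a variable parameter. A sketch of AddAbs both ways (the list wrapper is an illustrative device, not a SUE feature):

```python
def add_abs_value(sum_value, v):
    """Value parameter: the caller's variable is untouched; return the result."""
    return sum_value + (-v if v < 0 else v)

def add_abs_var(sum_cell, v):
    """Variable-parameter imitation: sum_cell[0] is updated in place."""
    sum_cell[0] += -v if v < 0 else v

total = add_abs_value(0, -3)   # caller must capture the returned value
cell = [total]
add_abs_var(cell, 4)           # the caller's cell is changed directly
print(cell[0])   # -> 7
```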

A *function* in SUE is a procedure that returns a single value. A function, therefore, may be invoked by using it anywhere that a variable can be used. The result that it calculates becomes the value of the function name. For example:

```
procedure Main
  .
  .
  .
  Sum ← 0
  Sum ← Sum + Abs(x)
  Read(y)
  Sum ← Sum + Abs(y)
  Write(Sum)
  .
  .
  .
end {Main
```

```
function Abs(v)
  if v < 0
    then Abs ← -v
    else Abs ← v
  endif
end {Abs
```

A valuable tool for presenting the logic of some algorithms (and for implementing them when it is available) is **return.** A **return** may be used to exit immediately from a subprogram at any point. If the subprogram is a function, **return** *v* indicates that *v* is to be returned as the functional value. When it is not available as a language feature, **return** is implemented much like **exit**.

The correspondence between Pascal, for example, and SUE can be quite close:

Pascal:

```
FUNCTION Min(m,n : REAL): REAL;
LABEL 1;
BEGIN
  IF (m < 0) AND (n < 0)
    THEN BEGIN
      Min := 0;
      GOTO 1
    END;
  IF m > n
    THEN Min := n
    ELSE Min := m;
1 : END;
```

SUE:

```
function Min(m,n)
  if (m < 0) AND (n < 0)
    then return 0
  endif
  if m > n
    then return n
    else return m
  endif
end {Min
```
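In a language that provides **return** directly, the SUE version carries over almost line for line, and the Pascal LABEL/GOTO machinery disappears. A Python rendering (a sketch; the name minimum avoids shadowing Python's built-in min):

```python
def minimum(m, n):
    """SUE Min: returns 0 when both arguments are negative, else the smaller."""
    if m < 0 and n < 0:
        return 0          # early return replaces the Pascal GOTO to the end
    if m > n:
        return n
    return m

print(minimum(-2.0, -5.0))  # -> 0
print(minimum(3.0, 7.0))    # -> 3.0
```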

For the sake of completeness, records and pointers are reviewed in this section.

**1.** The cells of a record do not need to contain data of identical type.

without an assumed initialization of *p*.

Comparisons of algorithms are made on the basis of the management of resources, and one crucial resource is execution time. It is generally not necessary to determine exact execution times, which may be data-dependent or machine-dependent anyway and which can be seriously misleading in interactive systems. There is an important difference, however, between an algorithm that takes a minute or so to run and one that takes an hour.

```
[statement 1]
for i = 1 to n do
  [statement 2]
  [statement 3]
next i
```

*T*(*n*) = 1 + 2*n*

This can be easily generalized by assuming that [*statement i*] requires *t*_{i} units of time:

*T*(*n*) = *t*_{1} + (*t*_{2} + *t*_{3})*n*

```
[statement 2]: for j = 1 to m do
                 [statement 4]
               next j          {t2 = m · t4
```

Complications arise, however, if the execution of [*statement 2*] depends upon *i*:

```
[statement 2]: for j = 1 to i do
                 [statement 4]
               next j
```

Now we have, since *m* = *i*:

*T*(*n*) = *t*_{1} + Σ_{i=1}^{n} (*t*_{3} + *i*·*t*_{4})

At this point, we need to document some of the properties of summations, because loop execution times are a sum of the times required for each pass through them.

Σ_{i=1}^{n} *f*(*i*) = Σ_{i=1}^{k} *f*(*i*) + Σ_{i=k+1}^{n} *f*(*i*)    (i), where *k* < *n* and *f* is any function of *i*

Σ_{i=1}^{n} (*a*·*f*(*i*) + *b*) = *a*·Σ_{i=1}^{n} *f*(*i*) + *b*·*n*    (ii), where *a* and *b* are not functions of *i*

Σ_{i=1}^{n} *i* = *n*(*n* + 1)/2    (iii)

From equation ii, then, the value of *T*(*n*) is:

*T*(*n*) = *t*_{1} + *t*_{3}*n* + *t*_{4}·Σ_{i=1}^{n} *i* = *t*_{1} + *t*_{3}*n* + *t*_{4}·*n*(*n* + 1)/2

If it may be assumed that *t*_{1} = *t*_{3} = *t*_{4} = 1, then:

*T*(*n*) = 1 + *n* + *n*(*n* + 1)/2 = *O*(*n*^{2})
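The closed form can be checked by instrumenting the nested loop and counting unit-time statements directly. A Python sketch, assuming *t*_{1} = *t*_{3} = *t*_{4} = 1 as in the text, so the total should be 1 + *n* + *n*(*n* + 1)/2:

```python
def count_steps(n):
    """Count unit-time statements: [stmt 1]; then for i = 1..n,
    an inner loop j = 1..i of [statement 4] followed by [statement 3]."""
    t = 1                       # [statement 1]
    for i in range(1, n + 1):
        for j in range(1, i + 1):
            t += 1              # [statement 4], executed i times
        t += 1                  # [statement 3]
    return t

# T(n) = 1 + n + n(n+1)/2 for every n checked
for n in (1, 5, 20):
    assert count_steps(n) == 1 + n + n * (n + 1) // 2
print(count_steps(5))   # -> 21
```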

As a more specific example, consider:

**EP1.1 **Input *n* numbers, and list those that follow the occurrence of the minimal entry value.

```
program TrailMin
  Read(Entry[1])
  Mindex ← 1
  for i = 2 to n do
    Read(Entry[i])
    if Entry[i] < Entry[Mindex]
      then Mindex ← i
    endif
  next i
  for i = Mindex + 1 to n do
    Write(Entry[i])
  next i
end {TrailMin
```

Clearly, though, *Mindex* may be a function of *n* itself. In section 1.3.2 it is shown that *Mindex* can be expected to have the middle value (*n* + 1)/2, and so *T*(*n*) may be taken to be (5*n*/2) - 1/2 as a reasonable approximation. If the point of the analysis is the growth of *T*(*n*) as *n* increases, then it is enough to have found that *T*(*n*) is of the form *an* + *b*, a linear function of *n*.

The growth behavior of an algorithm is quite often taken to be the salient fact revealed by timing analysis, and it will be explored further in the next section. Growth behavior provides a way to choose between algorithms that are to be applied to large data sets.

Another way to characterize *Mindex* as a function of *n* arises from the recognition that the execution time of *Mindex* is partly determined by the distribution of values in the sequence of entries. Random variation plays a role in the behavior of algorithms, and thus statistics should play a role both in the comparison of algorithms and in the study of the data structures on which they are founded. (See section 1.3.2.)

The analysis of algorithms includes the study of the worst-case data set of a given size, the (statistical) average case, and the limits imposed by the problem itself. In some cases, results of importance are derived experimentally. In this text, the interest is in a rough but convenient measure that is fairly easy to recognize for common algorithms. That measure is the *order of growth*.

1.3.2 The Variability of Data (optional)

Not all timing functions are linear in the size of the data set being processed. Common timing functions include quadratics and other polynomials, logarithms (a base of 2 is used in such studies), exponential functions, and others. These functions grow with *n* at dramatically different rates, as Table 1.1 reveals.

Comparison of growth rates is made explicit by the formal definition of order: a timing function *T*(*n*) is of order *f*(*n*), written *T*(*n*) = *O*(*f*(*n*)), if there are positive constants *c* and *N* such that *T*(*n*) ≤ *c*·*f*(*n*) for all *n* ≥ *N*.

We need only to extract the kernel of this approach: for large enough *n*, *T*(*n*) grows no faster than a constant multiple of *f*(*n*).

The most useful orders and their relative growth rates are:

*O*(1) < *O*(ln *n*) < *O*(*n*) < *O*(*n* ln *n*) < *O*(*n*^{2}) < *O*(*n*^{3}) < *O*(2^{n})

```
[statement 1]
for i = 1 to n do
  [statement 2]
  [statement 3]
next i
```

The order is clearly *O*(*n*) as long as statements 1, 2, and 3 may be assigned unit time. If they call other procedures, however, then it may be necessary to use the analysis techniques at the beginning of section 1.3.

There are some important points to note about the analysis of algorithms:

If large data sets are to be handled by a procedure, then the order of its timing function is of great concern. It may be possible to recognize the order without a detailed analysis, as we will often do, and some approaches can be dismissed on that basis.

It is possible to place too much emphasis on the order of T. Many procedures are applied exclusively to relatively small data sets. If so, concern with O(T(n)) may obscure an important difference in the execution times of two algorithms.

The variability of data sets can profoundly affect the behavior of an algorithm. Much of statistics can be brought to bear on the study of computing processes in situations where it is useful to do so.

Some well-known algorithms have defied analysis, and others arise in practice that are not cost-effective to analyze in detail, even if one is prepared and able to make the analysis. It is possible, and sometimes profitable, to study procedures experimentally by varying data sets and observing the reaction of the procedures to the changes.

*Exercise E1.1 and problem P1.1 are appropriate at this time.*

When *TrailMin* (see section 1.3) is examined in terms of the order of growth, it is evident that *T*(*n*) = *O*(*n*), no matter what value *Mindex* has. For a more detailed analysis, it is necessary to study the variations of *Mindex*. The statistical tool that is required is a statistical average, the *expected value*. If a variable, *x,* can take on values with probability *P*[*x*], then the expected value is:

*E*[*x*] = Σ *x*·*P*[*x*], the sum taken over all possible values of *x*

The possible values of *Mindex* are 1, 2, . . ., *n*. If they are all equally likely, then their probabilities are all equal to 1/*n*. Hence:

*E*[*Mindex*] = Σ_{k=1}^{n} *k*·(1/*n*) = (1/*n*)·*n*(*n* + 1)/2 = (*n* + 1)/2

It may be known, however, that *Mindex* values tend to occur late in the sequence of entries. Suppose, for example, that the probability is *q* when *Mindex* = 1, 2*q* when *Mindex* = 2, and in general, *kq* when *Mindex* = *k,* for *k* = 1, 2, . . ., *n*. One of these has to be the value, so the probabilities sum to 1:

Σ_{k=1}^{n} *kq* = 1

And so

*q*·*n*(*n* + 1)/2 = 1

whence:

*q* = 2/(*n*(*n* + 1))

Now the expected value of *Mindex* is:

*E*[*Mindex*] = Σ_{k=1}^{n} *k*·(*kq*) = *q*·*n*(*n* + 1)(2*n* + 1)/6 = (2*n* + 1)/3

With this result, using *E*[*Mindex*] to represent *Mindex:*

*T*(*n*) = 3*n* - *E*[*Mindex*] = 3*n* - (2*n* + 1)/3 = (7*n* - 1)/3
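Both expected values can be confirmed numerically. A Python sketch, using the uniform distribution *P*[*k*] = 1/*n* and the weighted distribution *P*[*k*] = *kq* with *q* = 2/(*n*(*n* + 1)):

```python
def expected_uniform(n):
    """E[Mindex] when each position 1..n is equally likely."""
    return sum(k * (1.0 / n) for k in range(1, n + 1))

def expected_weighted(n):
    """E[Mindex] when P[Mindex = k] = k*q, probabilities summing to 1."""
    q = 2.0 / (n * (n + 1))
    return sum(k * (k * q) for k in range(1, n + 1))

n = 100
print(expected_uniform(n))    # close to (n + 1)/2 = 50.5
print(expected_weighted(n))   # close to (2n + 1)/3 = 67
```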

**EP1.2 **Input *n* numbers, sort them, and then display them in nonincreasing order.

A whimsical solution, which is also the solemn solution of top-down programmers, is:

```
program InOut(n)
  Ingress(n)   {Input n entries into array e.
  Sort(n)      {Arrange the entry values in nonincreasing order.
  OutHaul(n)   {Display e in (its new) order.
end {InOut
```

The procedures *Ingress* and *OutHaul* are quickly analyzed:

```
procedure Ingress(n)   {O(n)
  for i = 1 to n do
    Read(e[i])
  next i
end {Ingress
```

```
procedure OutHaul(n)   {O(n)
  for i = 1 to n do
    Write(e[i])
  next i
end {OutHaul
```

One sorting technique begins by searching the entries for a maximal value and exchanging it with *e*[1]. The maximal value is then in *e*[1]. A repeat of the search finds a maximal value among the other *n* - 1 entries and exchanges it with *e*[2]. Exchanges continue until *e*[*n* - 1] is in its proper place. When the first *n* - 1 values are arranged in nonincreasing order in this way, they are all in their proper places. Figure 1.1 is a series of snapshots of the process.

```
procedure SelectionSort(n)
  for Target = 1 to n - 1 do
    Large ← e[Target]
    Spot ← Target
    Hunt               {For the index, Spot, of a maximal
                       {value among the remaining entries.
    Temp ← e[Spot]
    e[Spot] ← e[Target]
    e[Target] ← Temp
  next Target
end {SelectionSort
```

*Hunt* must be developed in order to resolve this further:

```
procedure Hunt
  for i = Target + 1 to n do
    if e[i] > Large
      then Spot ← i
           Large ← e[Spot]
    endif
  next i
end {Hunt
```
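*SelectionSort* and *Hunt* translate almost line for line into a working language. A Python sketch that inlines Hunt as an inner loop (so Spot and Large need not be global) and uses 0-based indexing:

```python
def selection_sort(e):
    """Sort e into nonincreasing order by repeated selection of a maximum."""
    n = len(e)
    for target in range(n - 1):           # Target = 1 to n - 1
        spot = target
        large = e[target]
        for i in range(target + 1, n):    # Hunt for a maximal remaining value
            if e[i] > large:
                spot = i
                large = e[spot]
        e[spot], e[target] = e[target], e[spot]   # the exchange
    return e

print(selection_sort([-5, 2, 10, -7, 0, 10]))   # -> [10, 10, 2, 0, -5, -7]
```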

With the *t* that was calculated above:

*Exercise E1.2 is appropriate at this point.*

Procedure *SelectionSort* moves the maximal value to the top of *e* in a direct way; smaller values eventually move toward the bottom. Another way to produce the same effect is to slide the smallest value to the bottom of *e,* as it is encountered on the first pass, then the second-smallest to the next-to-last position on the second pass, and so on. Because the smallest item cannot be recognized until a pass is complete, the smallest item so far slides downward during each pass. (This general technique is called a *bubble sort* here and in other places, although some programmers reserve that term for particular variations.) Large values slowly rise like a bubble in liquid, one position per pass.

Snapshots of the process that follow assume that the data in an array are originally in this order:

-5 2 10 -7 0 10

*BubbleSort* proceeds by rearranging the data on the first pass as shown in Figure 1.2.

The final result of the second, third, fourth, and fifth passes is shown in Figure 1.3.

One version of the process is:

```
procedure BubbleSort(n)
  for Target = n downto 2 do
    Slide(Target)   {smallest-so-far into e[Target]
  next Target
end {BubbleSort
```

```
procedure Slide(Target)
  for i = 1 to Target - 1 do
    if e[i] < e[i+1]
      then Temp ← e[i]
           e[i] ← e[i+1]
           e[i+1] ← Temp
    endif
  next i
end {Slide
```
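A Python sketch of *BubbleSort* with *Slide* inlined, applied to the sample data above (0-based indexing):

```python
def bubble_sort(e):
    """Sort e into nonincreasing order; each pass slides the smallest-so-far
    toward the bottom while larger values rise one position per pass."""
    n = len(e)
    for target in range(n - 1, 0, -1):    # Target = n downto 2
        for i in range(target):           # Slide: i = 1 to Target - 1
            if e[i] < e[i + 1]:
                e[i], e[i + 1] = e[i + 1], e[i]
    return e

print(bubble_sort([-5, 2, 10, -7, 0, 10]))   # -> [10, 10, 2, 0, -5, -7]
```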

A number of modifications can be made to *BubbleSort,* such as halting as soon as a pass makes no exchanges, or narrowing each pass to stop at the position of the last exchange made on the previous one.

Some modifications of *BubbleSort* are more efficient for some data sets, but not necessarily for all.

*Exercise E1.3 is appropriate at this point.*

Neither *SelectionSort* nor *BubbleSort* seems to be natural to card players. Most people sort cards as they are received from the dealer, one at a time. The first card is simply picked up. The second card is placed on the correct side of the first. The third goes on the outside of the first two or between them, and so on. The crux of this process is that the items input so far are always in order; thus, it is easy to search through them for the proper position of the next one. Rather than combine *Ingress* with *Sort* in procedure *InOut,* we will use the same process *in situ: e*[1] is the first value, *e*[2] is either exchanged with it or not, etc. A picture of the process in snapshots is shown in Figure 1.4.

```
procedure InsertionSort(n)
  for Spot = 2 to n do
    Find(Target, Spot)
    Temp ← e[Spot]
    Shift(Target, Spot)
    e[Target] ← Temp
  next Spot
end {InsertionSort
```

The timing function of *InsertionSort* is:

*T*(*n*) = Σ_{Spot=2}^{n} (*f* + *s*)

where *f* and *s* are the timing functions for *Find* and *Shift,* respectively.

```
procedure Find(Target, Spot)    {f = 1 + (Spot - 1)
  Target ← Spot
  for i = 1 to Spot - 1 do      {  = Spot
    if e[i] < e[Spot]
      then Target ← i
           exit
    endif
  next i
end {Find
```

```
procedure Shift(Target, Spot)
  for i = Spot downto Target + 1 do
    e[i] ← e[i - 1]
  next i                        {s = Spot - Target
end {Shift
```
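A Python sketch combining *InsertionSort*, *Find*, and *Shift* (again sorting into nonincreasing order, with 0-based indexing):

```python
def insertion_sort(e):
    """Items before Spot are kept in nonincreasing order at all times."""
    n = len(e)
    for spot in range(1, n):              # Spot = 2 to n
        # Find: first earlier position whose value is smaller than e[spot]
        target = spot
        for i in range(spot):
            if e[i] < e[spot]:
                target = i
                break                     # the exit in Find
        temp = e[spot]
        # Shift: open a hole at target by moving e[target..spot-1] down one
        for i in range(spot, target, -1):
            e[i] = e[i - 1]
        e[target] = temp
    return e

print(insertion_sort([-5, 2, 10, -7, 0, 10]))   # -> [10, 10, 2, 0, -5, -7]
```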

If it is assumed that *Target* is as likely to be one value in the range 1 . . *Spot* as another (see section 1.3.2), then the expected value of *Target* is

*E*[*Target*] = (*Spot* + 1)/2

whence:

It is possible to speed up the *Find* process with a search technique called *binary search,* which will be discussed in section 7.3. Binary search will not change the order of *InsertionSort,* however, because the repeated use of *Shift* forces *InsertionSort* to be an *O*(*n*^{2}) algorithm anyway. The *Shift* operation can be avoided with a data structure, the *linked list,* which will be discussed in Chapter 3. The use of a linked list, however, precludes the use of a binary search, so the *Find* operation is relatively slow.

One might be tempted to conclude that sorting algorithms are all *O*(*n*^{2}) or worse. The proof of statements of this type is part of the theory of algorithms. The correct statement is that sorting by comparison of entries requires time at least proportional to *n* ln *n*: the sorting problem itself is intrinsically of order *n* ln *n,* no matter what algorithms are dreamed up to solve it. Sorting algorithms that have *O*(*n* ln *n*) behavior are discussed in Chapters 8 and 9.

*Exercise E1.4 is appropriate at this point. Programs PGA.1 and PGA.2 and project PJA.1 (see Part II, section A of this book) are appropriate but depend on familiarity with the material in Part II, section A.*

Computing systems, the programs to which they are applied, and the systems that motivate creation of the programs all tend to increase in complexity as computer science matures. They frequently cannot be characterized by analytical means. One consequence is the growing realization that the behavior of these systems and of even relatively simple programs should be studied experimentally. In practice, that involves choosing from distributions with a pseudorandom number generator and applying statistical measures to results. A very brief introduction to tools that can be used for this kind of study is to be found in Part II, section A.

Programming beyond the elementary level requires *knowledge,* *analysis,* and *technique.* Data structures are an essential part of the knowledge, analysis at some level is needed to compare algorithms, and technique comes with practice and with the awareness of possible variations.

Pseudocode provides a language-independent medium for expressing algorithms. One form (SUE) closely tied to Pascal and similar languages is introduced here.

An introduction to the timing of algorithms leads to the determination of the order of growth by summation or inspection. The growth behavior of algorithms provides a measure of comparison for them throughout the book. The discussion in this chapter is background for understanding the assignment of orders of growth. The timing of three basic sorting algorithms provides an example of the top-down summation approach to deriving timing functions.

A brief discussion of random processes is provided in Part II, section A, as an indication of how the average-case behavior of algorithms and the experimental approach to investigating programs can be carried out. This material is optional and left to program assignments.