11-1 Integer Square Root

By the "integer square root" function, we mean the function isqrt(x) = ⌊√x⌋. To extend its range of application and to avoid deciding what to do with a negative argument, we assume x is unsigned. Thus, 0 ≤ x ≤ 2^32 - 1.

Newton's Method

For floating-point numbers, the square root is almost universally computed by Newton's method. This method begins by somehow obtaining a starting estimate g0 of √x. Then, a series of more accurate estimates is obtained from

   g_{n+1} = (g_n + x/g_n)/2.

The iteration converges quadratically; that is, if at some point g_n is accurate to n bits, then g_{n+1} is accurate to 2n bits. The program must have some means of knowing when it has iterated enough, so it can terminate.
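As a concrete illustration of the recurrence (our sketch, not code from the text; the starting estimate g0 = x and the stopping tolerance are arbitrary choices):

```c
#include <math.h>

/* Newton's method for the floating-point square root: iterate
   g(n+1) = (g(n) + x/g(n))/2 until successive guesses agree to
   within a small relative tolerance.  The starting estimate and
   tolerance below are illustrative, not tuned. */
double newton_sqrt(double x) {
    if (x == 0.0) return 0.0;
    double g = x;                        // Crude starting estimate.
    for (;;) {
        double next = 0.5*(g + x/g);     // The Newton step.
        if (fabs(next - g) <= 1e-15*g) return next;
        g = next;
    }
}
```

Because adjacent doubles near the fixed point differ by about one unit in the last place, the relative tolerance test always terminates even if the iteration ends in a two-cycle of neighboring values.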

It is a pleasant surprise that Newton's method works fine in the domain of integers. To see this, we need the following theorem:

Theorem. Let a and g_n be integers with a > 0 and g_n > 0, and let g_{n+1} = ⌊(g_n + ⌊a/g_n⌋)/2⌋. Then

   (a) if g_n > ⌊√a⌋, then ⌊√a⌋ ≤ g_{n+1} < g_n, and
   (b) if g_n = ⌊√a⌋, then g_n ≤ g_{n+1} ≤ g_n + 1.

That is, if we have an integral guess g_n to ⌊√a⌋ that is too high, then the next guess g_{n+1} will be strictly less than the preceding one, but not less than ⌊√a⌋. Therefore, if we start with a guess that's too high, the sequence converges monotonically. If the guess g_n = ⌊√a⌋, then the next guess is either equal to g_n or is 1 larger. This provides an easy way to determine when the sequence has converged: If we start with g0 ≥ ⌊√a⌋, convergence has occurred when g_{n+1} ≥ g_n, and then the result is precisely g_n.

The case a = 0, however, must be treated specially, because this procedure would lead to dividing 0 by 0.
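That stopping rule can be rendered directly in C. The sketch below is ours (the name newton_isqrt and the explicit g0 parameter are not from the text); it requires g0 ≥ ⌊√a⌋, and Figure 11-1 below is the tuned version:

```c
/* Integer Newton's method: given a and a first guess g0 >= isqrt(a),
   guesses strictly decrease until g(n+1) >= g(n), at which point g(n)
   is exactly the integer square root.  a == 0 is handled separately
   to avoid dividing 0 by 0. */
unsigned newton_isqrt(unsigned a, unsigned g0) {
    if (a == 0) return 0;
    unsigned g = g0;
    unsigned next = (g + a/g)/2;
    while (next < g) {        // Guesses strictly decrease ...
        g = next;             // ... until convergence.
        next = (g + a/g)/2;
    }
    return g;                 // Here next >= g, so g = isqrt(a).
}
```

For example, newton_isqrt(99, 99) returns 9; any g0 that is too high works, at the cost of extra iterations.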

Proof. (a) Because g_n is an integer,

   g_{n+1} = ⌊(g_n + ⌊a/g_n⌋)/2⌋ = ⌊(g_n + a/g_n)/2⌋.

Because g_n > ⌊√a⌋ and g_n is an integer, g_n ≥ ⌊√a⌋ + 1 > √a. Define e by g_n = (1 + e)√a. Then e > 0 and

   g_{n+1} = ⌊((1 + e)√a + √a/(1 + e))/2⌋
           = ⌊(1 + e²/(2(1 + e)))√a⌋.

Because 0 < e²/(2(1 + e)) < e, the quantity inside the floor is at least √a and strictly less than (1 + e)√a = g_n; hence ⌊√a⌋ ≤ g_{n+1} < g_n.

(b) Because g_n = ⌊√a⌋, we have g_n ≤ √a < g_n + 1, so that g_n² ≤ a ≤ g_n² + 2g_n. Hence we have

   g_n = ⌊(g_n + g_n²/g_n)/2⌋ ≤ g_{n+1} = ⌊(g_n + a/g_n)/2⌋ ≤ ⌊(g_n + (g_n² + 2g_n)/g_n)/2⌋ = g_n + 1.

 

The difficult part of using Newton's method to calculate ⌊√x⌋ is getting the first guess. The procedure of Figure 11-1 sets the first guess g0 equal to the least power of 2 that is greater than or equal to √x. For example, for x = 4, g0 = 2, and for x = 5, g0 = 4.

Figure 11-1 Integer square root, Newton's method.
int isqrt(unsigned x) {
   unsigned x1;
   int s, g0, g1;

   if (x <= 1) return x;
   s = 1;
   x1 = x - 1;
   if (x1 > 65535) {s = s + 8; x1 = x1 >> 16;}
   if (x1 > 255)   {s = s + 4; x1 = x1 >> 8;}
   if (x1 > 15)    {s = s + 2; x1 = x1 >> 4;}
   if (x1 > 3)     {s = s + 1;}

   g0 = 1 << s;                // g0 = 2**s.
   g1 = (g0 + (x >> s)) >> 1;  // g1 = (g0 + x/g0)/2.

   while (g1 < g0) {           // Do while approximations
      g0 = g1;                 // strictly decrease.
      g1 = (g0 + (x/g0)) >> 1;
   }
   return g0;
}

Because the first guess g0 is a power of 2, it is not necessary to do a real division to get g1; instead, a shift right suffices.
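Figure 11-1 can be checked mechanically against a multiplication-based reference. The harness below (ours, not the book's) reproduces the figure's code under the name isqrt_newton and compares it with a slow loop:

```c
/* Figure 11-1's routine, reproduced so it can be checked against a
   simple multiplication-based reference. */
int isqrt_newton(unsigned x) {
    unsigned x1;
    int s, g0, g1;

    if (x <= 1) return x;
    s = 1;
    x1 = x - 1;
    if (x1 > 65535) {s = s + 8; x1 = x1 >> 16;}
    if (x1 > 255)   {s = s + 4; x1 = x1 >> 8;}
    if (x1 > 15)    {s = s + 2; x1 = x1 >> 4;}
    if (x1 > 3)     {s = s + 1;}

    g0 = 1 << s;                 // g0 = 2**s.
    g1 = (g0 + (x >> s)) >> 1;   // g1 = (g0 + x/g0)/2.

    while (g1 < g0) {            // Do while approximations
        g0 = g1;                 // strictly decrease.
        g1 = (g0 + (x/g0)) >> 1;
    }
    return g0;
}

/* Reference: largest r with r*r <= x (r capped so r*r cannot
   overflow 32 bits). */
unsigned isqrt_ref(unsigned x) {
    unsigned r = 0;
    while (r + 1 <= 65535 && (r + 1)*(r + 1) <= x) r++;
    return r;
}
```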

Because the first guess is accurate to about 1 bit, and Newton's method converges quadratically (the number of bits of accuracy doubles with each iteration), one would expect the procedure to converge within about five iterations (on a 32-bit machine), which requires four divisions (because the first iteration substitutes a shift right). An exhaustive experiment reveals that the maximum number of divisions is five, or four for arguments up to 16,785,407.

If number of leading zeros is available, then getting the first guess is very simple: Replace the first seven executable lines in the procedure above with

if (x <= 1) return x; 
s = 16 - nlz(x - 1)/2; 
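For instance, with GCC or Clang, nlz might be supplied by the __builtin_clz intrinsic (this mapping is our assumption, not part of the text; __builtin_clz is undefined at 0, so that case is special-cased):

```c
/* Number of leading zeros of a 32-bit word, via a GCC/Clang
   intrinsic; __builtin_clz(0) is undefined, hence the guard. */
int nlz(unsigned x) {
    return x == 0 ? 32 : __builtin_clz(x);
}

/* First-guess exponent for Newton's method, valid for x >= 2
   (the x <= 1 cases return early in the surrounding code):
   g0 = 2**s is then the least power of 2 >= sqrt(x). */
int first_guess_shift(unsigned x) {
    return 16 - nlz(x - 1)/2;
}
```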

Another alternative, if number of leading zeros is not available, is to compute s by means of a binary search tree. This method permits getting a slightly better value of g0: the least power of 2 that is greater than or equal to ⌊√x⌋. For some values of x, this gives a smaller value of g0, but a value large enough so that the convergence criterion of the theorem still holds. The difference in these schemes is illustrated below.

Range of x for Figure 11-1    Range of x for Figure 11-2          First Guess g0
0                             0                                   0
1                             1 to 3                              1
2 to 4                        4 to 8                              2
5 to 16                       9 to 24                             4
17 to 64                      25 to 80                            8
65 to 256                     81 to 288                           16
...                           ...                                 ...
2^28 + 1 to 2^30              (2^14 + 1)^2 to (2^15 + 1)^2 - 1    2^15
2^30 + 1 to 2^32 - 1          (2^15 + 1)^2 to 2^32 - 1            2^16

This procedure is shown in Figure 11-2. It is convenient there to treat small values of x (0 ≤ x ≤ 24) specially, so that no divisions are done for them.

Figure 11-2 Integer square root, binary search for first guess.
int isqrt(unsigned x) {
   int s, g0, g1;

   if (x <= 4224)
      if (x <= 24)
         if (x <= 3) return (x + 3) >> 2;
         else if (x <= 8) return 2;
         else return (x >> 4) + 3;
      else if (x <= 288)
         if (x <= 80) s = 3; else s = 4;
      else if (x <= 1088) s = 5; else s = 6;
   else if (x <= 1025*1025 - 1)
      if (x <= 257*257 - 1)
         if (x <= 129*129 - 1) s = 7; else s = 8;
      else if (x <= 513*513 - 1) s = 9; else s = 10;
   else if (x <= 4097*4097 - 1)
      if (x <= 2049*2049 - 1) s = 11; else s = 12;
   else if (x <= 16385*16385 - 1)
      if (x <= 8193*8193 - 1) s = 13; else s = 14;
   else if (x <= 32769*32769 - 1) s = 15; else s = 16;
   g0 = 1 << s;                // g0 = 2**s.

   // Continue as in Figure 11-1.

The worst-case execution time of the algorithm of Figure 11-1, on the basic RISC, is about 26 + (D + 6)n cycles, where D is the divide time in cycles and n is the number of times the while-loop is executed. The worst-case execution time of Figure 11-2 is about 27 + (D + 6)n cycles, assuming (in both cases) that the branch instructions take one cycle. The table below gives the average number of times the loop is executed by the two algorithms, for x uniformly distributed in the indicated range.

x               Figure 11-1    Figure 11-2
0 to 9          0.80           0
0 to 99         1.46           0.83
0 to 999        1.58           1.44
0 to 9999       2.13           2.06
0 to 2^32 - 1   2.97           2.97

If we assume a divide time of 20 cycles and x ranging uniformly from 0 to 9999, then both algorithms execute in about 81 cycles.

Binary Search

Because the algorithms based on Newton's method start out with a sort of binary search to obtain the first guess, why not do the whole computation with a binary search? This method would start out with two bounds, perhaps initialized to 0 and 216. It would make a guess at the midpoint of the bounds. If the square of the midpoint is greater than the argument x, then the upper bound is changed to be equal to the midpoint. If the square of the midpoint is less than the argument x, then the lower bound is changed to be equal to the midpoint. The process ends when the upper and lower bounds differ by 1, and the result is the lower bound.
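A direct rendering of this plain binary search might look as follows (our sketch; the book's refined version is Figure 11-3 below). The square is computed in 64 bits so that midpoints up to 2**16 do not overflow:

```c
/* Plain binary search for isqrt: maintain lo**2 <= x < hi**2,
   guess the midpoint, and narrow until the bounds differ by 1. */
unsigned bsearch_isqrt(unsigned x) {
    unsigned lo = 0, hi = 65536;           // isqrt(x) < 2**16 always.
    while (hi - lo > 1) {
        unsigned m = (lo + hi)/2;          // Midpoint guess.
        if ((unsigned long long)m*m > x)
            hi = m;                        // Guess too high.
        else
            lo = m;                        // Guess low or exact.
    }
    return lo;                             // lo = floor(sqrt(x)).
}
```

With these initial bounds the loop runs exactly 16 times for every x, which is the iteration count quoted in the text.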

This avoids division, but requires quite a few multiplications: 16 if 0 and 2^16 are used as the initial bounds. (The method gets one more bit of precision with each iteration.) Figure 11-3 illustrates a variation of this procedure, which uses initial values for the bounds that are slight improvements over 0 and 2^16. The procedure shown in Figure 11-3 also saves a cycle in the loop, for most RISC machines, by altering a and b in such a way that the comparison is b ≥ a rather than b - a ≥ 1.

Figure 11-3 Integer square root, simple binary search.
int isqrt(unsigned x) {
   unsigned a, b, m;           // Limits and midpoint.

   a = 1;
   b = (x >> 5) + 8;           // See text.
   if (b > 65535) b = 65535;
   do {
      m = (a + b) >> 1;
      if (m*m > x) b = m - 1;
      else         a = m + 1;
   } while (b >= a);
   return a - 1;
}

The predicates that must be maintained at the beginning of each iteration are (a - 1)² ≤ x and (b + 1)² > x. The initial value of b should be something that's easy to compute and close to ⌊√x⌋. Reasonable initial values are x, x ÷ 4 + 1, x ÷ 8 + 2, x ÷ 16 + 4, x ÷ 32 + 8, x ÷ 64 + 16, and so on. Expressions near the beginning of this list are better initial bounds for small x, and those near the end are better for larger x. (The value x ÷ 2 + 1 is acceptable, but probably not useful, because x ÷ 4 + 1 is everywhere a better or equal bound.)
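As a sanity check (ours, not the book's) that Figure 11-3's starting values do bracket the root, the following function verifies (a - 1)² ≤ x < (b + 1)² for the initial a and b:

```c
/* Spot check of Figure 11-3's initial bounds a = 1 and
   b = min(x/32 + 8, 65535): returns 1 if (a-1)**2 <= x < (b+1)**2,
   so the binary search brackets floor(sqrt(x)) from the start. */
int bounds_ok(unsigned x) {
    unsigned a = 1;
    unsigned b = (x >> 5) + 8;
    if (b > 65535) b = 65535;
    unsigned long long lo = (unsigned long long)(a - 1)*(a - 1);
    unsigned long long hi = (unsigned long long)(b + 1)*(b + 1);
    return lo <= x && x < hi;
}
```

(The upper bound holds for all x: writing t = x ÷ 32, one needs (t + 9)² > 32t + 31, and the quadratic t² - 14t + 50 has no real roots.)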

Seven variations on the procedure shown in Figure 11-3 can be more or less mechanically generated by substituting a + 1 for a, or b - 1 for b, or by changing m = (a + b) ÷ 2 to m = (a + b + 1) ÷ 2, or some combination of these substitutions.

The execution time of the procedure shown in Figure 11-3 is about 6 + (M + 7.5)n, where M is the multiplication time in cycles and n is the number of times the loop is executed. The table below gives the average number of times the loop is executed, for x uniformly distributed in the indicated range.

x               Average Number of Loop Iterations
0 to 9          3.00
0 to 99         3.15
0 to 999        4.68
0 to 9999       7.04
0 to 2^32 - 1   16.00

If we assume a multiplication time of 5 cycles and x ranging uniformly from 0 to 9999, the algorithm runs in about 94 cycles. The maximum execution time (n = 16) is about 206 cycles.

If number of leading zeros is available, the initial bounds can be set from

b = (1 << (33 - nlz(x))/2) - 1; 
a = (b + 3)/2; 

That is, b = 2^((33 - nlz(x)) ÷ 2) - 1. These are very good bounds for small values of x (one loop iteration for 0 ≤ x ≤ 15), but only a moderate improvement, for large x, over the bounds calculated in Figure 11-3. For x in the range 0 to 9999, the average number of iterations is about 5.45, which gives an execution time of about 74 cycles, using the same assumptions as above.

A Hardware Algorithm

There is a shift-and-subtract algorithm for computing the square root that is quite similar to the hardware division algorithm described in Figure 9-2 on page 149. Embodied in hardware on a 32-bit machine, this algorithm employs a 64-bit register that is initialized to 32 0-bits followed by the argument x. On each iteration, the 64-bit register is shifted left two positions, and the current result y (initially 0) is shifted left one position. Then 2y + 1 is subtracted from the left half of the 64-bit register. If the result of the subtraction is nonnegative, it replaces the left half of the 64-bit register, and 1 is added to y (this does not require an adder, because y ends in 0 at this point). If the result of the subtraction is negative, then the 64-bit register and y are left unaltered. The iteration is done 16 times.

This algorithm was described in 1945 [JVN].

Perhaps surprisingly, this process runs in about half the time of that of the 64 ÷ 32 ⇒ 32 hardware division algorithm cited, because it does half as many iterations and each iteration is about equally complex in the two algorithms.

To code this algorithm in software, it is probably best to avoid the use of a doubleword shift register, which requires about four instructions to shift. The algorithm in Figure 11-4 [GLS1] accomplishes this by shifting y and a mask bit m to the right. It executes in about 149 basic RISC instructions (average). The two expressions y | m could also be y + m.

Figure 11-4 Integer square root, hardware algorithm.
int isqrt(unsigned x) {
   unsigned m, y, b;

   m = 0x40000000;
   y = 0;
   while (m != 0) {              // Do 16 times.
      b = y | m;
      y = y >> 1;
      if (x >= b) {
         x = x - b;
         y = y | m;
      }
      m = m >> 2;
   }
   return y;
}

The operation of this algorithm is similar to the grade-school method. It is illustrated below, for finding ⌊√179⌋ on an 8-bit machine.

  1011 0011   x0   Initially, x = 179 (0xB3).
-  1          b1
  ---------
  0111 0011   x1   0100 0000   y1
-  101        b2   0010 0000   y2
  ---------
  0010 0011   x2   0011 0000   y2
-   11 01     b3   0001 1000   y3
  ---------
  0010 0011   x3   0001 1000   y3   (Can't subtract.)
-    1 1001   b4   0000 1100   y4
  ---------
  0000 1010   x4   0000 1101   y4

The result is 13 with a remainder of 10 left in register x.

It is possible to eliminate the if x >= b test by the usual trickery involving shift right signed 31. It can be proved that the high-order bit of b is always zero (in fact, b ≤ 5 · 2^28), which simplifies the x >= b predicate (see page 22). The result is that the if statement group can be replaced with

t = (int)(x | ~(x - b)) >> 31;   // -1 if x >= b, else 0.
x = x - (b & t);
y = y | (m & t);

This replaces an average of three cycles with seven, assuming the machine has or not, but it might be worthwhile if a conditional branch in this context takes more than five cycles.
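Folding that replacement into Figure 11-4 gives a branch-free variant (our assembly of the pieces; as elsewhere in this chapter, the cast and signed shift assume 32-bit ints with arithmetic shift right):

```c
/* Branch-free form of the hardware square root algorithm:
   t is all 1s when x >= b (valid because the high-order bit of b
   is always 0), so the subtraction and the or are masked by t. */
int isqrt_branchfree(unsigned x) {
    unsigned m = 0x40000000, y = 0, b, t;

    while (m != 0) {                    // Do 16 times.
        b = y | m;
        y = y >> 1;
        t = (unsigned)((int)(x | ~(x - b)) >> 31);  // -1 if x >= b.
        x = x - (b & t);
        y = y | (m & t);
        m = m >> 2;
    }
    return y;
}
```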

Somehow it seems that it should be easier than some hundred cycles to compute an integer square root in software. Toward this end, we offer the expressions below to compute it for very small values of the argument. These can be useful to speed up some of the algorithms given above, if the argument is expected to be small.

The expression                  is correct in the range    and uses this many instructions (full RISC)
x                               0 to 1                     0
x > 0                           0 to 3                     1
graphics/11icon25.gif           0 to 3                     2
graphics/11icon26.gif           0 to 3                     2
graphics/11icon27.gif           0 to 5                     2
graphics/11icon28.gif           1 to 8                     2
graphics/11icon29.gif           4 to 15                    2
(x > 0) + (x > 3)               0 to 8                     3
(x > 0) + (x > 3) + (x > 8)     0 to 15                    5
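The last two rows can be verified exhaustively (a check of ours, comparing each expression with a multiplication-based reference over its stated range):

```c
/* Returns 1 if (x > 0) + (x > 3) equals floor(sqrt(x)) for
   0 <= x <= 8, and (x > 0) + (x > 3) + (x > 8) equals it for
   0 <= x <= 15; returns 0 otherwise. */
int check_small_isqrt(void) {
    for (unsigned x = 0; x <= 15; x++) {
        unsigned r = 0;
        while ((r + 1)*(r + 1) <= x) r++;          // r = isqrt(x).
        if (x <= 8 && (unsigned)((x > 0) + (x > 3)) != r)
            return 0;
        if ((unsigned)((x > 0) + (x > 3) + (x > 8)) != r)
            return 0;
    }
    return 1;
}
```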

Ah, the elusive square root,

It should be a cinch to compute.

But the best we can do

Is use powers of two

And iterate the method of Newt!