5-4 Counting Trailing 0's

If the number of leading zeros instruction is available, then the best way to count trailing 0's is, most likely, to convert it to a count leading 0's problem:

ntz(x) = 32 - nlz(~x & (x - 1))

 

If population count is available, a slightly better method is to form a mask that identifies the trailing 0's, and count the 1-bits in it [Hay2], such as

ntz(x) = pop(~x & (x - 1))
       = 32 - pop(x | -x)

 

Variations exist using other expressions for forming a mask that identifies the trailing zeros of x, such as those given in Section 2-1, "Manipulating Rightmost Bits," on page 11. These methods are also reasonable even if the machine has none of the bit-counting instructions. Using the algorithm for pop(x) given in Figure 5-2 on page 66, the first expression above executes in about 3 + 21 = 24 instructions (branch-free).

Figure 5-13 shows an algorithm that does it directly, in 12 to 20 basic RISC instructions (for x ≠ 0).

Figure 5-13 Number of trailing zeros, binary search.
int ntz(unsigned x) {
   int n;

   if (x == 0) return(32);
   n = 1;
   if ((x & 0x0000FFFF) == 0) {n = n + 16; x = x >> 16;}
   if ((x & 0x000000FF) == 0) {n = n + 8;  x = x >> 8;}
   if ((x & 0x0000000F) == 0) {n = n + 4;  x = x >> 4;}
   if ((x & 0x00000003) == 0) {n = n + 2;  x = x >> 2;}
   return n - (x & 1);
} 

The n + 16 can be simplified to 17 if that helps, and if the compiler is not smart enough to do that for you (the simplification does not affect the instruction counts given here).

Figure 5-14 shows a variation that uses smaller immediate values and simpler operations. It executes in 12 to 21 basic RISC instructions. Unlike the procedure above, when the number of trailing 0's is small, the one in Figure 5-14 executes a larger number of instructions, but also a larger number of "fall-through" branches.

Figure 5-14 Number of trailing zeros, smaller immediate values.
int ntz(unsigned x) {
   unsigned y;
   int n;

   if (x == 0) return 32;
   n = 31;
   y = x << 16;  if (y != 0) {n = n - 16;  x = y;}
   y = x << 8;   if (y != 0) {n = n - 8;   x = y;}
   y = x << 4;   if (y != 0) {n = n - 4;   x = y;}
   y = x << 2;   if (y != 0) {n = n - 2;   x = y;}
   y = x << 1;   if (y != 0) {n = n - 1;}
   return n;
} 

The line just above the return statement may alternatively be coded

n = n - ((x << 1) >> 31); 

which saves a branch, but not an instruction.

In terms of number of instructions executed, it is hard to beat the "search tree" [Aus2]. Figure 5-15 illustrates this procedure for an 8-bit argument. This procedure executes in seven instructions for all paths except the last two (return 7 or 8), which require nine. A 32-bit version would execute in 11 to 13 instructions. Unfortunately, for large word sizes the program is quite large. The 8-bit version of Figure 5-15 is 12 lines of executable source code and would compile into about 41 instructions. A 32-bit version would be 48 lines and about 164 instructions, and a 64-bit version would be twice that.

Figure 5-15 Number of trailing zeros, binary search tree.
int ntz(char x) {
   if (x & 15) {
      if (x & 3) {
         if (x & 1) return 0;
         else return 1;
      }
      else if (x & 4) return 2;
      else return 3;
   }
   else if (x & 0x30) {
      if (x & 0x10) return 4;
      else return 5;
   }
   else if (x & 0x40) return 6;
   else if (x) return 7;
   else return 8;
} 

If the number of trailing 0's is expected to be small or large, then the simple loops below are quite fast. The left-hand algorithm executes in 5 + 3ntz(x) and the right one in 3 + 3(32 - ntz(x)) basic RISC instructions.

Figure 5-16 Number of trailing zeros, simple counting loops.
int ntz(unsigned x) {
   int n;

   x = ~x & (x - 1);
   n = 0;                           // n = 32;
   while (x != 0) {                 // while (x != 0) {
      n = n + 1;                    //    n = n - 1;
      x = x >> 1;                   //    x = x + x;
   }                                // }
   return n;                        // return n;
} 

It is interesting to note that if the numbers x are uniformly distributed, then the average number of trailing 0's is, very nearly, 1.0. To see this, sum the products p(n)·n, where p(n) = 1/2^(n+1) is the probability that there are exactly n trailing 0's. That is,

S = 0(1/2) + 1(1/4) + 2(1/8) + 3(1/16) + ...

 

To evaluate this sum, consider the following array:

1/4   1/8    1/16   1/32   ...
      1/8    1/16   1/32   ...
             1/16   1/32   ...
                    1/32   ...

 

The sum of each column is a term of the series for S. Hence S is the sum of all the numbers in the array. The sums of the rows are

1/2, 1/4, 1/8, 1/16, ...

 

and the sum of these is 1/2 + 1/4 + 1/8 + ... = 1. The absolute convergence of the original series justifies the rearrangement.

Sometimes, a function similar to ntz(x) is wanted, but a 0 argument is a special case, perhaps an error, that should be identified with a value of the function that's easily distinguished from the "normal" values of the function. For example, let us define "the number of factors of 2 in x" to be

number of factors of 2 in x = ntz(x), if x ≠ 0,
                              -1,     if x = 0.

 

This can be calculated from

31 - nlz(x & -x).

 

Applications

[GLS1] points out some interesting applications of the number of trailing zeros function. It has been named the "ruler function" by Eric Jensen, because it gives the height of a tick mark on a ruler that's divided into halves, quarters, eighths, and so on.

It has an application in R. W. Gosper's loop-detection algorithm, which will now be described in some detail because it is quite elegant and it does more than might at first seem possible.

Suppose a sequence X0, X1, X2, ... is defined by Xn+1 = f(Xn). If the range of f is finite, the sequence is necessarily periodic. That is, it consists of a leader X0, X1, ..., Xμ-1 followed by a cycle Xμ, Xμ+1, ..., Xμ+λ-1 that repeats without limit (Xμ = Xμ+λ, Xμ+1 = Xμ+λ+1, and so on, where λ is the period of the cycle). Given the function f, the loop-detection problem is to find the index μ of the first element that repeats, and the period λ. Loop detection has applications in testing random number generators and detecting a cycle in a linked list.

One could save all the values of the sequence as they are produced, and compare each new element with all the preceding ones. This would immediately show where the second cycle starts. But algorithms exist that are much more efficient in space and time.

Perhaps the simplest is due to R. W. Floyd [Knu2, sec. 3.1, prob. 6]. This algorithm iterates the process

x = f(x),
y = f(f(y)),

 

with x and y initialized to X0. After the nth step, x = Xn and y = X2n. These are compared, and if equal, it is known that Xn and X2n are separated by an integral multiple of the period λ; that is, 2n - n = n is a multiple of λ. Then μ can be determined by regenerating the sequence from the beginning, comparing X0 to Xn, then X1 to Xn+1, and so on. Equality occurs when Xμ is compared to Xn+μ. Finally, λ can be determined by regenerating more elements, comparing Xμ to Xμ+1, Xμ+2, .... This algorithm requires only a small and bounded amount of space, but it evaluates f many times.

Gosper's algorithm [HAK, item 132; Knu2, Answers to Exercises for sec. 3.1, prob. 7] finds the period λ, but not the starting point μ of the first cycle. Its main feature is that it never backs up to reevaluate f, and it is quite economical in space and time. It is not bounded in space; it requires a table of size log2(L) + 1, where L is the largest possible period. This is not a lot of space; for example, if it is known a priori that L ≤ 2^32, then 33 words suffice.

Gosper's algorithm, coded in C, is shown in Figure 5-17. This C function is given the function f being analyzed and a starting value X0. It returns lower and upper bounds on μ, and the period λ. (Although Gosper's algorithm cannot compute μ exactly, it can compute lower and upper bounds μl and μu such that μu - μl + 1 ≤ max(λ - 1, 1).) The algorithm works by comparing Xn, for n = 1, 2, ..., to a subset of size floor(log2 n) + 1 of the elements of the sequence that precede Xn. The elements of the subset are the closest preceding Xi such that i + 1 ends in a 1-bit (that is, i is the closest preceding even number), the closest preceding Xi such that i + 1 ends in exactly one 0-bit, the closest preceding Xi such that i + 1 ends in exactly two 0-bits, and so on.

Figure 5-17 Gosper's loop-detection algorithm.
void ld_Gosper(int (*f)(int), int X0, int *mu_l, 
               int *mu_u, int *lambda) {
   int Xn, k, m, kmax, n, lgl;
   int T[33];

   T[0] = X0;
   Xn = X0;
   for (n = 1; ; n++) {
      Xn = f(Xn);
      kmax = 31 - nlz(n);               // Floor(log2 n).
      for (k = 0; k <= kmax; k++) {
         if (Xn == T[k]) goto L;
      }
      T[ntz(n+1)] = Xn;                 // No match.
   }
L:
   // Compute m = max{i | i < n and ntz(i+1) = k}.

   m = ((((n >> k) - 1) | 1) << k) - 1;
   *lambda = n - m;
   lgl = 31 - nlz(*lambda - 1);         // Ceil(log2 lambda) - 1.
   *mu_u = m;                           // Upper bound on mu.
   *mu_l = m - max(1, 1 << lgl) + 1;    // Lower bound on mu.
} 

Thus, the comparisons proceed as follows:

X1 : X0
X2 : X0, X1
X3 : X2, X1
X4 : X2, X1, X3
X5 : X4, X1, X3
X6 : X4, X5, X3
X7 : X6, X5, X3
X8 : X6, X5, X3, X7
...

 

It can be shown that the algorithm always terminates with n somewhere in the second cycle; that is, with n < μ + 2λ. See [Knu2] for further details.

The ruler function reveals how to solve the Tower of Hanoi puzzle. Number the n disks from 0 to n - 1. At each move k, as k goes from 1 to 2^n - 1, move disk ntz(k) the minimum permitted distance to the right, in a circular manner.

The ruler function can be used to generate a reflected binary Gray code (see Section 13-1 on page 235). Start with an arbitrary n-bit word, and at each step k, as k goes from 1 to 2^n - 1, flip bit ntz(k).