7- 4 Compress, or Generalized Extract

The APL language includes an operation called compress, written B/V, where B is a Boolean vector and V is vector of the same length as B, with arbitrary elements. The result of the operation is a vector consisting of the elements of V for which the corresponding bit in B is 1. The length of the result vector is equal to the number of 1's in B.

Here we consider a similar operation on the bits of a word. Given a mask m and a word x, the bits of x for which the corresponding mask bit is 1 are selected and moved ("compressed") to the right. For example, if the word to be compressed is (where each letter denotes a single bit)

abcd efgh ijkl mnop qrst uvwx yzAB CDEF,

and the mask is

0000 1111 0011 0011 1010 1010 0101 0101,

then the result is

0000 0000 0000 0000 efgh klop qsuw zBDF.

This operation might also be called generalized extract, by analogy with the extract instruction found on many computers.

We are interested in code for this operation with minimum worst-case execution time, and offer the simple loop of Figure 7-5 as a straw man to be improved upon. This code has no branches in the loop, and it executes in 260 instructions worst case, including the subroutine prolog and epilog.

Figure 7-5 A simple loop for the compress operation.

unsigned compress(unsigned x, unsigned m) {

   unsigned r, s, b;  ?// Result, shift, mask bit.

   r = 0;

   s = 0;

   do {

    ?b = m & 1;

      r = r | ((x & b) << s);

      s = s + b;

      x = x >> 1;

      m = m >> 1;

   } while (m != 0);

   return r;

It is possible to improve on this by repeatedly using the parallel prefix method (see page 75) with the exclusive or operation [GLS1]. We will denote the parallel prefix operation by PP-XOR. The basic idea is to first identify the bits of argument x that are to be moved right an odd number of bit positions, and move those. (This operation is simplified if x is first anded with the mask, to clear out irrelevant bits.) Mask bits are moved in the same way. Next, we identify the bits of x that are to be moved an odd multiple of 2 positions (2, 6, 10, and so on), and we then move these bits of x and the mask. Next, we identify and move the bits that are to be moved an odd multiple of 4 positions, then those that move an odd multiple of 8, and then those that move 16 bit positions.

Because this algorithm, believed to be original with [GLS1], is a bit difficult to understand, and because it is perhaps surprising that something along these lines can be done at all, we will describe its operation in some detail. Suppose the inputs are

x = abcd efgh ijkl mnop qrst uvwx yzAB CDEF,

m = 1000 1000 1110 0000 0000 1111 0101 0101,

    1  ?1  ?111

    9  ?6  ?333          ?4444?3 2?1 0

where each letter in x represents a single bit (with value 0 or 1). The numbers below each 1-bit in the mask m denote how far the corresponding bit of x must move to the right. This is the number of 0's in m to the right of the bit. As mentioned above, it is convenient to first clear out the irrelevant bits of x, giving

x = a000 e000 ijk0 0000 0000 uvwx 0z0B 0D0F.

The plan is to first determine which bits move an odd number of positions (to the right), and move those one bit position. Recall that the PP-XOR operation results in a 1-bit at each position where the number of 1's at and to the right of that position is odd. We wish to identify those bits for which the number of 0's strictly to the right is odd. This can be done by computing mk = ~m << 1 and performing PP-XOR on the result. This gives

mk = 1110 1110 0011 1111 1110 0001 0101 0100,

mp = 1010 0101 1110 1010 1010 0000 1100 1100.

Observe that mk identifies the bits of m that have a 0 immediately to the right, and mp sums these, modulo 2, from the right. Thus, mp identifies the bits of m that have an odd number of 0's to the right.

The bits that will be moved one position are those that are in positions that have an odd number of 0's strictly to the right (identified by mp) and that have a 1-bit in the original mask. This is simply mv = mp & m:

mv = 1000 0000 1110 0000 0000 0000 0100 0100.

These bits of m may be moved with the assignment

m = (m ^ mv) | (mv >> 1);

and the same bits of x may be moved with the two assignments

t = x & mv;

x = (x ^ t) | (t >> 1);

(Moving the bits of m is simpler because all the selected bits are 1's.) Here the exclusive or is turning off bits known to be 1 in m and x, and the or is turning on bits known to be 0 in m and x. The operations could also, alternatively, both be exclusive or, or subtract and add, respectively. The results, after moving the bits selected by mv right one position, are:

m = 0100 1000 0111 0000 0000 1111 0011 0011,

x = 0a00 e000 0ijk 0000 0000 uvwx 00zB 00DF.

Now we must prepare a mask for the second iteration, in which we identify bits that are to move an odd multiple of 2 positions to the right. Notice that the quantity mk & ~mp identifies those bits that have a 0 immediately to the right in the original mask m, and those bits that have an even number of 0's to the right in the original mask. These properties apply jointly, although not individually, to the revised mask m. (That is to say, mk identifies all the positions in the revised mask m that have a 0 to the immediate right and an even number of 0's to the right.) This is the quantity that, if summed from the right with PP-XOR, identifies those bits that move to the right an odd multiple of 2 positions (2, 6, 10, and so on). Therefore, the procedure is to assign this quantity to mk and perform a second iteration of the above steps. The revised value of mk is

mk = 0100 1010 0001 0101 0100 0001 0001 0000.

A complete C function for this operation is shown in Figure 7-6. It does the job in 127 basic RISC instructions (constant), including the subroutine prolog and epilog. Figure 7-7 shows the sequence of values taken on by certain variables at key points in the computation, with the same inputs that were used in the discussion above. Observe that a by-product of the algorithm, in the last value assigned to m, is the original m with all its 1-bits compressed to the right.

Figure 7.6 Parallel prefix method for the compress operation.

unsigned compress(unsigned x, unsigned m) {

   unsigned mk, mp, mv, t;

   int i;

   x = x & m;           // Clear irrelevant bits.

   mk = ~m << 1;      ?// We will count 0's to right.

   for (i = 0; i < 5; i++) {

    ?mp = mk ^ (mk << 1);             // Parallel prefix.

      mp = mp ^ (mp << 2);

      mp = mp ^ (mp << 4);

      mp = mp ^ (mp << 8);

      mp = mp ^ (mp << 16);

      mv = mp & m;                     // Bits to move.

      m = m ^ mv | (mv >> (1 << i));   // Compress m.

      t = x & mv;

      x = x ^ t | (t >> (1 << i));     // Compress x.

      mk = mk & ~mp;

   return x;

Figure 7.7 Operation of the parallel prefix method for the compress operation.

           x = abcd efgh ijkl mnop qrst uvwx yzAB CDEF

           m = 1000 1000 1110 0000 0000 1111 0101 0101

           x = a000 e000 ijk0 0000 0000 uvwx 0z0B 0D0F

i = 0,  ?mk = 1110 1110 0011 1111 1110 0001 0101 0100

After PP, mp = 1010 0101 1110 1010 1010 0000 1100 1100

          mv = 1000 0000 1110 0000 0000 0000 0100 0100

           m = 0100 1000 0111 0000 0000 1111 0011 0011

           x = 0a00 e000 0ijk 0000 0000 uvwx 00zB 00DF

i = 1,  ?mk = 0100 1010 0001 0101 0100 0001 0001 0000

After PP, mp = 1100 0110 0000 1100 1100 0000 1111 0000

          mv = 0100 0000 0000 0000 0000 0000 0011 0000

           m = 0001 1000 0111 0000 0000 1111 0000 1111

           x = 000a e000 0ijk 0000 0000 uvwx 0000 zBDF

i = 2,  ?mk = 0000 1000 0001 0001 0000 0001 0000 0000

After PP, mp = 0000 0111 1111 0000 1111 1111 0000 0000

          mv = 0000 0000 0111 0000 0000 1111 0000 0000

           m = 0001 1000 0000 0111 0000 0000 1111 1111

           x = 000a e000 0000 0ijk 0000 0000 uvwx zBDF

i = 3,  ?mk = 0000 1000 0000 0001 0000 0000 0000 0000

After PP, mp = 0000 0111 1111 1111 0000 0000 0000 0000

          mv = 0000 0000 0000 0111 0000 0000 0000 0000

           m = 0001 1000 0000 0000 0000 0111 1111 1111

           x = 000a e000 0000 0000 0000 0ijk uvwx zBDF

i = 4,  ?mk = 0000 1000 0000 0000 0000 0000 0000 0000

After PP, mp = 1111 1000 0000 0000 0000 0000 0000 0000

          mv = 0001 1000 0000 0000 0000 0000 0000 0000

           m = 0000 0000 0000 0000 0001 1111 1111 1111

           x = 0000 0000 0000 0000 000a eijk uvwx zBDF

We calculate that the algorithm of Figure 7-6 would execute in 169 instructions on a 64-bit basic RISC, as compared to 516 (worst case) for the algorithm of Figure 7-5.

The number of instructions required by the algorithm of Figure 7-6 can be reduced substantially if the mask m is a constant. This can occur in two situations: (1) a call to "compress(x, m)" occurs in a loop, in which the value of m is not known but it is a loop constant, and (2) the value of m is known and the code for compress is generated in advance, perhaps by a compiler.

Notice that the value assigned to x in the loop in Figure 7-6 is not used in the loop for anything other than the assignment to x. And, x is dependent only on itself and variable mv. Therefore, the subroutine can be coded with all references to x deleted, and the five values computed for mv can be saved in variables mv0, mv1, ? mv4. Then, in situation (1) the function without references to x can be placed outside the loop in which "compress(x, m)" occurs, and the following statements can be placed in the loop:

x = x & m;

t = x & mv0;?x = x ^ t | (t >> 1);

t = x & mv1;?x = x ^ t | (t >> 2);

t = x & mv2;?x = x ^ t | (t >> 4);

t = x & mv3;?x = x ^ t | (t >> 8);

t = x & mv4;?x = x ^ t | (t >> 16);

This is only 21 instructions in the loop (the loading of the constants can be placed outside the loop), a considerable improvement over the 127 required by the full subroutine of Figure 7-7.

In situation (2), in which the value of m is known, the same sort of thing can be done, and further optimization may be possible. It might happen that one of the five masks is 0, in which case one of the five lines shown above can be omitted. For example, mask m1 is 0 if it happens that no bit moves an odd number of positions, and m4 is 0 if no bit moves more than 15 positions, and so on.

As an example, for

m = 0101 0101 0101 0101 0101 0101 0101 0101,

the calculated masks are

mv0 = 0100 0100 0100 0100 0100 0100 0100 0100

mv1 = 0011 0000 0011 0000 0011 0000 0011 0000

mv2 = 0000 1111 0000 0000 0000 1111 0000 0000

mv3 = 0000 0000 1111 1111 0000 0000 0000 0000

mv4 = 0000 0000 0000 0000 0000 0000 0000 0000

Because the last mask is 0, in the compiled code situation this compression operation is done in 17 instructions (not counting the loading of the masks). This is not quite as good as the code shown for this operation on page 108 (13 instructions, not counting the loading of masks), which takes advantage of the fact that alternate bits are being selected.

Using Insert and Extract

If your computer has the insert instruction, preferably with immediate values for the operands that identify the bit field in the target register, then in the compiled situation insert can often be used to do the compress operation with fewer instructions than the methods discussed above. Furthermore, it doesn't tie up registers holding the masks.

The target register is initialized to 0, and then, for each contiguous group of 1's in the mask m, variable x is shifted right to right-justify the next field, and the insert instruction is used to insert the bits of x in the appropriate place in the target register. This does the operation in 2n + 1 instructions, where n is the number of fields (groups of consecutive 1's) in the mask. The worst case is 33 instructions, because the maximum number of fields is 16 (which occurs for alternating 1's and 0's).

An example in which the insert method uses substantially fewer instructions is m = 0x0010084A. Compressing with this mask requires moving bits 1, 2, 4, 8, and 16 positions. Thus, it takes the full 21 instructions for the parallel prefix method, but only 11 instructions for the insert method (there are five fields). A more extreme case is m = 0x80000000. Here a single bit moves 31 positions, requiring 21 instructions for the parallel prefix method, but only three instructions for the insert method and only one instruction (shift right 31) if you are not constrained to any particular scheme.

You can also use the extract instruction in various simple ways to do the compress operation with a known mask in 3n - 2 instructions, where n is the number of fields in the mask.

Clearly, the problem of compiling optimal code for the compress operation with a known mask is a difficult one.

Compress Left

To compress bits to the left, obviously you can reverse the argument x and the mask, compress right, and reverse the result. Another way is to compress right and then shift left by pop(m?/span>). These might be satisfactory if your computer has an instruction for bit reversal or population count, but if not, the algorithm of Figure 7-6 is easily adapted: Just reverse the direction of all the shifts except the two in the expressions 1 << i (eight to change).