The APL language includes an operation called compress, written B/V, where B is a Boolean vector and V is vector of the same length as B, with arbitrary elements. The result of the operation is a vector consisting of the elements of V for which the corresponding bit in B is 1. The length of the result vector is equal to the number of 1's in B.

Here we consider a similar operation on the bits of a word. Given a mask m and a word x, the bits of x for which the corresponding mask bit is 1 are selected and moved ("compressed") to the right. For example, if the word to be compressed is (where each letter denotes a single bit)

abcd efgh ijkl mnop qrst uvwx yzAB CDEF,

and the mask is

0000 1111 0011 0011 1010 1010 0101 0101,

then the result is

0000 0000 0000 0000 efgh klop qsuw zBDF.

This operation might also be called generalized extract, by analogy with the extract instruction found on many computers.

We are interested in code for this operation with minimum worst-case execution time, and offer the simple loop of Figure 7-5 as a straw man to be improved upon. This code has no branches in the loop, and it executes in 260 instructions worst case, including the subroutine prolog and epilog.

unsigned compress(unsigned x, unsigned m) {

牋 unsigned r, s, b;牋?// Result, shift, mask bit.

牋 r = 0;

牋爏 = 0;

牋燿o {

牋牋?b = m & 1;

牋牋牋r = r | ((x & b) << s);

牋牋牋s = s + b;

牋牋牋x = x >> 1;

牋牋牋m = m >> 1;

牋爙 while (m != 0);

牋爎eturn r;

}

It is possible to improve on this by repeatedly using the parallel prefix method (see page 75) with the exclusive or operation [GLS1]. We will denote the parallel prefix operation by PP-XOR. The basic idea is to first identify the bits of argument x that are to be moved right an odd number of bit positions, and move those. (This operation is simplified if x is first anded with the mask, to clear out irrelevant bits.) Mask bits are moved in the same way. Next, we identify the bits of x that are to be moved an odd multiple of 2 positions (2, 6, 10, and so on), and we then move these bits of x and the mask. Next, we identify and move the bits that are to be moved an odd multiple of 4 positions, then those that move an odd multiple of 8, and then those that move 16 bit positions.

Because this algorithm, believed to be original with [GLS1], is a bit difficult to understand, and because it is perhaps surprising that something along these lines can be done at all, we will describe its operation in some detail. Suppose the inputs are

x = abcd efgh ijkl mnop qrst uvwx yzAB CDEF,

m = 1000 1000 1110 0000 0000 1111 0101 0101,

牋牋1牋?1牋?111

牋牋9牋?6牋?333牋牋牋牋牋?4444?3 2?1 0

where each letter in `x`
represents a single bit (with value 0 or 1). The numbers below each 1-bit in
the mask `m` denote how far the
corresponding bit of `x` must move
to the right. This is the number of 0's in `m`
to the right of the bit. As mentioned above, it is convenient to first clear
out the irrelevant bits of `x`,
giving

x = a000 e000 ijk0 0000 0000 uvwx 0z0B 0D0F.

The plan is to first determine which bits move an odd number
of positions (to the right), and move those one bit position. Recall that the
PP-XOR operation results in a 1-bit at each position where the number of 1's at
and to the right of that position is odd. We wish to identify those bits for
which the number of 0's strictly to the right is odd. This can be done by
computing `mk` `=` `~m`
`<<` `1` and performing PP-XOR on the result.
This gives

mk = 1110 1110 0011 1111 1110 0001 0101 0100,

mp = 1010 0101 1110 1010 1010 0000 1100 1100.

Observe that `mk`
identifies the bits of `m` that
have a 0 immediately to the right, and `mp`
sums these, modulo 2, from the right. Thus, `mp`
identifies the bits of `m` that
have an odd number of 0's to the right.

The bits that will be moved one position are those that are in
positions that have an odd number of 0's strictly to the right (identified by `mp`) and that have a 1-bit in the original
mask. This is simply `mv` `=` `mp`
`&` `m`:

mv = 1000 0000 1110 0000 0000 0000 0100 0100.

These bits of `m`
may be moved with the assignment

m = (m ^ mv) | (mv >> 1);

and the same bits of `x`
may be moved with the two assignments

t = x & mv;

x = (x ^ t) | (t >> 1);

(Moving the bits of `m`
is simpler because all the selected bits are 1's.) Here the exclusive or is turning off bits known to be 1 in `m` and `x`,
and the or is turning on bits known to be 0 in `m` and `x`.
The operations could also, alternatively, both be exclusive
or, or subtract and add, respectively. The results, after moving the bits
selected by `mv` right one
position, are:

m = 0100 1000 0111 0000 0000 1111 0011 0011,

x = 0a00 e000 0ijk 0000 0000 uvwx 00zB 00DF.

Now we must prepare a mask for the second iteration, in which
we identify bits that are to move an odd multiple of 2 positions to the right.
Notice that the quantity `mk` `&` `~mp`
identifies those bits that have a 0 immediately to the right in the original
mask `m`, and those bits that have
an even number of 0's to the right in the original mask. These properties apply
jointly, although not individually, to the revised mask `m`. (That is to say, `mk` identifies all
the positions in the revised mask `m`
that have a 0 to the immediate right and an even number of 0's to the right.)
This is the quantity that, if summed from the right with PP-XOR, identifies
those bits that move to the right an odd multiple of 2 positions (2, 6, 10, and
so on). Therefore, the procedure is to assign this quantity to `mk` and perform a second iteration of the
above steps. The revised value of `mk`
is

mk = 0100 1010 0001 0101 0100 0001 0001 0000.

A complete C function for this operation is shown in Figure 7-6. It does the job in 127 basic RISC
instructions (constant), including the subroutine prolog and epilog. Figure 7-7 shows the sequence of values taken
on by certain variables at key points in the computation, with the same inputs
that were used in the discussion above. Observe that a by-product of the
algorithm, in the last value assigned to `m`,
is the original `m` with all its
1-bits compressed to the right.

unsigned compress(unsigned x, unsigned m) {

牋 unsigned mk, mp, mv, t;

牋爄nt i;

牋 x = x & m;牋牋牋牋牋 // Clear irrelevant bits.

牋爉k = ~m << 1;牋牋牋?// We will count 0's to right.

牋 for (i = 0; i < 5; i++) {

牋牋?mp = mk ^ (mk << 1);牋牋牋牋牋牋 // Parallel prefix.

牋牋牋mp = mp ^ (mp << 2);

牋牋牋mp = mp ^ (mp << 4);

牋牋牋mp = mp ^ (mp << 8);

牋牋牋mp = mp ^ (mp << 16);

牋牋牋mv = mp & m;牋牋牋牋牋牋牋牋牋牋 // Bits to move.

牋牋牋m = m ^ mv | (mv >> (1 << i));牋 // Compress m.

牋牋牋t = x & mv;

牋牋牋x = x ^ t | (t >> (1 << i));牋牋 // Compress x.

牋牋牋mk = mk & ~mp;

牋爙

牋爎eturn x;

}

牋牋牋牋牋 x = abcd efgh ijkl mnop qrst uvwx yzAB CDEF

牋牋牋牋牋爉 = 1000 1000 1110 0000 0000 1111 0101 0101

牋牋牋牋牋爔 = a000 e000 ijk0 0000 0000 uvwx 0z0B 0D0F

i = 0,牋?mk = 1110 1110 0011 1111 1110 0001 0101 0100

After PP, mp = 1010 0101 1110 1010 1010 0000 1100 1100

牋牋牋牋牋mv = 1000 0000 1110 0000 0000 0000 0100 0100

牋牋牋牋牋爉 = 0100 1000 0111 0000 0000 1111 0011 0011

牋牋牋牋牋爔 = 0a00 e000 0ijk 0000 0000 uvwx 00zB 00DF

i = 1,牋?mk = 0100 1010 0001 0101 0100 0001 0001 0000

After PP, mp = 1100 0110 0000 1100 1100 0000 1111 0000

牋牋牋牋牋mv = 0100 0000 0000 0000 0000 0000 0011 0000

牋牋牋牋牋爉 = 0001 1000 0111 0000 0000 1111 0000 1111

牋牋牋牋牋爔 = 000a e000 0ijk 0000 0000 uvwx 0000 zBDF

i = 2,牋?mk = 0000 1000 0001 0001 0000 0001 0000 0000

After PP, mp = 0000 0111 1111 0000 1111 1111 0000 0000

牋牋牋牋牋mv = 0000 0000 0111 0000 0000 1111 0000 0000

牋牋牋牋牋爉 = 0001 1000 0000 0111 0000 0000 1111 1111

牋牋牋牋牋爔 = 000a e000 0000 0ijk 0000 0000 uvwx zBDF

i = 3,牋?mk = 0000 1000 0000 0001 0000 0000 0000 0000

After PP, mp = 0000 0111 1111 1111 0000 0000 0000 0000

牋牋牋牋牋mv = 0000 0000 0000 0111 0000 0000 0000 0000

牋牋牋牋牋爉 = 0001 1000 0000 0000 0000 0111 1111 1111

牋牋牋牋牋爔 = 000a e000 0000 0000 0000 0ijk uvwx zBDF

i = 4,牋?mk = 0000 1000 0000 0000 0000 0000 0000 0000

After PP, mp = 1111 1000 0000 0000 0000 0000 0000 0000

牋牋牋牋牋mv = 0001 1000 0000 0000 0000 0000 0000 0000

牋牋牋牋牋爉 = 0000 0000 0000 0000 0001 1111 1111 1111

牋牋牋牋牋爔 = 0000 0000 0000 0000 000a eijk uvwx zBDF

We calculate that the algorithm of Figure 7-6 would execute in 169 instructions on a 64-bit basic RISC, as compared to 516 (worst case) for the algorithm of Figure 7-5.

The number of instructions required by the algorithm of Figure 7-6 can be reduced substantially if the
mask `m` is a constant. This can
occur in two situations: (1) a call to "`compress(x,`
`m)`" occurs in a loop, in
which the value of `m` is not
known but it is a loop constant, and (2) the value of `m` is known and the code for `compress` is generated in advance, perhaps
by a compiler.

Notice that the value assigned to `x` in the loop in Figure 7-6 is not used in the loop for
anything other than the assignment to `x`.
And, `x` is dependent only on
itself and variable `mv`.
Therefore, the subroutine can be coded with all references to `x` deleted, and the five values computed
for `mv` can be saved in variables
`mv0`, `mv1`, ? `mv4`.
Then, in situation (1) the function without references to `x` can be placed outside the loop in which
"`compress(x,` `m)`" occurs, and the following
statements can be placed in the loop:

x = x & m;

t = x & mv0;?x = x ^ t | (t >> 1);

t = x & mv1;?x = x ^ t | (t >> 2);

t = x & mv2;?x = x ^ t | (t >> 4);

t = x & mv3;?x = x ^ t | (t >> 8);

t = x & mv4;?x = x ^ t | (t >> 16);

This is only 21 instructions in the loop (the loading of the constants can be placed outside the loop), a considerable improvement over the 127 required by the full subroutine of Figure 7-7.

In situation (2), in which the value of `m` is known, the same sort of thing can be
done, and further optimization may be possible. It might happen that one of the
five masks is 0, in which case one of the five lines shown above can be
omitted. For example, mask `m1` is
0 if it happens that no bit moves an odd number of positions, and `m4` is 0 if no bit moves more than 15
positions, and so on.

As an example, for

m = 0101 0101 0101 0101 0101 0101 0101 0101,

the calculated masks are

mv0 = 0100 0100 0100 0100 0100 0100 0100 0100

mv1 = 0011 0000 0011 0000 0011 0000 0011 0000

mv2 = 0000 1111 0000 0000 0000 1111 0000 0000

`mv3 = 0000 0000 1111 1111 0000 0000 0000 0000 `

`mv4 = 0000 0000 0000 0000 0000 0000 0000 0000 `

Because the last mask is 0, in the compiled code situation this compression operation is done in 17 instructions (not counting the loading of the masks). This is not quite as good as the code shown for this operation on page 108 (13 instructions, not counting the loading of masks), which takes advantage of the fact that alternate bits are being selected.

If your computer has the insert instruction, preferably with immediate values for the operands that identify the bit field in the target register, then in the compiled situation insert can often be used to do the compress operation with fewer instructions than the methods discussed above. Furthermore, it doesn't tie up registers holding the masks.

The target register is initialized to 0, and
then, for each contiguous group of 1's in the mask `m`, variable `x` is shifted
right to right-justify the next field, and the insert
instruction is used to insert the bits of `x` in the appropriate
place in the target register. This does the operation in 2n + 1 instructions, where n
is the number of fields (groups of consecutive 1's) in the mask. The worst case
is 33 instructions, because the maximum number of fields is 16 (which occurs
for alternating 1's and 0's).

An example in which the insert method uses substantially fewer instructions
is `m` = 0x0010084A. Compressing with this mask requires moving bits 1, 2,
4, 8, and 16 positions. Thus, it takes the full 21 instructions for the parallel
prefix method, but only 11 instructions for the insert
method (there are five fields). A more extreme case is `m` =
0x80000000. Here a single bit moves 31 positions, requiring 21 instructions for
the parallel prefix method, but only three instructions for the insert method and only one instruction (shift right 31) if you are not constrained to any
particular scheme.

You can also use the extract instruction in various simple ways to do the compress operation with a known mask in 3n - 2 instructions, where n is the number of fields in the mask.

Clearly, the problem of compiling optimal code for the compress operation with a known mask is a difficult one.

To compress bits to the left, obviously you
can reverse the argument `x` and the mask, compress right, and reverse the result. Another way
is to compress right and then shift left by pop(m?/span>).
These might be satisfactory if your computer has an instruction for bit
reversal or population count, but if not, the algorithm of Figure 7-6 is
easily adapted: Just reverse the direction of all the shifts except the two in
the expressions `1` `<<` `i` (eight to change).