To do general permutations of the bits in a word, or of anything else, a central problem is how to represent the permutation. It cannot be represented very compactly. Because there are 32! permutations of the bits in a 32-bit word, at least or three words plus 22 bits, are required to designate one permutation out of the 32!.

One interesting way to represent permutations is closely related to the compression operations discussed in Section 7-4 [GLS1]. Start with the direct method of simply listing the bit position to which each bit moves. For example, for the permutation done by a rotate left of four bit positions, the bit at position 0 (the least significant bit) moves to position 4, 1 moves to 5, ? 31 moves to 3. This permutation can be represented by the vector of 32 5-bit indexes:

00100

00101

...

11111

00000

00001

00010

00011

Treating that as a bit matrix, the representation we have in
mind is its transpose, except reflected about the off diagonal so the top row
contains the least significant bits and the result uses little-endian bit
numbering. This we store as five 32-bit words in array `p`:

p[0] = 1010 1010 1010 1010 1010 1010 1010 1010

p[1] = 1100 1100 1100 1100 1100 1100 1100 1100

p[2] = 0000 1111 0000 1111 0000 1111 0000 1111

p[3] = 0000 1111 1111 0000 0000 1111 1111 0000

p[4] = 0000 1111 1111 1111 1111 0000 0000 0000

Each bit of `p[0]`
is the least significant bit of the position to which the corresponding bit of `x` moves, each bit of `p[1]` is the next more significant bit, and
so on. This is similar to the encoding of the masks denoted by `mv` in the previous section, except that `mv` applies to revised masks in the
compress algorithm, not to the original mask.

The compression operation we need compresses to the left all
bits marked with 1's in the mask, and compresses to the right all bits marked
with 0's. ^{[1]} This is sometimes called the
"sheep and goats" operation (SAG), or "generalized
unshuffle." It can be calculated with

^{[1]} If big-endian
bit numbering is used, compress to the left all bits marked with 0's, and to
the right all bits marked with 1's.

SAG(x, m) = compress_left(x, m) | compress(x, ~m).

With SAG as a fundamental operation, and a permutation `p` described as above, the bits of a word `x` can be permuted by `p` in the following 15 steps:

x牋?= SAG(x,牋?p[0]);

p[1] = SAG(p[1], p[0]);

p[2] = SAG(p[2], p[0]);

p[3] = SAG(p[3], p[0]);

p[4] = SAG(p[4], p[0]);

x牋?= SAG(x,牋?p[1]);

p[2] = SAG(p[2], p[1]);

p[3] = SAG(p[3], p[1]);

p[4] = SAG(p[4], p[1]);

x牋?= SAG(x,牋?p[2]);

p[3] = SAG(p[3], p[2]);

p[4] = SAG(p[4], p[2]);

x牋?= SAG(x,牋?p[3]);

p[4] = SAG(p[4], p[3]);

x牋?= SAG(x,牋?p[4]);

In these steps, SAG is used to perform a stable binary radix
sort. Array `p` is used as 32
5-bit keys to sort the bits of `x`.
In the first step, all bits of `x`
for which `p[0]` = 1 are moved to
the left half of the resulting word, and all those for which `p[0]` = 0 are moved to the right half.
Other than this, the order of the bits is not changed (that is, the sort is
"stable"). Then all the keys that will be used for the next round of
sorting are similarly sorted. The sixth line is sorting `x` based on the second least significant
bit of the key, and so on.

Similarly to the situation of compressing, if a certain
permutation `p` is to be used on a
number of words `x`, then a
considerable savings results by precomputing most of the steps above. The
permutation array is revised to

p[1] = SAG(p[1], p[0]);

p[2] = SAG(SAG(p[2], p[0]), p[1]);

p[3] = SAG(SAG(SAG(p[3], p[0]), p[1]), p[2]);

p[4] = SAG(SAG(SAG(SAG(p[4], p[0]), p[1]), p[2]), p[3]);

and then each permutation is done with

x = SAG(x, p[0]);

x = SAG(x, p[1]);

x = SAG(x, p[2]);

x = SAG(x, p[3]);

x = SAG(x, p[4]);

A more direct (but perhaps less interesting) way to do general permutations of the bits in a word is to represent a permutation as a sequence of 32 5-bit indexes.

The kth index is the bit number in the source from which the kth bit of the result comes. (This is a "comes from" list, whereas the SAG method uses a "goes to" list.) These could be packed six to a 32-bit word, thus requiring six words to hold all 32 bit indexes. An instruction can be implemented in hardware such as

bitgather Rt,Rx,Ri,

where register `Rt`
is a target register (and also a source), register `Rx` contains the bits to be permuted, and register `Ri` contains six 5-bit indexes (and two
unused bits). The operation of the instruction is

In words, the contents of the target register are shifted left six bit positions, and six bits are selected from word x and placed in the vacated six positions of t. The bits selected are given by the six 5-bit indexes in word i, taken in left-to-right order. The bit numbering in the indexes could be either little- or big-endian, and the operation would probably be as described for either type of machine.

To permute a word, use a sequence of six such instructions,
all with the same `Rt` and `Rx`, but different index registers. In the
first index register of the sequence, only indexes i_{4}
and i_{5} are significant, as the bits
selected by the other four indexes are shifted out of the left end of `Rt`.

An implementation of this instruction would most likely allow index values to be repeated, so the instruction can be used to do more than permute bits. It can be used to repeat any selected bit any number of times in the target register. The SAG operation lacks this generality.

It is not unduly difficult to implement this as a fast (e.g., one cycle) instruction. The bit selection circuit consists of six 32:1 MUX's. If these are built from five stages of 2:1 MUX's in today's technology (6 ?31 = 186 MUX's in all), the instruction would be faster than a 32-bit add instruction [MD].

Permuting bits has applications in cryptography, and the closely related operation of permuting subwords (e.g., permuting the bytes in a word) has applications in computer graphics. Both of these applications are more likely to deal with 64-bit words, or possibly with 128, than with 32. The SAG and bitgather methods apply with obvious changes to these larger word sizes.

To encrypt or decrypt a message with the Data Encryption Standard (DES) algorithm requires a large number of permutation-like mappings. First, key generation is done, once per session. This involves 17 permutation-like mappings. The first, called "permuted choice 1," maps from a 64-bit quantity to a 56-bit quantity (it selects the 56 non-parity bits from the key and permutes them). This is followed by 16 permutation-like mappings from 56 bits to 48 bits, all using the same mapping, called "permuted choice 2."

Following key generation, each block of 64 bits in the message is subjected to 34 permutation-like operations. The first and last operations are 64-bit permutations, one being the inverse of the other. There are 16 permutations with repetitions that map 32-bit quantities to 48 bits, all using the same mapping. Finally, there are 16 32-bit permutations, all using the same permutation. The total number of distinct mappings is six. They are all constants and are given in [DES].

DES is obsolete, as it was proved to be insecure in 1998 by the Electronic Frontier Foundation, using special hardware. The National Institute of Standards and Technology (NIST) has endorsed a temporary replacement called Triple DES, which consists of DES run serially three times on each 64-bit block, each time with a different key (that is, the key length is 192 bits, including 24 parity bits). Hence it takes three times as many permutation operations as does DES to encrypt or decrypt.

However, the "permanent" replacement for DES and Triple DES, the Advanced Encryption Standard (previously known as the Rijndael algorithm [AES]), involves no bit-level permutations. The closest it comes to a permutation is a simple rotation of 32-bit words by a multiple of 8-bit positions. Other encryption methods proposed or in use generally involve far fewer bit-level permutations than DES.

To compare the two permutation methods discussed here, the bitgather method has the advantages of (1) simpler preparation of the index words from the raw data describing the permutation, (2) simpler hardware, and (3) more general mappings. The SAG method has the advantages of (1) doing the permutation in five rather than six instructions, (2) having only two source registers in its instruction format (which might fit better in some RISC architectures), (3) scaling better to permute a doubleword quantity, and (4) permuting subwords more efficiently.

Item (3) is discussed in [LSY]. The SAG instruction allows for doing a general permutation of a two-word quantity with two executions of the SAG instruction, a few basic RISC instructions, and two full permutations of single words. The bitgather instruction allows for doing it by executing three full permutations of single words plus a few basic RISC instructions. This does not count preprocessing of the permutation to produce new quantities that depend only on the permutation. We leave it to the reader to discover these methods.

Regarding item (4), to permute, for example, the four bytes of a word with bitgather requires executing six instructions, the same as for a general bit permutation by bitgather. But with SAG it can be done in only two instructions, rather than the five required for a general bit permutation by SAG. The gain in efficiency applies even when the subwords are not a power of 2 in size; the number of steps required is where n is the number of subwords, not counting a possible non-participating group of bits that stays at one end or the other.

[LSY] discusses the SAG and bitgather instructions (called "GRP" and "PPERM," respectively), other possible permutation instructions based on networks, and permuting by table lookup.