7-5 General Permutations, Sheep and Goats Operation

To do general permutations of the bits in a word, or of anything else, a central problem is how to represent the permutation. It cannot be represented very compactly. Because there are 32! permutations of the bits in a 32-bit word, at least or three words plus 22 bits, are required to designate one permutation out of the 32!.

One interesting way to represent permutations is closely related to the compression operations discussed in Section 7-4 [GLS1]. Start with the direct method of simply listing the bit position to which each bit moves. For example, for the permutation done by a rotate left of four bit positions, the bit at position 0 (the least significant bit) moves to position 4, 1 moves to 5, ? 31 moves to 3. This permutation can be represented by the vector of 32 5-bit indexes:

...

Treating that as a bit matrix, the representation we have in mind is its transpose, except reflected about the off diagonal so the top row contains the least significant bits and the result uses little-endian bit numbering. This we store as five 32-bit words in array p:

p[0] = 1010 1010 1010 1010 1010 1010 1010 1010

p[1] = 1100 1100 1100 1100 1100 1100 1100 1100

p[2] = 0000 1111 0000 1111 0000 1111 0000 1111

p[3] = 0000 1111 1111 0000 0000 1111 1111 0000

p[4] = 0000 1111 1111 1111 1111 0000 0000 0000

Each bit of p[0] is the least significant bit of the position to which the corresponding bit of x moves, each bit of p[1] is the next more significant bit, and so on. This is similar to the encoding of the masks denoted by mv in the previous section, except that mv applies to revised masks in the compress algorithm, not to the original mask.

The compression operation we need compresses to the left all bits marked with 1's in the mask, and compresses to the right all bits marked with 0's. ^[1] This is sometimes called the "sheep and goats" operation (SAG), or "generalized unshuffle." It can be calculated with

^[1] If big-endian bit numbering is used, compress to the left all bits marked with 0's, and to the right all bits marked with 1's.

SAG(x, m) = compress_left(x, m) | compress(x, ~m).

With SAG as a fundamental operation, and a permutation p described as above, the bits of a word x can be permuted by p in the following 15 steps:

x  ?= SAG(x,  ?p[0]);

p[1] = SAG(p[1], p[0]);

p[2] = SAG(p[2], p[0]);

p[3] = SAG(p[3], p[0]);

p[4] = SAG(p[4], p[0]);

x  ?= SAG(x,  ?p[1]);

p[2] = SAG(p[2], p[1]);

p[3] = SAG(p[3], p[1]);

p[4] = SAG(p[4], p[1]);

x  ?= SAG(x,  ?p[2]);

p[3] = SAG(p[3], p[2]);

p[4] = SAG(p[4], p[2]);

x  ?= SAG(x,  ?p[3]);

p[4] = SAG(p[4], p[3]);

x  ?= SAG(x,  ?p[4]);

In these steps, SAG is used to perform a stable binary radix sort. Array p is used as 32 5-bit keys to sort the bits of x. In the first step, all bits of x for which p[0] = 1 are moved to the left half of the resulting word, and all those for which p[0] = 0 are moved to the right half. Other than this, the order of the bits is not changed (that is, the sort is "stable"). Then all the keys that will be used for the next round of sorting are similarly sorted. The sixth line is sorting x based on the second least significant bit of the key, and so on.

Similarly to the situation of compressing, if a certain permutation p is to be used on a number of words x, then a considerable savings results by precomputing most of the steps above. The permutation array is revised to

p[1] = SAG(p[1], p[0]);

p[2] = SAG(SAG(p[2], p[0]), p[1]);

p[3] = SAG(SAG(SAG(p[3], p[0]), p[1]), p[2]);

p[4] = SAG(SAG(SAG(SAG(p[4], p[0]), p[1]), p[2]), p[3]);

and then each permutation is done with

x = SAG(x, p[0]);

x = SAG(x, p[1]);

x = SAG(x, p[2]);

x = SAG(x, p[3]);

x = SAG(x, p[4]);

A more direct (but perhaps less interesting) way to do general permutations of the bits in a word is to represent a permutation as a sequence of 32 5-bit indexes.

The kth index is the bit number in the source from which the kth bit of the result comes. (This is a "comes from" list, whereas the SAG method uses a "goes to" list.) These could be packed six to a 32-bit word, thus requiring six words to hold all 32 bit indexes. An instruction can be implemented in hardware such as

bitgather Rt,Rx,Ri,

where register Rt is a target register (and also a source), register Rx contains the bits to be permuted, and register Ri contains six 5-bit indexes (and two unused bits). The operation of the instruction is

In words, the contents of the target register are shifted left six bit positions, and six bits are selected from word x and placed in the vacated six positions of t. The bits selected are given by the six 5-bit indexes in word i, taken in left-to-right order. The bit numbering in the indexes could be either little- or big-endian, and the operation would probably be as described for either type of machine.

To permute a word, use a sequence of six such instructions, all with the same Rt and Rx, but different index registers. In the first index register of the sequence, only indexes i₄ and i₅ are significant, as the bits selected by the other four indexes are shifted out of the left end of Rt.

An implementation of this instruction would most likely allow index values to be repeated, so the instruction can be used to do more than permute bits. It can be used to repeat any selected bit any number of times in the target register. The SAG operation lacks this generality.

It is not unduly difficult to implement this as a fast (e.g., one cycle) instruction. The bit selection circuit consists of six 32:1 MUX's. If these are built from five stages of 2:1 MUX's in today's technology (6 ?31 = 186 MUX's in all), the instruction would be faster than a 32-bit add instruction [MD].

Permuting bits has applications in cryptography, and the closely related operation of permuting subwords (e.g., permuting the bytes in a word) has applications in computer graphics. Both of these applications are more likely to deal with 64-bit words, or possibly with 128, than with 32. The SAG and bitgather methods apply with obvious changes to these larger word sizes.

To encrypt or decrypt a message with the Data Encryption Standard (DES) algorithm requires a large number of permutation-like mappings. First, key generation is done, once per session. This involves 17 permutation-like mappings. The first, called "permuted choice 1," maps from a 64-bit quantity to a 56-bit quantity (it selects the 56 non-parity bits from the key and permutes them). This is followed by 16 permutation-like mappings from 56 bits to 48 bits, all using the same mapping, called "permuted choice 2."

Following key generation, each block of 64 bits in the message is subjected to 34 permutation-like operations. The first and last operations are 64-bit permutations, one being the inverse of the other. There are 16 permutations with repetitions that map 32-bit quantities to 48 bits, all using the same mapping. Finally, there are 16 32-bit permutations, all using the same permutation. The total number of distinct mappings is six. They are all constants and are given in [DES].

DES is obsolete, as it was proved to be insecure in 1998 by the Electronic Frontier Foundation, using special hardware. The National Institute of Standards and Technology (NIST) has endorsed a temporary replacement called Triple DES, which consists of DES run serially three times on each 64-bit block, each time with a different key (that is, the key length is 192 bits, including 24 parity bits). Hence it takes three times as many permutation operations as does DES to encrypt or decrypt.

However, the "permanent" replacement for DES and Triple DES, the Advanced Encryption Standard (previously known as the Rijndael algorithm [AES]), involves no bit-level permutations. The closest it comes to a permutation is a simple rotation of 32-bit words by a multiple of 8-bit positions. Other encryption methods proposed or in use generally involve far fewer bit-level permutations than DES.

To compare the two permutation methods discussed here, the bitgather method has the advantages of (1) simpler preparation of the index words from the raw data describing the permutation, (2) simpler hardware, and (3) more general mappings. The SAG method has the advantages of (1) doing the permutation in five rather than six instructions, (2) having only two source registers in its instruction format (which might fit better in some RISC architectures), (3) scaling better to permute a doubleword quantity, and (4) permuting subwords more efficiently.

Item (3) is discussed in [LSY]. The SAG instruction allows for doing a general permutation of a two-word quantity with two executions of the SAG instruction, a few basic RISC instructions, and two full permutations of single words. The bitgather instruction allows for doing it by executing three full permutations of single words plus a few basic RISC instructions. This does not count preprocessing of the permutation to produce new quantities that depend only on the permutation. We leave it to the reader to discover these methods.

Regarding item (4), to permute, for example, the four bytes of a word with bitgather requires executing six instructions, the same as for a general bit permutation by bitgather. But with SAG it can be done in only two instructions, rather than the five required for a general bit permutation by SAG. The gain in efficiency applies even when the subwords are not a power of 2 in size; the number of steps required is where n is the number of subwords, not counting a possible non-participating group of bits that stays at one end or the other.

[LSY] discusses the SAG and bitgather instructions (called "GRP" and "PPERM," respectively), other possible permutation instructions based on networks, and permuting by table lookup.