The Block Cipher Square Algorithm

Dr. Dobb's Journal October 1997

A fast block cipher that uses a 128-bit key

By Joan Daemen, Lars R. Knudsen, and Vincent Rijmen

Joan is an employee of Banksys, Brussels. He can be contacted at Lars and Vincent are with the COSIC research group of the Department of Electrical Engineering of the Katholieke Universiteit Leuven, Belgium. They can be contacted at and, respectively.
Sidebar: The Galois Field GF(28)

Square is a new fast block cipher that encrypts data in blocks of 128 bits, using a 128-bit key. Most cipher designs are optimized for one specific implementation platform. Square's structure (that is, its building blocks and their interaction) has been carefully chosen to allow very efficient implementations on a wide range of processors. This versatility is achieved through the use of operations in the finite field GF(28), which is a new approach in block cipher design. A reference implementation was written in C by Paulo Barreto ( and runs at 2.63 MB on a 100 Mhz Pentium (available electronically; see "Availability," page 3). By optimizing in assembler, this speed was doubled.


Square is an iterated, but not a Feistel, cipher. Instead, every round acts upon all 16 bytes of its the input. One round of Square consists of four different operations. These operations can be efficiently combined in a single set of table lookups and XOR operations. The round operations are easiest to visualize if you think of the 16 bytes as being arranged in a 4×4 array; see Figure 1. An intermediate value a can then be represented by the 16 bytes ai,j,0<=i,j< 4. The four operations are:


Square was designed following the Wide Trail Design Strategy (see "Cipher and Hash Function Design Strategies based on Linear and Differential Cryptanalysis," by Joan Daemen, doctoral dissertation, March 1995, K.U. Leuven), which separates the functionality of the block cipher round operation in separate layers -- the nonlinear substitution, linear diffusion layer, and key addition. Strict criteria are imposed on each layer to build ciphers that resist linear and differential cryptanalysis.

For a description of the basic principles of linear and differential cryptanalysis, refer to "Differential and Linear Cryptanalysis," by Bruce Schneier (DDJ, January 1996). In short, differential cryptanalysis works by applying small changes to the input of the algorithm and predicting the resulting differences at various intermediate stages of the algorithm. Resistance against differential cryptanalysis can be improved by ensuring that small differences at the input cause large differences at the output (and intermediate) values. In Square, this diffusion of differences is ensured by the two linear transformations: [theta] and [pi].

q takes care of the intrarow diffusion. If one byte of a row at the input is different, all four bytes of this row in the output are different. For example, if n bytes of a row at the input are different (1<=n<=4), at least 5-n output bytes of this row are different.

[pi] produces interrow diffusion. After [pi], the four bytes of a row are spread over the four different rows.

Together, [theta] and [pi] cause a very quick spread of differences; see Figure 2. Suppose two bytes of the same row at the input of the first round are different, then after [theta], at least three bytes of this row are different. [rho] distributes these bytes to three different rows at the input of the next round. Then 3×4 of the output bytes of [theta] are different. The differences are distributed by [rho] over the four rows, and so on.

An immediate consequence of the mathematical structure of Square is that the resistance against linear cryptanalysis can be treated in an analogous way. Square is resistant to linear and differential cryptanalysis after six of its eight rounds.

There exists, however, a dedicated attack that breaks six rounds of Square (see our paper "The Block Cipher Square," in Fast Software Encryption, E. Biham, editor, Springer-Verlag 1995). Since Square has eight rounds, this dedicated attack poses no problems.


To illustrate how Square can be used, we have written a smart card application for the Motorola M68HC05 microcontroller. (Smart cards are becoming important as a platform for implementing cryptography and security.) The machine code occupies 547 bytes of ROM and needs 36 bytes of RAM. One execution of Square (including the key schedule) takes about 7500 cycles. This corresponds to less than 2 milliseconds with a 4-MHz clock. In comparison, a typical DES implementation occupies about 1 KB of ROM. One DES operation encrypts only 64 bits and needs about 16,000 cycles.

Writing out the consecutive transformations of the cipher for 32-bit processors, the succession of steps

occurs seven times. These transformations can be combined into a single set of table lookups. An intermediate value of the text can be represented by four 32-bit words, each containing a row [ai]. Its transpose is denoted by [ai]T. For
we have Figure 3(a). We define the tables M and T as Figure 3(b). T and M have 256 entries of four bytes each. The table M implements the multiplications in GF(28). T combines the nonlinear substitution with the multiplications. Define rotrj([ai]) as the rotation of the row [ai] with j bytes to the right. This yields Figure 3(c). We conclude that
can be done with 16 table lookups, 12 rotations, and 16 XORs of 32-bit words. This implementation needs the table T, with 256 entries of four bytes -- 1 KB in total.

Further Reading

More details about Square are available in our paper "The Block Cipher Square," which has been published in Fast Software Encryption (E. Biham, editor, Springer-Verlag 1995) and at http://www


Copyright © 1997, Dr. Dobb's Journal

Back to Table of Contents