Faster FFTs

J.G.G. Dobbe

Iwan is a hardware and software developer in the department of medical bio-engineering at the Academic Medical Center in Amsterdam, Holland.


by Bruce Schneier

The Fourier transform is an essential tool of modern applied mathematics. Named after the French mathematician and physicist Jean-Baptiste Joseph Fourier (1768--1830), it has found uses in all areas where information or signals are transmitted or analyzed: concert-hall acoustics, computer vision, speech analysis, aerodynamics, fluid control, broadcasting, and process control.

What Fourier devised is a way to express periodic functions as an infinite series of sine and cosine functions. These sorts of expressions are a way of putting all periodic functions within the same reference. Even though they are all different, you can think about them in the same way. The Fourier transform is the method for deriving the various coefficients for the sine and cosine functions.

The Fourier transform assumes that the function is periodic even though in the real world, functions are most often not. However, most applications only care about the behavior of a function within a certain interval. The trick is to assume that the function is periodic outside that interval, then just not worry about it. The Fourier transform of a sampled function that is assumed periodic outside the interval is called a "Discrete Fourier Transform" (DFT).

Getting a computer to do a DFT requires an incredible number of calculations and is generally not feasible. Consequently, mathematicians have invented the "Fast Fourier Transform" (FFT)--a smarter, divide-and-conquer algorithm that skips most of the DFT's unnecessary calculations and makes all of this stuff accessible. Think of the Fourier transform as a mathematical tool, and the FFT as a computational tool for mathematical applications.

In this month's "Algorithm Alley," Iwan describes two implementations of a radix 2 FFT algorithm, one written in C and the other in assembler. The assembler version can be linked to C or Pascal programs and can run a 1024-point FFT on a 486/66 in 36 msec, or in just 12 msec on a 60-MHz Pentium. This performance is comparable to what you can get from a dedicated digital signal processing (DSP) board--a 1024-point FFT in 1--4 msec--and is much cheaper. Figure 1 provides various performance results on a variety of processors.

When computers are involved in signal processing, an analog-to-digital converter must be used to digitize the incoming analog signal. This results in an array of samples representing the original analog signal. It is possible to perform a Discrete Fourier Transform (DFT) on the array of samples resulting in the discrete-density spectrum in Example 1, where fn is the nth sample in time domain. Because the number of complex spectral components (N) in the frequency domain equals the number of signal samples, the same information is available in both domains, and direct inverse transformation is permissible, regaining the original samples. This also eases programming because one array can be used for both time and frequency samples.

When computing an N-point DFT, N2 multiplications are required. Some of those factors appear several times throughout the calculation. Using a smarter computation method called the "Fast Fourier Transform" (FFT), it is possible to skip duplicate calculations, saving a lot of execution time.

The secret behind the FFT algorithm is splitting an N-point DFT into two N/2-point DFTs. Computing an N/2-point DFT takes N2/4 multiplications, so two N/2-point DFTs can be calculated with N2/2 multiplications. Note that this is half the number of calculations required to perform an N-point DFT. While some combining of the N/2-point DFTs is required to obtain the original N-point DFT, computation time is reduced almost by a factor of 2. By continuing this halving process, finally 2-point DFTs remain to be calculated. When the combination method is disregarded, FFT execution time is the result of the equation in Example 2. F(k) can be written as a combination of two N/2 point DFTs, F1(k) and F2(k), holding the even and odd samples of F(k), respectively; see Example 3. Note that the sequence of Q-factors is the same for both equations of F(k). This characteristic is used to reduce execution time. The equation for finding F(k) for one Q-factor is graphically represented by the so-called FFT "Butterfly" in Figure 2.

The computation of an N-point FFT can be graphically represented as a whole network of butterflies for calculating an 8-point FFT, as in Figure 3. The network in Figure 3 consists of sections that combine two N/2-point FFTs. If you read from right to left, the butterflies become smaller, resulting in the following 2-point DFTs:


Shuffle Samples

While splitting the N-point DFT into two N/2-point DFTs, one DFT consists of the even samples and the other of the odd samples. When each N/2-point DFT is split into two N/2-point DFTs, the first N/4-point DFT consists of the even samples of the N/2-point DFT, while the odd samples give the second N/4-point DFT.

By shuffling the FFT input samples in a structured way, the output data is ordered after performing the FFT. When N is a power of two, this shuffle algorithm is relatively simple. The index of an array sample in binary format can be reversed to obtain the index of the array sample for exchange. When the first half of the array is shuffled (index 0..3), the other half automatically comes in the right place; see Table 1. To find the array index for exchanging elements, a compact and fast piece of code like Example 4 can be written in assembler in which AX holds the array index for exchange with the original element (OldIndex). (The file PCFFT.ASM includes code for calculating shuffle indexes and is available electronically; see page 3.)

Speeding Up FFT Performance

The FFT function prototype void Fft(float *Re, float *Im, int Pwr, int Dir); (where Re and Im are the base addresses of the real and imaginary arrays) shows the complex array split into its real and imaginary parts. During the FFT execution, elements in the arrays are reached by adding the right offset to the base addresses. By splitting the complex array in two, the same offset applies for indexing an element in both real and imaginary arrays. This results in saving execution time, both during FFT calculation and when calculating the power spectrum from the real and imaginary elements.

Size of Complex Numbers

An FFT can be performed on any type and size of variables. Integer numbers are often small in size so less time is spent on memory access. By using floating-point numbers, accuracy of integer FFTs can be improved at the cost of execution time. Coprocessors can work with IEEE floating-point numbers of different size--short-real (32 bits), long-real (64 bits), and a so-called "temp-real" format (80 bits). However, the more bits involved, the longer it will take the coprocessor to access memory. In short, when working with floating-point numbers, the fastest algorithm can be obtained with the short-real type. Note that some Pascal compilers convert the REAL type into a 48-bits (6-byte) variant not supported by the coprocessor. In such cases, the coprocessor is emulated in software; however, execution slows down in the process.

Improving the Goniometry

When the computer executes an FFT, it must calculate the factor Qk=cos(alphak)+i*sin(alphak), with alpha=-2pi/N and k=0,1,_, N-1, over and over again. Because sin and cos are time-consuming instructions (even for coprocessors), they slow down FFT performance. It is possible, however, to use a table of constants holding start values alpha for each N-point FFT and derive cos(alphak) and sin(alphak) from the start value by faster multiplication and subtraction; see Example 5. Therefore, Qk+1 can be calculated from Qk and the table values for cos(alpha) and sin(alpha). Note that cos(alpha) and sin(alpha) for finding the first Qk for k=0 can also be found in the tables.

Using a Coprocessor

Math coprocessors are typically stack-based. The operands must be loaded onto the stack first, before operations can be released to them. You can reduce FFT execution time by reducing the number of bus accesses. When writing coprocessor assembler code, you can leave numbers on the coprocessor stack that can be used later on during the calculation. Not all compilers are this farsighted. The calculation of sin(alpha(k+1)) and cos(Aalpha(k+1)) can be computed with a very compact piece of code like that in PCFFT.ASM. With languages such as C or Pascal, the efficiency of the code is at the mercy of the compiler and often is not easy to see through and improve.

Faster FFT Code

I've implemented the FFT algorithm described here in both C and assembler. PCFFT.C (Listing One) is the C version, while PCFFT.ASM (available electronically) is written in assembler. Listing Two is the header file for PCFFT.C. Both C and assembler versions use the same function prototype: void Fft(float *Re, float *Im, int Pwr, int Dir);, where Re, Im are pointers to arrays of (32-bit) floating-point numbers holding the real and imaginary part of the input, respectively; Pwr holds the size of the arrays as a power of two (for example, when a 1024-point FFT is to be calculated, Pwr should be equal to 10); and Dir determines whether an FFT (Dir >= 1) or an inverse FFT (Dir <= 0) should be performed.

PCFFTest.C (Listing Three) includes the PCFFT.H header file and shows how the FFT function must be called. You can determine whether to use the C or assembler version by linking the right object file.

As Figure 1 illustrates, the assembler version is the fastest and makes direct use of your system's coprocessor. Set your compiler to generate the fastest code (80x86); when using the assembler version, define the COPROC287 symbol at the assembler command line--if your system has a 80x87 coprocessor--for maximum performance.

The assembler version can also be used to link with Borland Pascal, by defining the PSCL symbol at the assembler command line. When this is done, the correct parameter passing sequence is applied. With Borland Turbo Assembler (TASM 3.0), use the command lines in Example 6 to generate the object file (PCFFT.OBJ) of your choice.


Kaldewaij, A. and J. van Tiel. Voortgezette Wiskunde; Fourier-theorie en systeemtheorie. Utrecht, The Netherlands: Scheltema & Holkema, 1983.

Nieland, H.M. "Fourier Analyse." Natuur en Techniek (no. 3, 1990).

Lawrence, R.R., and B. Gold. Theory and Application of Digital Signal Processing. Englewood Cliffs, NJ: Prentice Hall, 1975.

Pettit, F. Fourier Transforms in Action. Chartwell-Bratt Ltd. (Publishing and Training), Bromley, U.K., 1985.

Press, William et al. Numerical Recipes in C, Second Edition. Cambridge: Cambridge University Press, 1992.

Example 1 Performing a DFT on an array of samples results in the discrete density spectrum. Example 2 When the combination method is disregarded, FFT execution time is the result of this equation. Example 3 F(k) can be written as a combination of two N/2-point DFTs, F1(k) and F2(k), holding the even and odd samples of F(k), respectively. Figure 1 FFT execution time on PC using Borland C++ 3.1 and assembler. The assembler code reduces execution time by about another 35 percent. Figure 2 FFT "Butterfly," where the "hub" represents a summation/ subtraction point, while the arrow represents a multiplication.

Table 1: Before performing the FFT, array elements are shuffled. The shuffle index for element X is found by placing its binary format in reversed order.

    Old     Binary   Reversed   New    
    Index   Format   Binary     Index   
    0       000      000        0
    1       001      100        4
    2       010      010        2
    3       011      110        6
    4       100      001        1
    5       101      101        5
    6       110      011        3
    7       111      111        7

Figure 3 8-point FFT in "Butterfly" notation. Reading the graph from left to right shows how N/2-point FFTs are combined. Reading from right to left shows how an N-point FFT is reduced to the calculation of 2-point DFTs.

Example 4: Assembler code to find shuffle index.

    MOV    BX, OldIndex   ; BX = Old index
    MOV    AX, 0000H      ; AX = 0
    MOV    CX, WordLength ; CX = Wordlength in bits
    CLC                   ; Clear carry flag
    RCR    BX,1           ; Rotate BX through carry right 1 bit
    RCL    AX,1           ; Rotate AX left using carry 1 bit
    LOOP   NextBit        ; Repeat till whole word reversed

Example 5 Deriving cos (alpha(k+1)) and sin (alpha(k+1)) from a table of constants.

Example 6: Command lines for (a) C-linkable code and (b) Pascal-linkable code.

For XT: TASM pcfft.asm
For AT: TASM /dCOPROC287=1 pcfft.asm
For XT: TASM /dPSCL=1 pcfft.asm
For AT: TASM /dPSCL=1 /dCOPROC287=1 pcfft.asm

Listing One

/* PCFFT.C  -- by J.G.G. Dobbe -- Performs an FFT on two arrays (Re, Im) of
               type float (can be changed). This unit is written in C and
               doesn't call assembler routines.

/* --------------------- Include directive ------------------------ */
#include "pcfft.h"

/* --------------------- Local variables -------------------------- */
static float CosArray[28] =
{ /* cos{-2pi/N} for N = 2, 4, 8, ... 16384 */
 -1.00000000000000,  0.00000000000000,  0.70710678118655,
  0.92387953251129,  0.98078528040323,  0.99518472667220,
  0.99879545620517,  0.99969881869620,  0.99992470183914,
  0.99998117528260,  0.99999529380958,  0.99999882345170,
  0.99999970586288,  0.99999992646572,
  /* cos{2pi/N} for N = 2, 4, 8, ... 16384 */
 -1.00000000000000,  0.00000000000000,  0.70710678118655,
  0.92387953251129,  0.98078528040323,  0.99518472667220,
  0.99879545620517,  0.99969881869620,  0.99992470183914,
  0.99998117528260,  0.99999529380958,  0.99999882345170,
  0.99999970586288,  0.99999992646572
static float SinArray[28] =
{ /* sin{-2pi/N} for N = 2, 4, 8, ... 16384 */
  0.00000000000000, -1.00000000000000, -0.70710678118655,
 -0.38268343236509, -0.19509032201613, -0.09801714032956,
 -0.04906767432742, -0.02454122852291, -0.01227153828572,
 -0.00613588464915, -0.00306795676297, -0.00153398018628,
 -0.00076699031874, -0.00038349518757,
  /* sin{2pi/N} for N = 2, 4, 8, ... 16384 */
  0.00000000000000,  1.00000000000000,  0.70710678118655,
  0.38268343236509,  0.19509032201613,  0.09801714032956,
  0.04906767432742,  0.02454122852291,  0.01227153828572,
  0.00613588464915,  0.00306795676297,  0.00153398018628,
  0.00076699031874,  0.00038349518757

/* --------------------- Function implementations ----------------- */
/* --------------------- ShuffleIndex ----------------------------- */
static unsigned int ShuffleIndex(unsigned int i, int WordLength)

/* Function     : Finds the shuffle index of array elements. The array length
                  must be a power of two; The power is stored in "WordLength".
   Return value : With "i" the source array index, "ShuffleIndex"
                  returns the destination index for shuffling.
   Comment      : -
  unsigned int  NewIndex;
  unsigned char BitNr;
  NewIndex = 0;
  for (BitNr = 0; BitNr <= WordLength - 1; BitNr++)
    NewIndex = NewIndex << 1;
    if ((i & 1) != 0) NewIndex = NewIndex + 1;
    i = i >> 1;
  return NewIndex;
/* --------------------- Shuffle2Arr ------------------------------ */
static void Shuffle2Arr(float *a, float *b, int bitlength)
/* Function     : Shuffles both arrays "a" and "b". This function is called 
                 before performing the actual FFT so the array elements
                 are in the right order after FFT.
  Return value : -
  Comment      : -
  unsigned int  IndexOld, IndexNew;
  float         temp;
  unsigned int  N;
  int bitlengthtemp;

  bitlengthtemp = bitlength;                  /* Save for later use */
  N = 1;                                       /* Find array-length */
    N = N * 2;
    bitlength = bitlength - 1;
  } while (bitlength > 0) ;
                                            /* Shuffle all elements */
  for (IndexOld = 0; IndexOld <= N - 1; IndexOld++)
  {                              /* Find index to exchange elements */
    IndexNew = ShuffleIndex(IndexOld, bitlengthtemp);
    if (IndexNew > IndexOld)
    {                                         /* Exchange elements: */
      temp = a[IndexOld];                             /* Of array a */
      a[IndexOld] = a[IndexNew];
      a[IndexNew] = temp;
      temp = b[IndexOld];                             /* Of array a */
      b[IndexOld] = b[IndexNew];
      b[IndexNew] = temp;
/* --------------------- Fft -------------------------------------- */
void Fft(float *Re, float *Im, int Pwr, int Dir)

/* Function     : Actual FFT algorithm. "Re" and "Im" point to start of real 
                 and imaginary arrays of numbers, "Pwr" holds the array sizes
                 as a power of 2 while "Dir" indicates whether an FFT (Dir>=1)
                 or an inverse FFT must be performed (Dir<=0).
  Return value : The transformed information is returned by "Re"
                 and "Im" (real and imaginary part respectively).
  Comment      : -
  int pwrhelp;
  int N;
  int Section;
  int AngleCounter;
  int FlyDistance;
  int FlyCount;
  int index1;
  int index2;
  float tempr, tempi;
  float Re1, Re2, Im1, Im2;
  float c, s;
  float scale;
  float sqrtn;
  float temp;
  float Qr, Qi;

  Shuffle2Arr(Re, Im, Pwr);                /* Shuffle before (i)FFT */
  pwrhelp = Pwr;                          /* Determine size of arrs */
  N = 1;
    N = N * 2;
  } while (pwrhelp > 0) ;

  if (Dir >= 1) AngleCounter = 0;                            /* FFT */
  else          AngleCounter = 14;                   /* Inverse FFT */
  Section = 1;
  while (Section < N)
    FlyDistance = 2 * Section;
    c = CosArray[AngleCounter];
    s = SinArray[AngleCounter];
    Qr = 1; Qi = 0;
    for (FlyCount = 0; FlyCount <= Section - 1; FlyCount++)
      index1 = FlyCount;
        index2 = index1 + Section;
                                            /* Perform 2-Point DFT  */
        tempr = 1.0 * Qr * Re[index2] - 1.0 * Qi * Im[index2];
        tempi = 1.0 * Qr * Im[index2] + 1.0 * Qi * Re[index2];

        Re[index2] = Re[index1] - tempr;             /* For Re-part */
        Re[index1] = Re[index1] + tempr;
        Im[index2] = Im[index1] - tempi;             /* For Im-part */
        Im[index1] = Im[index1] + tempi;

        index1 = index1 + FlyDistance;
      } while (index1 <= (N - 1));

      /*                 k                                  */
      /*  Calculate new Q = cos(ak) + j*sin(ak) = Qr + j*Qi */
      /*          -2*pi                                     */
      /*  with: a = -----                                   */
      /*            N                                       */
      temp = Qr;
      Qr   = Qr*c - Qi*s;
      Qi   = Qi*c + temp*s;
    Section = Section * 2;
    AngleCounter = AngleCounter + 1;
  if (Dir <= 0)                                    /* Normalize for */
  {                                             /* inverse FFT only */
    scale = 1.0/N;
    for (index1 = 0; index1 <= N - 1; index1++)
      Re[index1] = scale * Re[index1];
      Im[index1] = scale * Im[index1];
/* ---------------------------------------------------------------- */

Listing Two

/* PCFFT.H -- by J.G.G. Dobbe -- Headers for PCFFT.C */

#ifndef PCFFT_H                /* If not defined yet, use this file */
#define PCFFT_H

/* --------------------- External function ------------------------ */
void Fft(float *Re, float *Im, int Pwr, int Dir);
/* ---------------------------------------------------------------- */

Listing Three

/* PCFFTest.C -- by J.G.G. Dobbe -- Test program written in Turbo C/Borland C 
                 that uses the C or Assembler version of FFT. It depends on 
                 the type of FFT object file whether the C- or ASM-version is 
                 linked in. In both cases, the same FFT.H file is used.

/* --------------------- Include directives ----------------------- */
#include <stdio.h>
#include <conio.h>
#include "pcfft.h"

/* --------------------- Constant definition ---------------------- */
#define SIZE 16                     /* Size of data arrays (re, im) */

/* --------------------- Variable definitions --------------------- */
float re[SIZE];                  /* Array holding Real part of data */
float im[SIZE];             /* Array holding Imaginary part of data */

/* --------------------- Function implementations ----------------- */
/* --------------------- DispArr ---------------------------------- */
void DispArr(char *Txt)
  int i;                                            /* Loop counter */
  clrscr();                                         /* Clear screen */
  printf("\n%s:\n", Txt);                         /* Display header */
  for (i = 0; i < SIZE; i++)                 /* Display data points */
    printf("i = %4d: Re = %8.2f, Im = %8.2f\n", i, re[i], im[i]);
  printf("Press <ENTER> to continue\n");         /* Display message */
  getch();                                    /* Wait for keystroke */
/* --------------------- main ------------------------------------- */
int main()
  int i;
  for (i = 0; i < SIZE; i++)                   /* Clear data arrays */
    re[i] = 0.0;
    im[i] = 0.0;
  re[0] = 100.0;                    /* Fill array with pulse signal */
  DispArr("Input Time Data");   /* Display time data (pulse signal) */
  Fft(re, im, 4, 1);                          /* FFT on data points */
  DispArr("Frequency Response (after FFT)");   /* Display freq data */
  Fft(re, im, 4, -1);                   /* Inverse FFT on freq data */
  DispArr("Time Response (after inverse FFT)");
                                /* Display time data (pulse signal) */
  return 0;
/* ---------------------------------------------------------------- */

Copyright © 1995, Dr. Dobb's Journal

Back to Table of Contents