International
Tables for
Crystallography
Volume B
Reciprocal space
Edited by U. Shmueli

International Tables for Crystallography (2006). Vol. B, ch. 1.3, pp. 49-58   | 1 | 2 |

## Section 1.3.3. Numerical computation of the discrete Fourier transform

G. Bricognea

aMRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, England, and LURE, Bâtiment 209D, Université Paris-Sud, 91405 Orsay, France

### 1.3.3. Numerical computation of the discrete Fourier transform

| top | pdf |

#### 1.3.3.1. Introduction

| top | pdf |

The Fourier transformation's most remarkable property is undoubtedly that of turning convolution into multiplication. As distribution theory has shown, other valuable properties – such as the shift property, the conversion of differentiation into multiplication by monomials, and the duality between periodicity and sampling – are special instances of the convolution theorem.

This property is exploited in many areas of applied mathematics and engineering (Campbell & Foster, 1948; Sneddon, 1951; Champeney, 1973; Bracewell, 1986). For example, the passing of a signal through a linear filter, which results in its being convolved with the response of the filter to a δ-function impulse', may be modelled as a multiplication of the signal's transform by the transform of the impulse response (also called transfer function). Similarly, the solution of systems of partial differential equations may be turned by Fourier transformation into a division problem for distributions. In both cases, the formulations obtained after Fourier transformation are considerably simpler than the initial ones, and lend themselves to constructive solution techniques.

Whenever the functions to which the Fourier transform is applied are band-limited, or can be well approximated by band-limited functions, the discrete Fourier transform (DFT) provides a means of constructing explicit numerical solutions to the problems at hand. A great variety of investigations in physics, engineering and applied mathematics thus lead to DFT calculations, to such a degree that, at the time of writing, about 50% of all supercomputer CPU time is alleged to be spent calculating DFTs.

The straightforward use of the defining formulae for the DFT leads to calculations of size for N sample points, which become unfeasible for any but the smallest problems. Much ingenuity has therefore been exerted on the design and implementation of faster algorithms for calculating the DFT (McClellan & Rader, 1979; Nussbaumer, 1981; Blahut, 1985; Brigham, 1988). The most famous is that of Cooley & Tukey (1965) which heralded the age of digital signal processing. However, it had been preceded by the prime factor algorithm of Good (1958, 1960), which has lately been the basis of many new developments. Recent historical research (Goldstine, 1977, pp. 249–253; Heideman et al., 1984) has shown that Gauss essentially knew the Cooley–Tukey algorithm as early as 1805 (before Fourier's 1807 work on harmonic analysis!); while it has long been clear that Dirichlet knew of the basis of the prime factor algorithm and used it extensively in his theory of multiplicative characters [see e.g. Chapter I of Ayoub (1963), and Chapters 6 and 8 of Apostol (1976)]. Thus the computation of the DFT, far from being a purely technical and rather narrow piece of specialized numerical analysis, turns out to have very rich connections with such central areas of pure mathematics as number theory (algebraic and analytic), the representation theory of certain Lie groups and coding theory – to list only a few. The interested reader may consult Auslander & Tolimieri (1979); Auslander, Feig & Winograd (1982, 1984); Auslander & Tolimieri (1985); Tolimieri (1985).

One-dimensional algorithms are examined first. The Sande mixed-radix version of the Cooley–Tukey algorithm only calls upon the additive structure of congruence classes of integers. The prime factor algorithm of Good begins to exploit some of their multiplicative structure, and the use of relatively prime factors leads to a stronger factorization than that of Sande. Fuller use of the multiplicative structure, via the group of units, leads to the Rader algorithm; and the factorization of short convolutions then yields the Winograd algorithms.

Multidimensional algorithms are at first built as tensor products of one-dimensional elements. The problem of factoring the DFT in several dimensions simultaneously is then examined. The section ends with a survey of attempts at formalizing the interplay between algorithm structure and computer architecture for the purpose of automating the design of optimal DFT code.

It was originally intended to incorporate into this section a survey of all the basic notions and results of abstract algebra which are called upon in the course of these developments, but time limitations have made this impossible. This material, however, is adequately covered by the first chapter of Tolimieri et al. (1989) in a form tailored for the same purposes. Similarly, the inclusion of numerous detailed examples of the algorithms described here has had to be postponed to a later edition, but an abundant supply of such examples may be found in the signal processing literature, for instance in the books by McClellan & Rader (1979), Blahut (1985), and Tolimieri et al. (1989).

#### 1.3.3.2. One-dimensional algorithms

| top | pdf |

Throughout this section we will denote by the expression , . The mapping has the following properties: Thus e defines an isomorphism between the additive group (the reals modulo the integers) and the multiplicative group of complex numbers of modulus 1. It follows that the mapping , where and N is a positive integer, defines an isomorphism between the one-dimensional residual lattice and the multiplicative group of Nth roots of unity.

The DFT on N points then relates vectors X and in W and through the linear transformations:

#### 1.3.3.2.1. The Cooley–Tukey algorithm

| top | pdf |

The presentation of Gentleman & Sande (1966) will be followed first [see also Cochran et al. (1967)]. It will then be reinterpreted in geometric terms which will prepare the way for the treatment of multidimensional transforms in Section 1.3.3.3.

Suppose that the number of sample points N is composite, say . We may write k to the base and to the base as follows: The defining relation for may then be written: The argument of may be expanded as and the last summand, being an integer, may be dropped: This computation may be decomposed into five stages, as follows:

 (i) form the vectors of length by the prescription (ii) calculate the transforms on points: (iii) form the vectors of length by the prescription (iv) calculate the transforms on points: (v) collect as .

If the intermediate transforms in stages (ii) and (iv) are performed in place, i.e. with the results overwriting the data, then at stage (v) the result will be found at address . This phenomenon is called scrambling by digit reversal', and stage (v) is accordingly known as unscrambling.

The initial N-point transform has thus been performed as transforms on points, followed by transforms on points, thereby reducing the arithmetic cost from to . The phase shifts applied at stage (iii) are traditionally called twiddle factors', and the transposition between and can be performed by the fast recursive technique of Eklundh (1972). Clearly, this procedure can be applied recursively if and are themselves composite, leading to an overall arithmetic cost of order N log N if N has no large prime factors.

The Cooley–Tukey factorization may also be derived from a geometric rather than arithmetic argument. The decomposition is associated to a geometric partition of the residual lattice into copies of , each translated by and blown up' by a factor . This partition in turn induces a (direct sum) decomposition of X as where

According to (i), is related to by decimation by and offset by . By Section 1.3.2.7.2, is related to by periodization by and phase shift by , so that the periodization by being reflected by the fact that does not depend on . Writing and expanding shows that the phase shift contains both the twiddle factor and the kernel of . The Cooley–Tukey algorithm is thus naturally associated to the coset decomposition of a lattice modulo a sublattice (Section 1.3.2.7.2).

It is readily seen that essentially the same factorization can be obtained for , up to the complex conjugation of the twiddle factors. The normalizing constant arises from the normalizing constants and in and , respectively.

Factors of 2 are particularly simple to deal with and give rise to a characteristic computational structure called a butterfly loop'. If , then two options exist:

 (a) using and leads to collecting the even-numbered coordinates of X into and the odd-numbered coordinates into and writing: This is the original version of Cooley & Tukey, and the process of formation of and is referred to as decimation in time' (i.e. decimation along the data index k). (b) using and leads to forming then obtaining separately the even-numbered and odd-numbered components of by transforming and : This version is due to Sande (Gentleman & Sande, 1966), and the process of separately obtaining even-numbered and odd-numbered results has led to its being referred to as decimation in frequency' (i.e. decimation along the result index ).

By repeated factoring of the number N of sample points, the calculation of and can be reduced to a succession of stages, the smallest of which operate on single prime factors of N. The reader is referred to Gentleman & Sande (1966) for a particularly lucid analysis of the programming considerations which help implement this factorization efficiently; see also Singleton (1969). Powers of two are often grouped together into factors of 4 or 8, which are advantageous in that they require fewer complex multiplications than the repeated use of factors of 2. In this approach, large prime factors P are detrimental, since they require a full -size computation according to the defining formula.

| top | pdf |

#### 1.3.3.2.2.1. Ring structure on

| top | pdf |

The set of congruence classes of integers modulo an integer N [see e.g. Apostol (1976), Chapter 5] inherits from not only the additive structure used in deriving the Cooley–Tukey factorization, but also a multiplicative structure in which the product of two congruence classes mod N is uniquely defined as the class of the ordinary product (in ) of representatives of each class. The multiplication can be distributed over addition in the usual way, endowing with the structure of a commutative ring.

If N is composite, the ring has zero divisors. For example, let , let mod N, and let mod N: then mod N. In the general case, a product of non-zero elements will be zero whenever these elements collect together all the factors of N. These circumstances give rise to a fundamental theorem in the theory of commutative rings, the Chinese Remainder Theorem (CRT), which will now be stated and proved [see Apostol (1976), Chapter 5; Schroeder (1986), Chapter 16].

#### 1.3.3.2.2.2. The Chinese remainder theorem

| top | pdf |

Let be factored into a product of pairwise coprime integers, so that g.c.d. for . Then the system of congruence equations has a unique solution mod N. In other words, each is associated in a one-to-one fashion to the d-tuple of its residue classes in .

The proof of the CRT goes as follows. Let Since g.c.d. there exist integers and such that then the integer is the solution. Indeed, because all terms with contain as a factor; and by the defining relation for .

It may be noted that so that the are mutually orthogonal idempotents in the ring , with properties formally similar to those of mutually orthogonal projectors onto subspaces in linear algebra. The analogy is exact, since by virtue of the CRT the ring may be considered as the direct product via the two mutually inverse mappings:

 (i) by mod for each j; (ii) .

The mapping defined by (ii) is sometimes called the CRT reconstruction' of from the .

These two mappings have the property of sending sums to sums and products to products, i.e: (the last proof requires using the properties of the idempotents ). This may be described formally by stating that the CRT establishes a ring isomorphism:

#### 1.3.3.2.2.3. The prime factor algorithm

| top | pdf |

The CRT will now be used to factor the N-point DFT into a tensor product of d transforms, the jth of length .

Let the indices k and be subjected to the following mappings:

 (i) , by mod for each j, with reconstruction formula (ii) , by mod for each j, with reconstruction formula

Then Cross terms with vanish since they contain all the factors of N, hence Dividing by N, which may be written as for each j, yields and hence Therefore, by the multiplicative property of ,

Let be described by a one-dimensional array indexed by k. The index mapping (i) turns X into an element of described by a d-dimensional array ; the latter may be transformed by into a new array . Finally, the one-dimensional array of results will be obtained by reconstructing according to (ii).

The prime factor algorithm, like the Cooley–Tukey algorithm, reindexes a 1D transform to turn it into d separate transforms, but the use of coprime factors and CRT index mapping leads to the further gain that no twiddle factors need to be applied between the successive transforms (see Good, 1971). This makes up for the cost of the added complexity of the CRT index mapping.

The natural factorization of N for the prime factor algorithm is thus its factorization into prime powers: is then the tensor product of separate transforms (one for each prime power factor ) whose results can be reassembled without twiddle factors. The separate factors within each must then be dealt with by another algorithm (e.g. Cooley–Tukey, which does require twiddle factors). Thus, the DFT on a prime number of points remains undecomposable.

| top | pdf |

The previous two algorithms essentially reduce the calculation of the DFT on N points for N composite to the calculation of smaller DFTs on prime numbers of points, the latter remaining irreducible. However, Rader (1968) showed that the p-point DFT for p an odd prime can itself be factored by invoking some extra arithmetic structure present in .

#### 1.3.3.2.3.1. N an odd prime

| top | pdf |

The ring has the property that its non-zero elements, called units, form a multiplicative group . In particular, all units have a unique multiplicative inverse in , i.e. a unit such that . This endows with the structure of a finite field.

Furthermore, is a cyclic group, i.e. consists of the successive powers of a generator g called a primitive root mod p (such a g may not be unique, but it always exists). For instance, for , is generated by , whose successive powers mod 7 are: [see Apostol (1976), Chapter 10].

The basis of Rader's algorithm is to bring to light a hidden regularity in the matrix by permuting the basis vectors and of as follows: where g is a primitive root mod p.

With respect to these new bases, the matrix representing will have the following elements: Thus the core' of matrix , of size , formed by the elements with two non-zero indices, has a so-called skew-circulant structure because element depends only on . Simplification may now occur because multiplication by is closely related to a cyclic convolution. Introducing the notation we may write the relation in the permuted bases as where Z is defined by , .

Thus may be obtained by cyclic convolution of C and Z, which may for instance be calculated by where × denotes the component-wise multiplication of vectors. Since p is odd, is always divisible by 2 and may even be highly composite. In that case, factoring by means of the Cooley–Tukey or Good methods leads to an algorithm of complexity p log p rather than for . An added bonus is that, because , the elements of can be shown to be either purely real or purely imaginary, which halves the number of real multiplications involved.

#### 1.3.3.2.3.2. N a power of an odd prime

| top | pdf |

This idea was extended by Winograd (1976, 1978) to the treatment of prime powers , using the cyclic structure of the multiplicative group of units . The latter consists of all those elements of which are not divisible by p, and thus has elements. It is cyclic, and there exist primitive roots g modulo such that The elements divisible by p, which are divisors of zero, have to be treated separately just as 0 had to be treated separately for .

When , then with . The results are p-decimated, hence can be obtained via the -point DFT of the -periodized data Y: with

When , then we may write where contains the contributions from and those from . By a converse of the previous calculation, arises from p-decimated data Z, hence is the -periodization of the -point DFT of these data: with (the -periodicity follows implicity from the fact that the transform on the right-hand side is independent of ).

Finally, the contribution from all may be calculated by reindexing by the powers of a primitive root g modulo , i.e. by writing then carrying out the multiplication by the skew-circulant matrix core as a convolution.

Thus the DFT of size may be reduced to two DFTs of size (dealing, respectively, with p-decimated results and p-decimated data) and a convolution of size . The latter may be diagonalized' into a multiplication by purely real or purely imaginary numbers (because ) by two DFTs, whose factoring in turn leads to DFTs of size and . This method, applied recursively, allows the complete decomposition of the DFT on points into arbitrarily small DFTs.

#### 1.3.3.2.3.3. N a power of 2

| top | pdf |

When , the same method can be applied, except for a slight modification in the calculation of . There is no primitive root modulo for : the group is the direct product of two cyclic groups, the first (of order 2) generated by −1, the second (of order ) generated by 3 or 5. One then uses a representation and the reindexed core matrix gives rise to a two-dimensional convolution. The latter may be carried out by means of two 2D DFTs on points.

| top | pdf |

The cyclic convolutions generated by Rader's multiplicative reindexing may be evaluated more economically than through DFTs if they are re-examined within a new algebraic setting, namely the theory of congruence classes of polynomials [see, for instance, Blahut (1985), Chapter 2; Schroeder (1986), Chapter 24].

The set, denoted , of polynomials in one variable with coefficients in a given field has many of the formal properties of the set of rational integers: it is a ring with no zero divisors and has a Euclidean algorithm on which a theory of divisibility can be built.

Given a polynomial , then for every there exist unique polynomials and such that and is called the residue of modulo . Two polynomials and having the same residue modulo are said to be congruent modulo , which is denoted by

If is said to be divisible by . If only has divisors of degree zero in , it is said to be irreducible over (this notion depends on ). Irreducible polynomials play in a role analogous to that of prime numbers in , and any polynomial over has an essentially unique factorization as a product of irreducible polynomials.

There exists a Chinese remainder theorem (CRT) for polynomials. Let be factored into a product of pairwise coprime polynomials [i.e. and have no common factor for ]. Then the system of congruence equations has a unique solution modulo . This solution may be constructed by a procedure similar to that used for integers. Let Then and are coprime, and the Euclidean algorithm may be used to obtain polynomials and such that With , the polynomial is easily shown to be the desired solution.

As with integers, it can be shown that the 1:1 correspondence between and sends sums to sums and products to products, i.e. establishes a ring isomorphism:

These results will now be applied to the efficient calculation of cyclic convolutions. Let and be two vectors of length N, and let be obtained by cyclic convolution of U and V: The very simple but crucial result is that this cyclic convolution may be carried out by polynomial multiplication modulo : if then the above relation is equivalent to Now the polynomial can be factored over the field of rational numbers into irreducible factors called cyclotomic polynomials: if d is the number of divisors of N, including 1 and N, then where the cyclotomics are well known (Nussbaumer, 1981; Schroeder, 1986, Chapter 22). We may now invoke the CRT, and exploit the ring isomorphism it establishes to simplify the calculation of from and as follows:

 (i) compute the d residual polynomials (ii) compute the d polynomial products (iii) use the CRT reconstruction formula just proved to recover from the :

When N is not too large, i.e. for short cyclic convolutions', the are very simple, with coefficients 0 or ±1, so that (i) only involves a small number of additions. Furthermore, special techniques have been developed to multiply general polynomials modulo cyclotomic polynomials, thus helping keep the number of multiplications in (ii) and (iii) to a minimum. As a result, cyclic convolutions can be calculated rapidly when N is sufficiently composite.

It will be recalled that Rader's multiplicative indexing often gives rise to cyclic convolutions of length for p an odd prime. Since is highly composite for all other than 23 and 47, these cyclic convolutions can be performed more efficiently by the above procedure than by DFT.

These combined algorithms are due to Winograd (1977, 1978, 1980), and are known collectively as Winograd small FFT algorithms'. Winograd also showed that they can be thought of as bringing the DFT matrix F to the following normal form': where

 A is an integer matrix with entries 0, , defining the pre-additions', B is a diagonal matrix of multiplications, C is a matrix with entries 0, , , defining the post-additions'.

The elements on the diagonal of B can be shown to be either real or pure imaginary, by the same argument as in Section 1.3.3.2.3.1. Matrices A and C may be rectangular rather than square, so that intermediate results may require extra storage space.

#### 1.3.3.3. Multidimensional algorithms

| top | pdf |

From an algorithmic point of view, the distinction between one-dimensional (1D) and multidimensional DFTs is somewhat blurred by the fact that some factoring techniques turn a 1D transform into a multidimensional one. The distinction made here, however, is a practical one and is based on the dimensionality of the indexing sets for data and results. This section will therefore be concerned with the problem of factoring the DFT when the indexing sets for the input data and output results are multidimensional.

#### 1.3.3.3.1. The method of successive one-dimensional transforms

| top | pdf |

The DFT was defined in Section 1.3.2.7.4 in an n-dimensional setting and it was shown that when the decimation matrix N is diagonal, say , then has a tensor product structure: This may be rewritten as follows: where the I's are identity matrices and × denotes ordinary matrix multiplication. The matrix within each bracket represents a one-dimensional DFT along one of the n dimensions, the other dimensions being left untransformed. As these matrices commute, the order in which the successive 1D DFTs are performed is immaterial.

This is the most straightforward method for building an n-dimensional algorithm from existing 1D algorithms. It is known in crystallography under the name of Beevers–Lipson factorization' (Section 1.3.4.3.1), and in signal processing as the row–column method'.

#### 1.3.3.3.2. Multidimensional factorization

| top | pdf |

Substantial reductions in the arithmetic cost, as well as gains in flexibility, can be obtained if the factoring of the DFT is carried out in several dimensions simultaneously. The presentation given here is a generalization of that of Mersereau & Speake (1981), using the abstract setting established independently by Auslander, Tolimieri & Winograd (1982).

Let us return to the general n-dimensional setting of Section 1.3.2.7.4, where the DFT was defined for an arbitrary decimation matrix N by the formulae (where denotes ): with

#### 1.3.3.3.2.1. Multidimensional Cooley–Tukey factorization

| top | pdf |

Let us now assume that this decimation can be factored into d successive decimations, i.e. that and hence Then the coset decomposition formulae corresponding to these successive decimations (Section 1.3.2.7.1) can be combined as follows: with . Therefore, any may be written uniquely as Similarly: so that any may be written uniquely as with . These decompositions are the vector analogues of the multi-radix number representation systems used in the Cooley–Tukey factorization.

We may then write the definition of with factors as The argument of e(–) may be expanded as The first summand may be recognized as a twiddle factor, the second and third as the kernels of and , respectively, while the fourth is an integer which may be dropped. We are thus led to a vector-radix' version of the Cooley–Tukey algorithm, in which the successive decimations may be introduced in all n dimensions simultaneously by general integer matrices. The computation may be decomposed into five stages analogous to those of the one-dimensional algorithm of Section 1.3.3.2.1:

 (i) form the vectors of shape by (ii) calculate the transforms on points: (iii) form the vectors of shape by (iv) calculate the transforms on points: (v) collect as .

The initial -point transform can thus be performed as transforms on points, followed by transforms on points. This process can be applied successively to all d factors. The same decomposition applies to , up to the complex conjugation of twiddle factors, the normalization factor being obtained as the product of the factors in the successive partial transforms .

The geometric interpretation of this factorization in terms of partial transforms on translates of sublattices applies in full to this n-dimensional setting; in particular, the twiddle factors are seen to be related to the residual translations which place the sublattices in register within the big lattice. If the intermediate transforms are performed in place, then the quantity will eventually be found at location so that the final results will have to be unscrambled by a process which may be called coset reversal', the vector equivalent of digit reversal.

Factoring by 2 in all n dimensions simultaneously, i.e. taking , leads to n-dimensional butterflies'. Decimation in time corresponds to the choice , so that is an n-dimensional parity class; the calculation then proceeds by Decimation in frequency corresponds to the choice , , so that labels octant' blocks of shape M; the calculation then proceeds through the following steps: i.e. the parity classes of results, corresponding to the different , are obtained separately. When the dimension n is 2 and the decimating matrix is diagonal, this analysis reduces to the vector radix FFT' algorithms proposed by Rivard (1977) and Harris et al. (1977). These lead to substantial reductions in the number M of multiplications compared to the row–column method: M is reduced to by simultaneous factoring, and to by simultaneous factoring.

The use of a non-diagonal decimating matrix may bring savings in computing time if the spectrum of the band-limited function under study is of such a shape as to pack more compactly in a non-rectangular than in a rectangular lattice (Mersereau, 1979). If, for instance, the support K of the spectrum Φ is contained in a sphere, then a decimation matrix producing a close packing of these spheres will yield an aliasing-free DFT algorithm with fewer sample points than the standard algorithm using a rectangular lattice.

#### 1.3.3.3.2.2. Multidimensional prime factor algorithm

| top | pdf |

Suppose that the decimation matrix N is diagonal and let each diagonal element be written in terms of its prime factors: where m is the total number of distinct prime factors present in the .

The CRT may be used to turn each 1D transform along dimension i into a multidimensional transform with a separate pseudo-dimension' for each distinct prime factor of ; the number , of these pseudo-dimensions is equal to the cardinality of the set: The full n-dimensional transform thus becomes μ-dimensional, with .

We may now permute the μ pseudo-dimensions so as to bring into contiguous position those corresponding to the same prime factor ; the m resulting groups of pseudo-dimensions are said to define p-primary' blocks. The initial transform is now written as a tensor product of m p-primary transforms, where transform j is on points [by convention, dimension i is not transformed if ]. These p-primary transforms may be computed, for instance, by multidimensional Cooley–Tukey factorization (Section 1.3.3.3.1), which is faster than the straightforward row–column method. The final results may then be obtained by reversing all the permutations used.

The extra gain with respect to the multidimensional Cooley–Tukey method is that there are no twiddle factors between p-primary pieces corresponding to different primes p.

The case where N is not diagonal has been examined by Guessoum & Mersereau (1986).

#### 1.3.3.3.2.3. Nesting of Winograd small FFTs

| top | pdf |

Suppose that the CRT has been used as above to map an n-dimensional DFT to a μ-dimensional DFT. For each [κ runs over those pairs (i, j) such that ], the Rader/Winograd procedure may be applied to put the matrix of the κth 1D DFT in the CBA normal form of a Winograd small FFT. The full DFT matrix may then be written, up to permutation of data and results, as

A well known property of the tensor product of matrices allows this to be rewritten as and thus to form a matrix in which the combined pre-addition, multiplication and post-addition matrices have been precomputed. This procedure, called nesting, can be shown to afford a reduction of the arithmetic operation count compared to the row–column method (Morris, 1978).

Clearly, the nesting rearrangement need not be applied to all μ dimensions, but can be restricted to any desired subset of them.

#### 1.3.3.3.2.4. The Nussbaumer–Quandalle algorithm

| top | pdf |

Nussbaumer's approach views the DFT as the evaluation of certain polynomials constructed from the data (as in Section 1.3.3.2.4). For instance, putting , the 1D N-point DFT may be written where the polynomial Q is defined by

Let us consider (Nussbaumer & Quandalle, 1979) a 2D transform of size : By introduction of the polynomials this may be rewritten:

Let us now suppose that is coprime to N. Then has a unique inverse modulo N (denoted by ), so that multiplication by simply permutes the elements of and hence for any function f over . We may thus write: where Since only the value of polynomial at is involved in the result, the computation of may be carried out modulo the unique cyclotomic polynomial such that . Thus, if we define: we may write: or equivalently

For N an odd prime p, all non-zero values of are coprime with p so that the -point DFT may be calculated as follows:

 (1) form the polynomials for ; (2) evaluate for ; (3) put ; (4) calculate the terms for separately by

Step (1) is a set of p polynomial transforms' involving no multiplications; step (2) consists of p DFTs on p points each since if then step (3) is a permutation; and step (4) is a p-point DFT. Thus the 2D DFT on points, which takes 2p p-point DFTs by the row–column method, involves only p-point DFTs; the other DFTs have been replaced by polynomial transforms involving only additions.

This procedure can be extended to n dimensions, and reduces the number of 1D p-point DFTs from for the row–column method to , at the cost of introducing extra additions in the polynomial transforms.

A similar algorithm has been formulated by Auslander et al. (1983) in terms of Galois theory.

| top | pdf |

#### 1.3.3.3.3.1. From local pieces to global algorithms

| top | pdf |

The mathematical analysis of the structure of DFT computations has brought to light a broad variety of possibilities for reducing or reshaping their arithmetic complexity. All of them are analytic' in that they break down large transforms into a succession of smaller ones.

These results may now be considered from the converse synthetic' viewpoint as providing a list of procedures for assembling them:

 (i) the building blocks are one-dimensional p-point algorithms for p a small prime; (ii) the low-level connectors are the multiplicative reindexing methods of Rader and Winograd, or the polynomial transform reindexing method of Nussbaumer and Quandalle, which allow the construction of efficient algorithms for larger primes p, for prime powers , and for p-primary pieces of shape ; (iii) the high-level connectors are the additive reindexing scheme of Cooley–Tukey, the Chinese remainder theorem reindexing, and the tensor product construction; (iv) nesting may be viewed as the glue' which seals all elements.

The simplest DFT may then be carried out into a global algorithm in many different ways. The diagrams in Fig. 1.3.3.1 illustrate a few of the options available to compute a 400-point DFT. They may differ greatly in their arithmetic operation counts.

 Figure 1.3.3.1 | top | pdf |A few global algorithms for computing a 400-point DFT. CT: Cooley–Tukey factorization. PF: prime factor (or Good) factorization. W: Winograd algorithm.

#### 1.3.3.3.3.2. Computer architecture considerations

| top | pdf |

To obtain a truly useful measure of the computational complexity of a DFT algorithm, its arithmetic operation count must be tempered by computer architecture considerations. Three main types of trade-offs must be borne in mind:

 (i) reductions in floating-point (f.p.) arithmetic count are obtained by reindexing, hence at the cost of an increase in integer arithmetic on addresses, although some shortcuts may be found (Uhrich, 1969; Burrus & Eschenbacher, 1981); (ii) reduction in the f.p. multiplication count usually leads to a large increase in the f.p. addition count (Morris, 1978); (iii) nesting can increase execution speed, but causes a loss of modularity and hence complicates program development (Silverman, 1977; Kolba & Parks, 1977).

Many of the mathematical developments above took place in the context of single-processor serial computers, where f.p. addition is substantially cheaper than f.p. multiplication but where integer address arithmetic has to compete with f.p. arithmetic for processor cycles. As a result, the alternatives to the Cooley–Tukey algorithm hardly ever led to particularly favourable trade-offs, thus creating the impression that there was little to gain by switching to more exotic algorithms.

The advent of new machine architectures with vector and/or parallel processing features has greatly altered this picture (Pease, 1968; Korn & Lambiotte, 1979; Fornberg, 1981; Swartzrauber, 1984):

 (i) pipelining equalizes the cost of f.p. addition and f.p. multiplication, and the ideal blend' of the two types of operations depends solely on the number of adder and multiplier units available in each machine; (ii) integer address arithmetic is delegated to specialized arithmetic and logical units (ALUs) operating concurrently with the f.p. units, so that complex reindexing schemes may be used without loss of overall efficiency.

Another major consideration is that of data flow [see e.g. Nawab & McClellan (1979)]. Serial machines only have few registers and few paths connecting them, and allow little or no overlap between computation and data movement. New architectures, on the other hand, comprise banks of vector registers (or cache memory') besides the usual internal registers, and dedicated ALUs can service data transfers between several of them simultaneously and concurrently with computation.

In this new context, the devices described in Sections 1.3.3.2 and 1.3.3.3 for altering the balance between the various types of arithmetic operations, and reshaping the data flow during the computation, are invaluable. The field of machine-dependent DFT algorithm design is thriving on them [see e.g. Temperton (1983a,b,c, 1985); Agarwal & Cooley (1986, 1987)].

#### 1.3.3.3.3.3. The Johnson–Burrus family of algorithms

| top | pdf |

In order to explore systematically all possible algorithms for carrying out a given DFT computation, and to pick the one best suited to a given machine, attempts have been made to develop:

 (i) a high-level notation of describing all the ingredients of a DFT computation, including data permutation and data flow; (ii) a formal calculus capable of operating on these descriptions so as to represent all possible reorganizations of the computation; (iii) an automatic procedure for evaluating the performance of a given algorithm on a specific architecture.

Task (i) can be accomplished by systematic use of a tensor product notation to represent the various stages into which the DFT can be factored (reindexing, small transforms on subsets of indices, twiddle factors, digit-reversal permutations).

Task (ii) may for instance use the Winograd CBA normal form for each small transform, then apply the rules governing the rearrangement of tensor product and ordinary product × operations on matrices. The matching of these rearrangements to the architecture of a vector and/or parallel computer can be formalized algebraically [see e.g. Chapter 2 of Tolimieri et al. (1989)].

Task (iii) is a complex search which requires techniques such as dynamic programming (Bellman, 1958).

Johnson & Burrus (1983) have proposed and tested such a scheme to identify the optimal trade-offs between prime factor nesting and Winograd nesting of small Winograd transforms. In step (ii), they further decomposed the pre-addition matrix A and post-addition matrix C into several factors, so that the number of design options available becomes very large: the N-point DFT when N has four factors can be calculated in over 1012 distinct ways.

This large family of nested algorithms contains the prime factor algorithm and the Winograd algorithms as particular cases, but usually achieves greater efficiency than either by reducing the f.p. multiplication count while keeping the number of f.p. additions small.

There is little doubt that this systematic approach will be extended so as to incorporate all available methods of restructuring the DFT.

### References

Agarwal, R. C. & Cooley, J. W. (1986). Fourier transform and convolution subroutines for the IBM 3090 Vector facility. IBM J. Res. Dev. 30, 145–162.
Agarwal, R. C. & Cooley, J. W. (1987). Vectorized mixed radix discrete Fourier transform algorithms. Proc. IEEE, 75, 1283–1292.
Apostol, T. M. (1976). Introduction to analytic number theory. New York: Springer-Verlag.
Auslander, L., Feig, E. & Winograd, S. (1982). New algorithms for evaluating the 3-dimensional discrete Fourier transform. In Computational crystallography, edited by D. Sayre, pp. 451–461. New York: Oxford University Press.
Auslander, L., Feig, E. & Winograd, S. (1983). New algorithms for the multidimensional discrete Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 31, 388–403.
Auslander, L., Feig, E. & Winograd, S. (1984). Abelian semi-simple algebras and algorithms for the discrete Fourier transform. Adv. Appl. Math. 5, 31–55.
Auslander, L. & Tolimieri, R. (1979). Is computing with the finite Fourier transform pure or applied mathematics? Bull. Am. Math. Soc. 1, 847–897.
Auslander, L. & Tolimieri, R. (1985). Ring structure and the Fourier transform. Math. Intelligencer, 7, 49–54.
Auslander, L., Tolimieri, R. & Winograd, S. (1982). Hecke's theorem in quadratic reciprocity, finite nilpotent groups and the Cooley–Tukey algorithm. Adv. Math. 43, 122–172.
Ayoub, R. (1963). An introduction to the analytic theory of numbers. Providence, RI: American Mathematical Society.
Bellman, R. (1958). Dynamic programming and stochastic control processes. Inf. Control, 1, 228–239.
Blahut, R. E. (1985). Fast algorithms for digital signal processing. Reading: Addison-Wesley.
Bracewell, R. N. (1986). The Fourier transform and its applications, 2nd ed., revised. New York: McGraw-Hill.
Brigham, E. O. (1988). The fast Fourier transform and its applications. Englewood Cliffs: Prentice-Hall.
Burrus, C. S. & Eschenbacher, P. W. (1981). An in-place, in-order prime factor FFT algorithm. IEEE Trans. Acoust. Speech Signal Process. 29, 806–817.
Campbell, G. A. & Foster, R. M. (1948). Fourier integrals for practical applications. Princeton: Van Nostrand.
Champeney, D. C. (1973). Fourier transforms and their physical applications. New York: Academic Press.
Cochran, W. T., Cooley, J. W., Favin, D. L., Helms, H. D., Kaenel, R. A., Lang, W. W., Maling, G. C., Nelson, D. E., Rader, C. M. & Welch, P. D. (1967). What is the fast Fourier transform? IEEE Trans. Audio, 15, 45–55.
Cooley, J. W. & Tukey, J. W. (1965). An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19, 297–301.
Eklundh, J. O. (1972). A fast computer method for matrix transposing. IEEE Trans. C-21, 801–803.
Fornberg, A. (1981). A vector implementation of the fast Fourier transform algorithm. Math. Comput. 36, 189–191.
Gentleman, W. M. & Sande, G. (1966). Fast Fourier transforms – for fun and profit. In AFIPS Proc. 1966 Fall Joint Computer Conference, pp. 563–578. Washington, DC: Spartan Books.
Goldstine, H. H. (1977). A history of numerical analysis from the 16th through the 19th century. New York: Springer-Verlag.
Good, I. J. (1958). The interaction algorithm and practical Fourier analysis. J. R. Stat. Soc. B20, 361–372.
Good, I. J. (1960). Addendum [to Good (1958)]. J. R. Stat. Soc. B22, 372–375.
Good, I. J. (1971). The relationship between two fast Fourier transforms. IEEE Trans. C-20, 310–317.
Guessoum, A. & Mersereau, R. M. (1986). Fast algorithms for the multidimensional discrete Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 34, 937–943.
Harris, D. B., McClellan, J. H., Chan, D. S. K. & Schuessler, H. W. (1977). Vector radix fast Fourier transform. Rec. 1977 IEEE Internat. Conf. Acoust. Speech Signal Process. pp. 548–551.
Heideman, M. T., Johnson, D. H. & Burrus, C. S. (1984). Gauss and the history of the fast Fourier transform. IEEE Acoust. Speech Signal Process. Mag. October 1984.
Johnson, H. W. & Burrus, C. S. (1983). The design of optimal DFT algorithms using dynamic programming. IEEE Trans. Acoust. Speech Signal Process. 31, 378–387.
Kolba, D. P. & Parks, T. W. (1977). A prime factor FFT algorithm using high-speed convolution. IEEE Trans. Acoust. Speech Signal Process. 25, 281–294.
Korn, D. G. & Lambiotte, J. J. Jr (1979). Computing the fast Fourier transform on a vector computer. Math. Comput. 33, 977–992.
McClellan, J. H. & Rader, C. M. (1979). Number theory in digital signal processing. Englewood Cliffs: Prentice Hall.
Mersereau, R. & Speake, T. C. (1981). A unified treatment of Cooley–Tukey algorithms for the evaluation of the multidimensional DFT. IEEE Trans. Acoust. Speech Signal Process. 29, 1011–1018.
Mersereau, R. M. (1979). The processing of hexagonally sampled two-dimensional signals. Proc. IEEE, 67, 930–949.
Morris, R. L. (1978). A comparative study of time efficient FFT and WFTA programs for general purpose computers. IEEE Trans. Acoust. Speech Signal Process. 26, 141–150.
Nawab, H. & McClellan, J. H. (1979). Bounds on the minimum number of data transfers in WFTA and FFT programs. IEEE Trans. Acoust. Speech Signal Process. 27, 393–398.
Nussbaumer, H. J. (1981). Fast Fourier transform and convolution algorithms. Berlin: Springer-Verlag.
Nussbaumer, H. J. & Quandalle, P. (1979). Fast computation of discrete Fourier transforms using polynomial transforms. IEEE Trans. Acoust. Speech Signal Process. 27, 169–181.
Pease, M. C. (1968). An adaptation of the fast Fourier transform for parallel processing. J. Assoc. Comput. Mach. 15, 252–264.
Rader, C. M. (1968). Discrete Fourier transforms when the number of data samples is prime. Proc. IEEE, 56, 1107–1108.
Rivard, G. E. (1977). Direct fast Fourier transform of bivariate functions. IEEE Trans. Acoust. Speech Signal Process. 25, 250–252.
Schroeder, M. R. (1986). Number theory in science and communication, 2nd ed. Berlin: Springer-Verlag.
Silverman, H. F. (1977). An introduction to programming the Winograd Fourier transform algorithm (WFTA). IEEE Trans. Acoust. Speech Signal Process. 25, 152–165.
Singleton, R. C. (1969). An algorithm for computing the mixed radix fast Fourier transform. IEEE Trans. AU, 17, 93–103.
Sneddon, I. N. (1951). Fourier transforms. New York: McGraw-Hill.
Swartzrauber, P. N. (1984). FFT algorithms for vector computers. Parallel Comput. 1, 45–63.
Temperton, C. (1983a). Self-sorting mixed-radix fast Fourier transforms. J. Comput. Phys. 52, 1–23.
Temperton, C. (1983b). A note on the prime factor FFT algorithm. J. Comput. Phys. 52, 198–204.
Temperton, C. (1983c). Fast mixed-radix real Fourier transforms. J. Comput. Phys. 52, 340–350.
Temperton, C. (1985). Implementation of a self-sorting in place prime factor FFT algorithm. J. Comput. Phys. 58, 283–299.
Tolimieri, R. (1985). The algebra of the finite Fourier transform and coding theory. Trans. Am. Math. Soc. 287, 253–273.
Tolimieri, R., An, M. & Lu, C. (1989). Algorithms for discrete Fourier transform and convolution. New York: Springer-Verlag.
Uhrich, M. L. (1969). Fast Fourier transforms without sorting. IEEE Trans. AU, 17, 170–172.
Winograd, S. (1976). On computing the discrete Fourier transform. Proc. Natl Acad. Sci. USA, 73, 1005–1006.
Winograd, S. (1977). Some bilinear forms whose multiplicative complexity depends on the field of constants. Math. Syst. Theor. 10, 169–180.
Winograd, S. (1978). On computing the discrete Fourier transform. Math. Comput. 32, 175–199.
Winograd, S. (1980). Arithmetic complexity of computations. CBMS-NST Regional Conf. Series Appl. Math, Publ. No. 33. Philadelphia: SIAM Publications.