International Tables for Crystallography Volume B: Reciprocal Space. Edited by U. Shmueli. © International Union of Crystallography 2010
International Tables for Crystallography (2010). Vol. B, ch. 1.3, pp. 58–62.

From an algorithmic point of view, the distinction between one-dimensional (1D) and multidimensional DFTs is somewhat blurred by the fact that some factoring techniques turn a 1D transform into a multidimensional one. The distinction made here, however, is a practical one and is based on the dimensionality of the indexing sets for data and results. This section will therefore be concerned with the problem of factoring the DFT when the indexing sets for the input data and output results are multidimensional.
The DFT was defined in Section 1.3.2.7.4 in an n-dimensional setting and it was shown that when the decimation matrix N is diagonal, say N = diag(N^(1), N^(2), ..., N^(n)), then F(N) has a tensor product structure:
F(N) = F(N^(1)) ⊗ F(N^(2)) ⊗ ... ⊗ F(N^(n)).
This may be rewritten as follows:
F(N) = [F(N^(1)) ⊗ I_(N^(2)) ⊗ ... ⊗ I_(N^(n))] × [I_(N^(1)) ⊗ F(N^(2)) ⊗ ... ⊗ I_(N^(n))] × ... × [I_(N^(1)) ⊗ I_(N^(2)) ⊗ ... ⊗ F(N^(n))],
where the I's are identity matrices and × denotes ordinary matrix multiplication. The matrix within each bracket represents a one-dimensional DFT along one of the n dimensions, the other dimensions being left untransformed. As these matrices commute, the order in which the successive 1D DFTs are performed is immaterial.
This is the most straightforward method for building an n-dimensional algorithm from existing 1D algorithms. It is known in crystallography under the name of `Beevers–Lipson factorization' (Section 1.3.4.3.1), and in signal processing as the `row–column method'.
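The row–column method lends itself to a compact illustration. The following pure-Python sketch (function names are illustrative, not part of the text) checks on a small array that successive 1D DFTs along each dimension reproduce the direct 2D DFT:

```python
import cmath

def dft1(x):
    """Plain 1D DFT: X[h] = sum_k x[k] exp(-2*pi*i*k*h/N)."""
    N = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * h / N)
                for k in range(N)) for h in range(N)]

def dft2_row_column(x):
    """2D DFT of an N1 x N2 array by successive 1D DFTs (row-column method)."""
    # Transform along the second dimension (rows) ...
    rows = [dft1(row) for row in x]
    # ... then along the first dimension (columns); the order is
    # immaterial since the factors commute.
    cols = [dft1(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def dft2_direct(x):
    """Direct 2D DFT, for comparison."""
    N1, N2 = len(x), len(x[0])
    return [[sum(x[k1][k2]
                 * cmath.exp(-2j * cmath.pi * (k1 * h1 / N1 + k2 * h2 / N2))
                 for k1 in range(N1) for k2 in range(N2))
             for h2 in range(N2)] for h1 in range(N1)]
```

The n-dimensional case simply repeats the 1D pass once per dimension.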
Substantial reductions in the arithmetic cost, as well as gains in flexibility, can be obtained if the factoring of the DFT is carried out in several dimensions simultaneously. The presentation given here is a generalization of that of Mersereau & Speake (1981), using the abstract setting established independently by Auslander, Tolimieri & Winograd (1982).
Let us return to the general n-dimensional setting of Section 1.3.2.7.4, where the DFT was defined for an arbitrary decimation matrix N by the formulae (where |N| denotes |det N|):
F(N):   X*(k*) = (1/|N|) Σ_k X(k) e[−k* · (N^−1 k)]
F̄(N):   X(k) = Σ_k* X*(k*) e[+k* · (N^−1 k)]
with k ∈ Z^n/NZ^n, k* ∈ Z^n/N^T Z^n.
Let us now assume that this decimation can be factored into d successive decimations, i.e. that
N = N_1 N_2 ... N_(d−1) N_d,
and hence
N^−1 = N_d^−1 N_(d−1)^−1 ... N_2^−1 N_1^−1.
Then the coset decomposition formulae corresponding to these successive decimations (Section 1.3.2.7.1) can be combined, so that any k ∈ Z^n/NZ^n may be written uniquely as
k = k_1 + N_1 k_2 + N_1 N_2 k_3 + ... + (N_1 N_2 × ... × N_(d−1)) k_d,
with k_j ∈ Z^n/N_j Z^n. Similarly, any k* ∈ Z^n/N^T Z^n may be written uniquely as
k* = k_d* + N_d^T k_(d−1)* + ... + (N_d^T N_(d−1)^T × ... × N_2^T) k_1*,
with k_j* ∈ Z^n/N_j^T Z^n. These decompositions are the vector analogues of the multi-radix number representation systems used in the Cooley–Tukey factorization.
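For diagonal decimation matrices this coset decomposition reduces to a mixed-radix digit representation carried out independently in each dimension. A minimal check (sizes chosen arbitrarily for illustration) that every k is reached exactly once:

```python
from itertools import product

# Successive decimations N = N1 N2 with diagonal N1 = diag(2, 2) and
# N2 = diag(2, 3), so N = diag(4, 6).  Every k in Z^2/NZ^2 should then
# decompose uniquely as k = k1 + N1 k2 with k1 in Z^2/N1 Z^2 and
# k2 in Z^2/N2 Z^2 -- the vector analogue of a two-digit mixed-radix
# representation.
N1 = (2, 2)
N2 = (2, 3)
N = tuple(a * b for a, b in zip(N1, N2))

reached = [tuple(k1[i] + N1[i] * k2[i] for i in range(2))
           for k1 in product(range(N1[0]), range(N1[1]))
           for k2 in product(range(N2[0]), range(N2[1]))]
```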
We may then write the definition of F(N) with d = 2 factors as
X*(k_1*, k_2*) = (1/|det N|) Σ_(k_1) Σ_(k_2) X(k_1, k_2) e[−(k_2* + N_2^T k_1*) · (N_2^−1 N_1^−1)(k_1 + N_1 k_2)].
The argument of e(–) may be expanded as
k_2* · (N_2^−1 N_1^−1 k_1) + k_1* · (N_1^−1 k_1) + k_2* · (N_2^−1 k_2) + k_1* · k_2.
The first summand may be recognized as a twiddle factor, the second and third as the kernels of F(N_1) and F(N_2), respectively, while the fourth is an integer which may be dropped. We are thus led to a `vector-radix' version of the Cooley–Tukey algorithm, in which the successive decimations may be introduced in all n dimensions simultaneously by general integer matrices. The computation may be decomposed into five stages analogous to those of the one-dimensional algorithm of Section 1.3.3.2.1.
The initial |det N|-point transform can thus be performed as |det N_1| transforms on |det N_2| points, followed by |det N_2| transforms on |det N_1| points. This process can be applied successively to all d factors. The same decomposition applies to F̄(N), up to the complex conjugation of twiddle factors, the normalization factor 1/|det N| being obtained as the product of the factors 1/|det N_j| in the successive partial transforms F(N_j).
The geometric interpretation of this factorization in terms of partial transforms on translates of sublattices applies in full to this n-dimensional setting; in particular, the twiddle factors are seen to be related to the residual translations which place the sublattices in register within the big lattice. If the intermediate transforms are performed in place, then the quantity
X*(k_d* + N_d^T k_(d−1)* + ... + N_d^T N_(d−1)^T × ... × N_2^T k_1*)
will eventually be found at location
k_1* + N_1 k_2* + ... + N_1 N_2 × ... × N_(d−1) k_d*,
so that the final results will have to be unscrambled by a process which may be called `coset reversal', the vector equivalent of digit reversal.
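In the 1D special case with two scalar factors, coset reversal reduces to ordinary digit reversal. A small sketch (factor sizes illustrative, and the address convention assumed from the two-factor 1D Cooley–Tukey analysis) builds the unscrambling permutation and checks that it is indeed a permutation:

```python
# With two scalar factors N = N1 * N2, the in-place algorithm leaves the
# result with index k* = k2* + N2*k1* stored at address k1* + N1*k2*
# (the 1D special case of `coset reversal').  The unscrambling step is
# the permutation built below.
N1, N2 = 3, 4
N = N1 * N2

# perm[address] = index of the result actually sitting at that address
perm = [0] * N
for k1s in range(N1):
    for k2s in range(N2):
        perm[k1s + N1 * k2s] = k2s + N2 * k1s
```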
Factoring by 2 in all n dimensions simultaneously, i.e. taking N = 2M, leads to `n-dimensional butterflies'. Decimation in time corresponds to the choice N_1 = 2I, N_2 = M, so that k_1 ∈ Z^n/2Z^n is an n-dimensional parity class; the calculation then proceeds by
X*(k_2* + M^T k_1*) = (1/2^n) Σ_(k_1 ∈ Z^n/2Z^n) e[−(k_2* + M^T k_1*) · (N^−1 k_1)] Y*_(k_1)(k_2*),
where Y*_(k_1) is the transform F(M) of the parity subarray Y_(k_1)(k_2) = X(k_1 + 2k_2). Decimation in frequency corresponds to the choice N_1 = M, N_2 = 2I, so that k_2 labels `octant' blocks of shape M; the calculation then proceeds through the following steps:
Z_(k_2*)(k_1) = Σ_(k_2 ∈ Z^n/2Z^n) X(k_1 + Mk_2) e[−(1/2) k_2* · k_2],
Z'_(k_2*)(k_1) = e[−k_2* · (N^−1 k_1)] Z_(k_2*)(k_1),
X*(k_2* + 2k_1*) = (1/|det N|) Σ_(k_1 ∈ Z^n/MZ^n) Z'_(k_2*)(k_1) e[−k_1* · (M^−1 k_1)],
i.e. the parity classes of results, corresponding to the different k_2*, are obtained separately. When the dimension n is 2 and the decimating matrix is diagonal, this analysis reduces to the `vector radix FFT' algorithms proposed by Rivard (1977) and Harris et al. (1977). These lead to substantial reductions in the number M of multiplications compared to the row–column method: M is reduced to 3M/4 by simultaneous 2 × 2 factoring, and to 15M/32 by simultaneous 4 × 4 factoring.
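One decimation-in-time stage of the 2 × 2 vector-radix scheme can be sketched directly from the parity-class decomposition: four quarter-size DFTs on the parity subarrays, recombined with twiddle factors. In this illustrative sketch the quarter-size transforms are computed directly (a full algorithm would recurse), using the unnormalized DFT convention:

```python
import cmath
from itertools import product

def dft2_direct(x):
    """Direct (unnormalized) 2D DFT, also used for checking."""
    N1, N2 = len(x), len(x[0])
    return [[sum(x[k1][k2]
                 * cmath.exp(-2j * cmath.pi * (k1 * h1 / N1 + k2 * h2 / N2))
                 for k1 in range(N1) for k2 in range(N2))
             for h2 in range(N2)] for h1 in range(N1)]

def vector_radix_step(x):
    """One 2 x 2 decimation-in-time stage: four quarter-size DFTs on the
    parity classes of the data, recombined with twiddle factors."""
    N1, N2 = len(x), len(x[0])
    M1, M2 = N1 // 2, N2 // 2
    F = {}
    for e1, e2 in product((0, 1), repeat=2):
        sub = [[x[2 * m1 + e1][2 * m2 + e2] for m2 in range(M2)]
               for m1 in range(M1)]
        F[e1, e2] = dft2_direct(sub)    # a full algorithm would recurse here
    return [[sum(cmath.exp(-2j * cmath.pi * (e1 * h1 / N1 + e2 * h2 / N2))
                 * F[e1, e2][h1 % M1][h2 % M2]      # twiddle * partial DFT
                 for e1, e2 in product((0, 1), repeat=2))
             for h2 in range(N2)] for h1 in range(N1)]
```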
The use of a non-diagonal decimating matrix may bring savings in computing time if the spectrum of the band-limited function under study is of such a shape as to pack more compactly in a non-rectangular than in a rectangular lattice (Mersereau, 1979). If, for instance, the support K of the spectrum Φ is contained in a sphere, then a decimation matrix producing a close packing of these spheres will yield an aliasing-free DFT algorithm with fewer sample points than the standard algorithm using a rectangular lattice.
Suppose that the decimation matrix N is diagonal, say N = diag(N^(1), N^(2), ..., N^(n)), and let each diagonal element be written in terms of its prime factors:
N^(i) = ∏_(j=1)^(m) p_j^(ν(i, j)),
where m is the total number of distinct prime factors present in the N^(i).
The CRT may be used to turn each 1D transform along dimension i into a multidimensional transform with a separate `pseudo-dimension' for each distinct prime factor of N^(i); the number μ_i of these pseudo-dimensions is equal to the cardinality of the set
{j ∈ {1, ..., m} | ν(i, j) > 0}.
The full n-dimensional transform thus becomes μ-dimensional, with μ = Σ_(i=1)^(n) μ_i.
We may now permute the μ pseudo-dimensions so as to bring into contiguous position those corresponding to the same prime factor p_j; the m resulting groups of pseudo-dimensions are said to define `p-primary' blocks. The initial transform is now written as a tensor product of m p-primary transforms, where transform j is on
∏_(i=1)^(n) p_j^(ν(i, j))
points [by convention, dimension i is not transformed if ν(i, j) = 0]. These p-primary transforms may be computed, for instance, by multidimensional Cooley–Tukey factorization (Section 1.3.3.3.1), which is faster than the straightforward row–column method. The final results may then be obtained by reversing all the permutations used.
The extra gain with respect to the multidimensional Cooley–Tukey method is that there are no twiddle factors between p-primary pieces corresponding to different primes p.
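The absence of twiddle factors between coprime pieces is the essence of the prime-factor (Good) index mapping, and is easy to verify on a small 1D example. The sketch below (an illustration, not the chapter's own notation) maps a 6-point DFT onto a 2 × 3 transform via the input map n = (N2·n1 + N1·n2) mod N and the CRT output map, with no twiddle multiplications between the two passes:

```python
import cmath

def dft1(x):
    """Direct 1D DFT."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N)
                for n in range(N)) for k in range(N)]

N1, N2 = 2, 3                 # coprime factors of N = 6
N = N1 * N2
x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]

# Input remapping turns the 6-point DFT into a 2 x 3 transform.
x2d = [[x[(N2 * n1 + N1 * n2) % N] for n2 in range(N2)] for n1 in range(N1)]

# Row-column pass: N2-point DFTs, then N1-point DFTs, with NO twiddle
# factors in between.
t = [dft1(row) for row in x2d]                                   # t[n1][k2]
u = [dft1([t[n1][k2] for n1 in range(N1)]) for k2 in range(N2)]  # u[k2][k1]

# CRT output map: k is the unique residue with k = k1 (mod N1)
# and k = k2 (mod N2).
X = [0j] * N
for k1 in range(N1):
    for k2 in range(N2):
        k = (k1 * N2 * pow(N2, -1, N1) + k2 * N1 * pow(N1, -1, N2)) % N
        X[k] = u[k2][k1]
```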
The case where N is not diagonal has been examined by Guessoum & Mersereau (1986).
Suppose that the CRT has been used as above to map an n-dimensional DFT to a μ-dimensional DFT. For each κ = (i, j) [κ runs over those pairs (i, j) such that ν(i, j) > 0], the Rader/Winograd procedure may be applied to put the matrix of the κth 1D DFT in the CBA normal form of a Winograd small FFT. The full DFT matrix may then be written, up to permutation of data and results, as
⊗_κ (C_κ B_κ A_κ).
A well-known property of the tensor product of matrices allows this to be rewritten as
(⊗_κ C_κ) × (⊗_κ B_κ) × (⊗_κ A_κ)
and thus to form a matrix in which the combined pre-addition, multiplication and post-addition matrices have been precomputed. This procedure, called nesting, can be shown to afford a reduction of the arithmetic operation count compared to the row–column method (Morris, 1978).
Clearly, the nesting rearrangement need not be applied to all μ dimensions, but can be restricted to any desired subset of them.
Nussbaumer's approach views the DFT as the evaluation of certain polynomials constructed from the data (as in Section 1.3.3.2.4). For instance, putting ω = e(1/N), the 1D N-point DFT
X*(k*) = (1/N) Σ_k X(k) ω^(−k*k)
may be written
X*(k*) = (1/N) Q(ω^(−k*)),
where the polynomial Q is defined by
Q(z) = Σ_k X(k) z^k.
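This polynomial view is directly checkable: evaluating the data polynomial at the inverse powers of the root of unity reproduces the DFT. A minimal sketch (unnormalized convention, for clarity):

```python
import cmath

def dft_direct(x):
    """Direct (unnormalized) 1D DFT."""
    N = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * ks / N)
                for k in range(N)) for ks in range(N)]

def dft_by_poly(x):
    """N-point DFT as the evaluation of Q(z) = sum_k x[k] z^k at the
    points z = w^(-k*), w = exp(2j*pi/N)."""
    N = len(x)
    w = cmath.exp(2j * cmath.pi / N)

    def Q(z):
        acc = 0j                   # Horner evaluation of Q at z
        for c in reversed(x):
            acc = acc * z + c
        return acc

    return [Q(w ** (-ks)) for ks in range(N)]
```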
Let us consider (Nussbaumer & Quandalle, 1979) a 2D transform of size N × N:
X*(k_1*, k_2*) = (1/N²) Σ_(k_1) Σ_(k_2) X(k_1, k_2) ω^(−(k_1 k_1* + k_2 k_2*)).
By introduction of the polynomials
Q_(k_1)(z) = Σ_(k_2) X(k_1, k_2) z^(k_2),
this may be rewritten:
X*(k_1*, k_2*) = (1/N²) Σ_(k_1) ω^(−k_1 k_1*) Q_(k_1)(ω^(−k_2*)).
Let us now suppose that k_2* is coprime to N. Then k_2* has a unique inverse modulo N (denoted by 1/k_2*), so that multiplication by k_2* simply permutes the elements of Z/NZ and hence
Σ_(k ∈ Z/NZ) f(k) = Σ_(k ∈ Z/NZ) f(k_2* k)
for any function f over Z/NZ. We may thus write:
X*(k_1*, k_2*) = (1/N²) S_(k_1*/k_2*)(ω^(−k_2*)),
where
S_j(z) = Σ_(k_1) z^(j k_1) Q_(k_1)(z).
Since only the value of polynomial S_j(z) at z = ω^(−k_2*) is involved in the result, the computation of S_j may be carried out modulo the unique cyclotomic polynomial P(z) such that P(ω^(−k_2*)) = 0. Thus, if we define:
T_j(z) = S_j(z) mod P(z),
we may write:
X*(k_1*, k_2*) = (1/N²) T_(k_1*/k_2*)(ω^(−k_2*))
or equivalently
X*(k_1*, k_2*) = (1/N²) T_j(ω^(−k_2*)),   j = k_1*/k_2* mod N.
For N an odd prime p, all nonzero values of k_2* are coprime with p so that the p × p-point DFT may be calculated as follows:
(1) form the polynomials T_j(z) = S_j(z) mod P(z) for j ∈ Z/pZ, where P(z) = 1 + z + ... + z^(p−1);
(2) evaluate T_j(ω^(−k_2*)) for all j and all k_2* ≠ 0;
(3) put X*(k_1*, k_2*) = (1/p²) T_(k_1*/k_2*)(ω^(−k_2*)) for k_2* ≠ 0;
(4) calculate the remaining results X*(k_1*, 0) by an ordinary p-point DFT on the sums Σ_(k_2) X(k_1, k_2).
Step (1) is a set of p `polynomial transforms' involving no multiplications; step (2) consists of p DFTs on p points each since, if the polynomial computed in step (1) for a given j has coefficients t_(jl), then its value at ω^(−k_2*) is Σ_l t_(jl) ω^(−l k_2*); step (3) is a permutation; and step (4) is a p-point DFT. Thus the 2D DFT on p × p points, which takes 2p p-point DFTs by the row–column method, involves only (p + 1) p-point DFTs; the other (p − 1) DFTs have been replaced by polynomial transforms involving only additions.
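The four steps can be traced on a tiny example. The sketch below (an illustration of the scheme, not the chapter's own code) works modulo z^p − 1 rather than the cyclotomic polynomial 1 + z + ... + z^(p−1): this keeps the sketch short while remaining correct, at the price of slightly longer polynomials, and uses the unnormalized DFT convention:

```python
import cmath

p = 3
w = cmath.exp(-2j * cmath.pi / p)
x = [[1.0, 2.0, -1.0], [0.0, 1.5, 3.0], [-2.0, 0.5, 1.0]]

def dftp(v):
    """Direct (unnormalized) p-point DFT."""
    return [sum(v[k] * w ** (k * ks) for k in range(p)) for ks in range(p)]

# Step (1): polynomial transform -- additions and index permutations only.
# Z[j] holds the coefficients of sum_{k1} z^{j*k1} Q_{k1}(z) mod (z^p - 1),
# where Q_{k1}(z) = sum_{k2} x[k1][k2] z^{k2}.
Z = [[sum(x[k1][(c - j * k1) % p] for k1 in range(p)) for c in range(p)]
     for j in range(p)]

# Step (2): p DFTs on p points -- evaluate each Z[j] at all powers of w.
E = [dftp(Z[j]) for j in range(p)]

# Steps (3)/(4): a permutation of the results for k2* != 0, plus one
# extra p-point DFT of the row sums for the k2* = 0 line.
X = [[0j] * p for _ in range(p)]
S = dftp([sum(row) for row in x])
for k1s in range(p):
    X[k1s][0] = S[k1s]
    for k2s in range(1, p):
        j = (k1s * pow(k2s, -1, p)) % p     # j = k1*/k2* mod p
        X[k1s][k2s] = E[j][k2s]
```

Only p + 1 = 4 calls to `dftp` occur, against 2p = 6 for the row–column method.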
This procedure can be extended to n dimensions, and reduces the number of 1D p-point DFTs from np^(n−1) for the row–column method to (p^n − 1)/(p − 1), at the cost of introducing extra additions in the polynomial transforms.
A similar algorithm has been formulated by Auslander et al. (1983) in terms of Galois theory.
The mathematical analysis of the structure of DFT computations has brought to light a broad variety of possibilities for reducing or reshaping their arithmetic complexity. All of them are `analytic' in that they break down large transforms into a succession of smaller ones.
These results may now be considered from the converse `synthetic' viewpoint as providing a list of procedures for assembling them.
A given DFT may then be carried out by many distinct global algorithms assembled from these building blocks. The diagrams in Fig. 1.3.3.1 illustrate a few of the options available to compute a 400-point DFT. They may differ greatly in their arithmetic operation counts.
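A toy recursive driver makes the multiplicity of options concrete: the same transform may follow any radix plan whose factors multiply to N, e.g. 400 = 4 · 10 · 10 or 2 · 2 · 100 (the specific plans of Fig. 1.3.3.1 are not reproduced here). All plans give identical results but different operation counts:

```python
import cmath

def dft_direct(x):
    """Direct O(N^2) DFT, used as the terminal small transform."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N)
                for n in range(N)) for k in range(N)]

def fft_plan(x, plan):
    """Recursive Cooley-Tukey DFT following an explicit radix plan,
    e.g. plan = [4, 10, 10] for N = 400; the product of the plan's
    entries must equal len(x)."""
    N = len(x)
    if len(plan) == 1:
        return dft_direct(x)
    N1 = plan[0]
    N2 = N // N1
    # Inner N1-point transforms on the decimated subsequences.
    inner = [dft_direct([x[N2 * n1 + n2] for n1 in range(N1)])
             for n2 in range(N2)]                     # inner[n2][k1]
    X = [0j] * N
    for k1 in range(N1):
        # Twiddle factors, then an N2-point transform carried out with
        # the rest of the plan.
        t = [inner[n2][k1] * cmath.exp(-2j * cmath.pi * n2 * k1 / N)
             for n2 in range(N2)]
        sub = fft_plan(t, plan[1:])
        for k2 in range(N2):
            X[k1 + N1 * k2] = sub[k2]
    return X
```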
To obtain a truly useful measure of the computational complexity of a DFT algorithm, its arithmetic operation count must be tempered by computer architecture considerations. Three main types of trade-off must be borne in mind.
Many of the mathematical developments above took place in the context of single-processor serial computers, where f.p. addition is substantially cheaper than f.p. multiplication but where integer address arithmetic has to compete with f.p. arithmetic for processor cycles. As a result, the alternatives to the Cooley–Tukey algorithm hardly ever led to particularly favourable trade-offs, thus creating the impression that there was little to gain by switching to more exotic algorithms.
The advent of new machine architectures with vector and/or parallel processing features has greatly altered this picture (Pease, 1968; Korn & Lambiotte, 1979; Fornberg, 1981; Swarztrauber, 1984).
Another major consideration is that of data flow [see e.g. Nawab & McClellan (1979)]. Serial machines have only a few registers and few paths connecting them, and allow little or no overlap between computation and data movement. New architectures, on the other hand, comprise banks of vector registers (or `cache memory') besides the usual internal registers, and dedicated ALUs can service data transfers between several of them simultaneously and concurrently with computation.
In this new context, the devices described in Sections 1.3.3.2 and 1.3.3.3 for altering the balance between the various types of arithmetic operations, and reshaping the data flow during the computation, are invaluable. The field of machine-dependent DFT algorithm design is thriving on them [see e.g. Temperton (1983a,b,c, 1985); Agarwal & Cooley (1986, 1987)].
In order to explore systematically all possible algorithms for carrying out a given DFT computation, and to pick the one best suited to a given machine, attempts have been made to develop: (i) a general mathematical notation in which all DFT computations can be described; (ii) formal rules for manipulating such descriptions into equivalent ones; (iii) a strategy for searching among all equivalent descriptions for the most efficient.
Task (i) can be accomplished by systematic use of a tensor product notation to represent the various stages into which the DFT can be factored (reindexing, small transforms on subsets of indices, twiddle factors, digit-reversal permutations).
Task (ii) may for instance use the Winograd CBA normal form for each small transform, then apply the rules governing the rearrangement of tensor product and ordinary product × operations on matrices. The matching of these rearrangements to the architecture of a vector and/or parallel computer can be formalized algebraically [see e.g. Chapter 2 of Tolimieri et al. (1989)].
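The basic rearrangement rule invoked here, (A ⊗ I)(I ⊗ B) = A ⊗ B, can be checked exactly with integer matrices. A minimal pure-Python sketch (helper names are illustrative):

```python
def matmul(A, B):
    """Ordinary matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def kron(A, B):
    """Tensor (Kronecker) product of matrices."""
    return [[A[i][j] * B[k][l]
             for j in range(len(A[0])) for l in range(len(B[0]))]
            for i in range(len(A)) for k in range(len(B))]

def eye(n):
    """n x n identity matrix."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

A = [[1, 2], [3, -1]]
B = [[0, 1, 1], [2, 0, -1], [1, 1, 0]]

# The mixed-product rule: A (x) B = (A (x) I) x (I (x) B),
# which is exactly the row-column factorization of Section 1.3.3.3.1.
lhs = kron(A, B)
rhs = matmul(kron(A, eye(3)), kron(eye(2), B))
```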
Task (iii) is a complex search which requires techniques such as dynamic programming (Bellman, 1958).
Johnson & Burrus (1983) have proposed and tested such a scheme to identify the optimal trade-offs between prime factor nesting and Winograd nesting of small Winograd transforms. In step (ii), they further decomposed the pre-addition matrix A and post-addition matrix C into several factors, so that the number of design options available becomes very large: the N-point DFT when N has four factors can be calculated in over 10^{12} distinct ways.
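The flavour of such a search can be conveyed by a toy dynamic programme over factorizations. The cost model below is a deliberately crude assumption (a direct r-point DFT costing r² multiplications, plus N twiddle multiplications per split), not the operation counts used in the actual study:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def best(N):
    """Return (cost, plan) for an N-point DFT under a toy cost model:
    direct evaluation costs N^2; splitting N = r * (N/r) costs
    (N/r) direct r-point transforms, N twiddle multiplications and
    r recursively optimized (N/r)-point transforms."""
    choice = (N * N, (N,))          # direct evaluation
    for r in range(2, N):
        if N % r == 0:
            sub_cost, sub_plan = best(N // r)
            c = (N // r) * r * r + N + r * sub_cost
            if c < choice[0]:
                choice = (c, (r,) + sub_plan)
    return choice
```

Under any such model the dynamic programme explores every factorization of N exactly once, which is what makes the very large design space of the text searchable at all.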
This large family of nested algorithms contains the prime factor algorithm and the Winograd algorithms as particular cases, but usually achieves greater efficiency than either by reducing the f.p. multiplication count while keeping the number of f.p. additions small.
There is little doubt that this systematic approach will be extended so as to incorporate all available methods of restructuring the DFT.
References
Agarwal, R. C. & Cooley, J. W. (1986). Fourier transform and convolution subroutines for the IBM 3090 Vector Facility. IBM J. Res. Dev. 30, 145–162.
Agarwal, R. C. & Cooley, J. W. (1987). Vectorized mixed radix discrete Fourier transform algorithms. Proc. IEEE, 75, 1283–1292.
Auslander, L., Feig, E. & Winograd, S. (1983). New algorithms for the multidimensional discrete Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 31, 388–403.
Auslander, L., Tolimieri, R. & Winograd, S. (1982). Hecke's theorem in quadratic reciprocity, finite nilpotent groups and the Cooley–Tukey algorithm. Adv. Math. 43, 122–172.
Bellman, R. (1958). Dynamic programming and stochastic control processes. Inf. Control, 1, 228–239.
Burrus, C. S. & Eschenbacher, P. W. (1981). An in-place, in-order prime factor FFT algorithm. IEEE Trans. Acoust. Speech Signal Process. 29, 806–817.
Fornberg, A. (1981). A vector implementation of the fast Fourier transform algorithm. Math. Comput. 36, 189–191.
Guessoum, A. & Mersereau, R. M. (1986). Fast algorithms for the multidimensional discrete Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 34, 937–943.
Harris, D. B., McClellan, J. H., Chan, D. S. K. & Schuessler, H. W. (1977). Vector radix fast Fourier transform. Rec. 1977 IEEE Internat. Conf. Acoust. Speech Signal Process. pp. 548–551.
Johnson, H. W. & Burrus, C. S. (1983). The design of optimal DFT algorithms using dynamic programming. IEEE Trans. Acoust. Speech Signal Process. 31, 378–387.
Kolba, D. P. & Parks, T. W. (1977). A prime factor FFT algorithm using high-speed convolution. IEEE Trans. Acoust. Speech Signal Process. 25, 281–294.
Korn, D. G. & Lambiotte, J. J. Jr (1979). Computing the fast Fourier transform on a vector computer. Math. Comput. 33, 977–992.
Mersereau, R. & Speake, T. C. (1981). A unified treatment of Cooley–Tukey algorithms for the evaluation of the multidimensional DFT. IEEE Trans. Acoust. Speech Signal Process. 29, 1011–1018.
Mersereau, R. M. (1979). The processing of hexagonally sampled two-dimensional signals. Proc. IEEE, 67, 930–949.
Morris, R. L. (1978). A comparative study of time efficient FFT and WFTA programs for general purpose computers. IEEE Trans. Acoust. Speech Signal Process. 26, 141–150.
Nawab, H. & McClellan, J. H. (1979). Bounds on the minimum number of data transfers in WFTA and FFT programs. IEEE Trans. Acoust. Speech Signal Process. 27, 393–398.
Nussbaumer, H. J. & Quandalle, P. (1979). Fast computation of discrete Fourier transforms using polynomial transforms. IEEE Trans. Acoust. Speech Signal Process. 27, 169–181.
Pease, M. C. (1968). An adaptation of the fast Fourier transform for parallel processing. J. Assoc. Comput. Mach. 15, 252–264.
Rivard, G. E. (1977). Direct fast Fourier transform of bivariate functions. IEEE Trans. Acoust. Speech Signal Process. 25, 250–252.
Silverman, H. F. (1977). An introduction to programming the Winograd Fourier transform algorithm (WFTA). IEEE Trans. Acoust. Speech Signal Process. 25, 152–165.
Swarztrauber, P. N. (1984). FFT algorithms for vector computers. Parallel Comput. 1, 45–63.
Temperton, C. (1983a). Self-sorting mixed-radix fast Fourier transforms. J. Comput. Phys. 52, 1–23.
Temperton, C. (1983b). A note on the prime factor FFT algorithm. J. Comput. Phys. 52, 198–204.
Temperton, C. (1983c). Fast mixed-radix real Fourier transforms. J. Comput. Phys. 52, 340–350.
Temperton, C. (1985). Implementation of a self-sorting in-place prime factor FFT algorithm. J. Comput. Phys. 58, 283–299.
Tolimieri, R., An, M. & Lu, C. (1989). Algorithms for Discrete Fourier Transform and Convolution. New York: Springer-Verlag.
Uhrich, M. L. (1969). Fast Fourier transforms without sorting. IEEE Trans. Audio Electroacoust. AU-17, 170–172.