International
Tables for Crystallography Volume B Reciprocal space Edited by U. Shmueli © International Union of Crystallography 2010 
International Tables for Crystallography (2010). Vol. B, ch. 1.3, pp. 6162

The mathematical analysis of the structure of DFT computations has brought to light a broad variety of possibilities for reducing or reshaping their arithmetic complexity. All of them are `analytic' in that they break down large transforms into a succession of smaller ones.
These results may now be considered from the converse `synthetic' viewpoint as providing a list of procedures for assembling them:
The simplest DFT may then be carried out into a global algorithm in many different ways. The diagrams in Fig. 1.3.3.1 illustrate a few of the options available to compute a 400point DFT. They may differ greatly in their arithmetic operation counts.
To obtain a truly useful measure of the computational complexity of a DFT algorithm, its arithmetic operation count must be tempered by computer architecture considerations. Three main types of tradeoffs must be borne in mind:
Many of the mathematical developments above took place in the context of singleprocessor serial computers, where f.p. addition is substantially cheaper than f.p. multiplication but where integer address arithmetic has to compete with f.p. arithmetic for processor cycles. As a result, the alternatives to the Cooley–Tukey algorithm hardly ever led to particularly favourable tradeoffs, thus creating the impression that there was little to gain by switching to more exotic algorithms.
The advent of new machine architectures with vector and/or parallel processing features has greatly altered this picture (Pease, 1968; Korn & Lambiotte, 1979; Fornberg, 1981; Swartzrauber, 1984):
Another major consideration is that of data flow [see e.g. Nawab & McClellan (1979)]. Serial machines only have few registers and few paths connecting them, and allow little or no overlap between computation and data movement. New architectures, on the other hand, comprise banks of vector registers (or `cache memory') besides the usual internal registers, and dedicated ALUs can service data transfers between several of them simultaneously and concurrently with computation.
In this new context, the devices described in Sections 1.3.3.2 and 1.3.3.3 for altering the balance between the various types of arithmetic operations, and reshaping the data flow during the computation, are invaluable. The field of machinedependent DFT algorithm design is thriving on them [see e.g. Temperton (1983a,b,c, 1985); Agarwal & Cooley (1986, 1987)].
In order to explore systematically all possible algorithms for carrying out a given DFT computation, and to pick the one best suited to a given machine, attempts have been made to develop:
Task (i) can be accomplished by systematic use of a tensor product notation to represent the various stages into which the DFT can be factored (reindexing, small transforms on subsets of indices, twiddle factors, digitreversal permutations).
Task (ii) may for instance use the Winograd CBA normal form for each small transform, then apply the rules governing the rearrangement of tensor product and ordinary product × operations on matrices. The matching of these rearrangements to the architecture of a vector and/or parallel computer can be formalized algebraically [see e.g. Chapter 2 of Tolimieri et al. (1989)].
Task (iii) is a complex search which requires techniques such as dynamic programming (Bellman, 1958).
Johnson & Burrus (1983) have proposed and tested such a scheme to identify the optimal tradeoffs between prime factor nesting and Winograd nesting of small Winograd transforms. In step (ii), they further decomposed the preaddition matrix A and postaddition matrix C into several factors, so that the number of design options available becomes very large: the Npoint DFT when N has four factors can be calculated in over 10^{12} distinct ways.
This large family of nested algorithms contains the prime factor algorithm and the Winograd algorithms as particular cases, but usually achieves greater efficiency than either by reducing the f.p. multiplication count while keeping the number of f.p. additions small.
There is little doubt that this systematic approach will be extended so as to incorporate all available methods of restructuring the DFT.
References
Agarwal, R. C. & Cooley, J. W. (1986). Fourier transform and convolution subroutines for the IBM 3090 Vector facility. IBM J. Res. Dev. 30, 145–162.Agarwal, R. C. & Cooley, J. W. (1987). Vectorized mixed radix discrete Fourier transform algorithms. Proc. IEEE, 75, 1283–1292.
Bellman, R. (1958). Dynamic programming and stochastic control processes. Inf. Control, 1, 228–239.
Burrus, C. S. & Eschenbacher, P. W. (1981). An inplace, inorder prime factor FFT algorithm. IEEE Trans. Acoust. Speech Signal Process. 29, 806–817.
Fornberg, A. (1981). A vector implementation of the fast Fourier transform algorithm. Math. Comput. 36, 189–191.
Johnson, H. W. & Burrus, C. S. (1983). The design of optimal DFT algorithms using dynamic programming. IEEE Trans. Acoust. Speech Signal Process. 31, 378–387.
Kolba, D. P. & Parks, T. W. (1977). A prime factor FFT algorithm using highspeed convolution. IEEE Trans. Acoust. Speech Signal Process. 25, 281–294.
Korn, D. G. & Lambiotte, J. J. Jr (1979). Computing the fast Fourier transform on a vector computer. Math. Comput. 33, 977–992.
Morris, R. L. (1978). A comparative study of time efficient FFT and WFTA programs for general purpose computers. IEEE Trans. Acoust. Speech Signal Process. 26, 141–150.
Nawab, H. & McClellan, J. H. (1979). Bounds on the minimum number of data transfers in WFTA and FFT programs. IEEE Trans. Acoust. Speech Signal Process. 27, 393–398.
Pease, M. C. (1968). An adaptation of the fast Fourier transform for parallel processing. J. Assoc. Comput. Mach. 15, 252–264.
Silverman, H. F. (1977). An introduction to programming the Winograd Fourier transform algorithm (WFTA). IEEE Trans. Acoust. Speech Signal Process. 25, 152–165.
Swartzrauber, P. N. (1984). FFT algorithms for vector computers. Parallel Comput. 1, 45–63.
Temperton, C. (1983a). Selfsorting mixedradix fast Fourier transforms. J. Comput. Phys. 52, 1–23.
Temperton, C. (1983b). A note on the prime factor FFT algorithm. J. Comput. Phys. 52, 198–204.
Temperton, C. (1983c). Fast mixedradix real Fourier transforms. J. Comput. Phys. 52, 340–350.
Temperton, C. (1985). Implementation of a selfsorting in place prime factor FFT algorithm. J. Comput. Phys. 58, 283–299.
Tolimieri, R., An, M. & Lu, C. (1989). Algorithms for Discrete Fourier Transform and Convolution. New York: SpringerVerlag.
Uhrich, M. L. (1969). Fast Fourier transforms without sorting. IEEE Trans. AU, 17, 170–172.