Section 1.3.3.3.3.2. Computer architecture considerations

G. Bricogne

MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, England, and LURE, Bâtiment 209D, Université Paris-Sud, 91405 Orsay, France


To obtain a truly useful measure of the computational complexity of a DFT algorithm, its arithmetic operation count must be tempered by computer architecture considerations. Three main types of trade-offs must be borne in mind:

  • (i) reductions in floating-point (f.p.) arithmetic count are obtained by reindexing, hence at the cost of an increase in integer arithmetic on addresses, although some shortcuts may be found (Uhrich, 1969; Burrus & Eschenbacher, 1981);

  • (ii) reduction in the f.p. multiplication count usually leads to a large increase in the f.p. addition count (Morris, 1978), as illustrated by the sketch following this list;

  • (iii) nesting can increase execution speed, but causes a loss of modularity and hence complicates program development (Silverman, 1977; Kolba & Parks, 1977).
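
Trade-off (ii) can be made concrete with a small sketch. The usual complex product (a + ib)(c + id) costs four real multiplications and two additions, whereas a well-known rearrangement computes the same product with three multiplications and five additions; the saving is worthwhile only where an addition is genuinely cheaper than a multiplication. The C fragment below (a sketch only; the function names are chosen for the example) contrasts the two forms:

    #include <stdio.h>

    /* Sketch of trade-off (ii): two ways of forming (a + ib)(c + id).
     * The second saves one multiplication at the price of three extra
     * additions/subtractions. */

    /* usual form: 4 multiplications, 2 additions/subtractions */
    static void cmul_4m2a(double a, double b, double c, double d,
                          double *re, double *im)
    {
        *re = a * c - b * d;
        *im = a * d + b * c;
    }

    /* rearranged form: 3 multiplications, 5 additions/subtractions */
    static void cmul_3m5a(double a, double b, double c, double d,
                          double *re, double *im)
    {
        double t1 = c * (a + b);   /* ac + bc */
        double t2 = a * (d - c);   /* ad - ac */
        double t3 = b * (c + d);   /* bc + bd */
        *re = t1 - t3;             /* ac - bd */
        *im = t1 + t2;             /* ad + bc */
    }

    int main(void)
    {
        double re, im;
        cmul_4m2a(1.5, -2.0, 0.75, 3.0, &re, &im);
        printf("4m2a: %g %+gi\n", re, im);
        cmul_3m5a(1.5, -2.0, 0.75, 3.0, &re, &im);
        printf("3m5a: %g %+gi\n", re, im);
        return 0;
    }

Multiplication-saving DFT algorithms rest on systematic exchanges of this kind, which is why their advantage depends so strongly on the relative costs of the two operations on a given machine.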

Many of the mathematical developments above took place in the context of single-processor serial computers, where f.p. addition is substantially cheaper than f.p. multiplication but where integer address arithmetic has to compete with f.p. arithmetic for processor cycles. As a result, the alternatives to the Cooley–Tukey algorithm hardly ever led to particularly favourable trade-offs, thus creating the impression that there was little to gain by switching to more exotic algorithms.

The advent of new machine architectures with vector and/or parallel processing features has greatly altered this picture (Pease, 1968; Korn & Lambiotte, 1979; Fornberg, 1981; Swarztrauber, 1984):

  • (i) pipelining equalizes the cost of f.p. addition and f.p. multiplication, and the ideal `blend' of the two types of operations depends solely on the number of adder and multiplier units available in each machine;

  • (ii) integer address arithmetic is delegated to specialized arithmetic and logical units (ALUs) operating concurrently with the f.p. units, so that complex reindexing schemes may be used without loss of overall efficiency; both points are illustrated by the loop sketched below.
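
Points (i) and (ii) can be seen in a single radix-2 butterfly pass written as one long unit-stride loop over precomputed twiddle factors (a sketch only; the array names and the half-length h are chosen for the example). Every iteration issues the same fixed blend of f.p. multiplications and additions with no data-dependent branching, so a pipelined machine can keep its adder and multiplier units busy while the integer units advance the addresses concurrently:

    #include <stddef.h>

    /* One radix-2 butterfly pass over complex data of length 2*h, stored
     * as separate real and imaginary arrays, with twiddle factors
     * (wr[k] + i*wi[k]) precomputed.  The loop body has a fixed mix of
     * f.p. multiplications and additions and only regular, unit-stride
     * address arithmetic. */
    static void butterfly_pass(double *xr, double *xi,
                               const double *wr, const double *wi,
                               size_t h)
    {
        for (size_t k = 0; k < h; k++) {
            /* twiddled second half: (wr + i*wi) * (xr[k+h] + i*xi[k+h]) */
            double tr = wr[k] * xr[k + h] - wi[k] * xi[k + h];
            double ti = wr[k] * xi[k + h] + wi[k] * xr[k + h];
            double ur = xr[k], ui = xi[k];
            /* sum and difference with the first half */
            xr[k]     = ur + tr;   xi[k]     = ui + ti;
            xr[k + h] = ur - tr;   xi[k + h] = ui - ti;
        }
    }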

Another major consideration is that of data flow [see e.g. Nawab & McClellan (1979)]. Serial machines have only a few registers and few paths connecting them, and allow little or no overlap between computation and data movement. New architectures, on the other hand, comprise banks of vector registers (or `cache memory') besides the usual internal registers, and dedicated ALUs can service data transfers between several of them simultaneously and concurrently with computation.
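
One way of reshaping the data flow to exploit such architectures, sketched below in C under the assumptions of row-major storage and an illustrative tile size, is to carry out the transpose between the row and column passes of a multidimensional transform in small square tiles: every load and store then falls within a block small enough to be held in the vector registers or cache, and the resulting transfers are regular enough to be overlapped with computation.

    #include <stddef.h>

    #define TILE 16  /* tile edge; in practice tuned to the fast-memory size */

    /* Sketch only: in-place transpose of an n x n array held as separate
     * real and imaginary parts in row-major order.  Working tile by tile
     * keeps each group of loads and stores within a small, regular block,
     * the kind of data flow that vector registers and cache memory can
     * service efficiently. */
    static void blocked_transpose(double *ar, double *ai, size_t n)
    {
        for (size_t i0 = 0; i0 < n; i0 += TILE)
            for (size_t j0 = i0; j0 < n; j0 += TILE)
                for (size_t i = i0; i < i0 + TILE && i < n; i++) {
                    /* in a diagonal tile, start past the diagonal so each
                     * pair (i, j) is swapped exactly once */
                    size_t j = (i0 == j0) ? i + 1 : j0;
                    for (; j < j0 + TILE && j < n; j++) {
                        double t;
                        t = ar[i*n + j]; ar[i*n + j] = ar[j*n + i]; ar[j*n + i] = t;
                        t = ai[i*n + j]; ai[i*n + j] = ai[j*n + i]; ai[j*n + i] = t;
                    }
                }
    }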

In this new context, the devices described in Sections 1.3.3.2 and 1.3.3.3 for altering the balance between the various types of arithmetic operations, and reshaping the data flow during the computation, are invaluable. The field of machine-dependent DFT algorithm design is thriving on them [see e.g. Temperton (1983a,b,c, 1985); Agarwal & Cooley (1986, 1987)].

References

Agarwal, R. C. & Cooley, J. W. (1986). Fourier transform and convolution subroutines for the IBM 3090 Vector facility. IBM J. Res. Dev. 30, 145–162.
Agarwal, R. C. & Cooley, J. W. (1987). Vectorized mixed radix discrete Fourier transform algorithms. Proc. IEEE, 75, 1283–1292.
Burrus, C. S. & Eschenbacher, P. W. (1981). An in-place, in-order prime factor FFT algorithm. IEEE Trans. Acoust. Speech Signal Process. 29, 806–817.
Fornberg, B. (1981). A vector implementation of the fast Fourier transform algorithm. Math. Comput. 36, 189–191.
Kolba, D. P. & Parks, T. W. (1977). A prime factor FFT algorithm using high-speed convolution. IEEE Trans. Acoust. Speech Signal Process. 25, 281–294.
Korn, D. G. & Lambiotte, J. J. Jr (1979). Computing the fast Fourier transform on a vector computer. Math. Comput. 33, 977–992.
Morris, R. L. (1978). A comparative study of time efficient FFT and WFTA programs for general purpose computers. IEEE Trans. Acoust. Speech Signal Process. 26, 141–150.
Nawab, H. & McClellan, J. H. (1979). Bounds on the minimum number of data transfers in WFTA and FFT programs. IEEE Trans. Acoust. Speech Signal Process. 27, 393–398.
Pease, M. C. (1968). An adaptation of the fast Fourier transform for parallel processing. J. Assoc. Comput. Mach. 15, 252–264.
Silverman, H. F. (1977). An introduction to programming the Winograd Fourier transform algorithm (WFTA). IEEE Trans. Acoust. Speech Signal Process. 25, 152–165.
Swarztrauber, P. N. (1984). FFT algorithms for vector computers. Parallel Comput. 1, 45–63.
Temperton, C. (1983a). Self-sorting mixed-radix fast Fourier transforms. J. Comput. Phys. 52, 1–23.
Temperton, C. (1983b). A note on the prime factor FFT algorithm. J. Comput. Phys. 52, 198–204.
Temperton, C. (1983c). Fast mixed-radix real Fourier transforms. J. Comput. Phys. 52, 340–350.
Temperton, C. (1985). Implementation of a self-sorting in place prime factor FFT algorithm. J. Comput. Phys. 58, 283–299.
Uhrich, M. L. (1969). Fast Fourier transforms without sorting. IEEE Trans. Audio Electroacoust. AU-17, 170–172.