International
Tables for
Crystallography
Volume B
Reciprocal space
Edited by U. Shmueli

International Tables for Crystallography (2010). Vol. B, ch. 1.3, pp. 58-62   | 1 | 2 |

## Section 1.3.3.3. Multidimensional algorithms

G. Bricognea

aGlobal Phasing Ltd, Sheraton House, Suites 14–16, Castle Park, Cambridge CB3 0AX, England, and LURE, Bâtiment 209D, Université Paris-Sud, 91405 Orsay, France

#### 1.3.3.3. Multidimensional algorithms

| top | pdf |

From an algorithmic point of view, the distinction between one-dimensional (1D) and multidimensional DFTs is somewhat blurred by the fact that some factoring techniques turn a 1D transform into a multidimensional one. The distinction made here, however, is a practical one and is based on the dimensionality of the indexing sets for data and results. This section will therefore be concerned with the problem of factoring the DFT when the indexing sets for the input data and output results are multidimensional.

#### 1.3.3.3.1. The method of successive one-dimensional transforms

| top | pdf |

The DFT was defined in Section 1.3.2.7.4 in an n-dimensional setting and it was shown that when the decimation matrix N is diagonal, say , then has a tensor product structure:This may be rewritten as follows:where the I's are identity matrices and × denotes ordinary matrix multiplication. The matrix within each bracket represents a one-dimensional DFT along one of the n dimensions, the other dimensions being left untransformed. As these matrices commute, the order in which the successive 1D DFTs are performed is immaterial.

This is the most straightforward method for building an n-dimensional algorithm from existing 1D algorithms. It is known in crystallography under the name of Beevers–Lipson factorization' (Section 1.3.4.3.1), and in signal processing as the row–column method'.

#### 1.3.3.3.2. Multidimensional factorization

| top | pdf |

Substantial reductions in the arithmetic cost, as well as gains in flexibility, can be obtained if the factoring of the DFT is carried out in several dimensions simultaneously. The presentation given here is a generalization of that of Mersereau & Speake (1981), using the abstract setting established independently by Auslander, Tolimieri & Winograd (1982).

Let us return to the general n-dimensional setting of Section 1.3.2.7.4, where the DFT was defined for an arbitrary decimation matrix N by the formulae (where denotes ):with

#### 1.3.3.3.2.1. Multidimensional Cooley–Tukey factorization

| top | pdf |

Let us now assume that this decimation can be factored into d successive decimations, i.e. thatand henceThen the coset decomposition formulae corresponding to these successive decimations (Section 1.3.2.7.1) can be combined as follows:with . Therefore, any may be written uniquely asSimilarly:so that any may be written uniquely aswith . These decompositions are the vector analogues of the multi-radix number representation systems used in the Cooley–Tukey factorization.

We may then write the definition of with factors asThe argument of e(–) may be expanded asThe first summand may be recognized as a twiddle factor, the second and third as the kernels of and , respectively, while the fourth is an integer which may be dropped. We are thus led to a vector-radix' version of the Cooley–Tukey algorithm, in which the successive decimations may be introduced in all n dimensions simultaneously by general integer matrices. The computation may be decomposed into five stages analogous to those of the one-dimensional algorithm of Section 1.3.3.2.1:

 (i) form the vectors of shape by (ii) calculate the transforms on points: (iii) form the vectors of shape by (iv) calculate the transforms on points: (v) collect as .

The initial -point transform can thus be performed as transforms on points, followed by transforms on points. This process can be applied successively to all d factors. The same decomposition applies to , up to the complex conjugation of twiddle factors, the normalization factor being obtained as the product of the factors in the successive partial transforms .

The geometric interpretation of this factorization in terms of partial transforms on translates of sublattices applies in full to this n-dimensional setting; in particular, the twiddle factors are seen to be related to the residual translations which place the sublattices in register within the big lattice. If the intermediate transforms are performed in place, then the quantitywill eventually be found at locationso that the final results will have to be unscrambled by a process which may be called coset reversal', the vector equivalent of digit reversal.

Factoring by 2 in all n dimensions simultaneously, i.e. taking , leads to n-dimensional butterflies'. Decimation in time corresponds to the choice , so that is an n-dimensional parity class; the calculation then proceeds byDecimation in frequency corresponds to the choice , , so that labels octant' blocks of shape M; the calculation then proceeds through the following steps:i.e. the parity classes of results, corresponding to the different , are obtained separately. When the dimension n is 2 and the decimating matrix is diagonal, this analysis reduces to the vector radix FFT' algorithms proposed by Rivard (1977) and Harris et al. (1977). These lead to substantial reductions in the number M of multiplications compared to the row–column method: M is reduced to by simultaneous factoring, and to by simultaneous factoring.

The use of a nondiagonal decimating matrix may bring savings in computing time if the spectrum of the band-limited function under study is of such a shape as to pack more compactly in a nonrectangular than in a rectangular lattice (Mersereau, 1979). If, for instance, the support K of the spectrum Φ is contained in a sphere, then a decimation matrix producing a close packing of these spheres will yield an aliasing-free DFT algorithm with fewer sample points than the standard algorithm using a rectangular lattice.

#### 1.3.3.3.2.2. Multidimensional prime factor algorithm

| top | pdf |

Suppose that the decimation matrix N is diagonaland let each diagonal element be written in terms of its prime factors:where m is the total number of distinct prime factors present in the .

The CRT may be used to turn each 1D transform along dimension i into a multidimensional transform with a separate pseudo-dimension' for each distinct prime factor of ; the number , of these pseudo-dimensions is equal to the cardinality of the set:The full n-dimensional transform thus becomes μ-dimensional, with .

We may now permute the μ pseudo-dimensions so as to bring into contiguous position those corresponding to the same prime factor ; the m resulting groups of pseudo-dimensions are said to define p-primary' blocks. The initial transform is now written as a tensor product of m p-primary transforms, where transform j is onpoints [by convention, dimension i is not transformed if ]. These p-primary transforms may be computed, for instance, by multidimensional Cooley–Tukey factorization (Section 1.3.3.3.1), which is faster than the straightforward row–column method. The final results may then be obtained by reversing all the permutations used.

The extra gain with respect to the multidimensional Cooley–Tukey method is that there are no twiddle factors between p-primary pieces corresponding to different primes p.

The case where N is not diagonal has been examined by Guessoum & Mersereau (1986).

#### 1.3.3.3.2.3. Nesting of Winograd small FFTs

| top | pdf |

Suppose that the CRT has been used as above to map an n-dimensional DFT to a μ-dimensional DFT. For each [κ runs over those pairs (i, j) such that ], the Rader/Winograd procedure may be applied to put the matrix of the κth 1D DFT in the CBA normal form of a Winograd small FFT. The full DFT matrix may then be written, up to permutation of data and results, as

A well known property of the tensor product of matrices allows this to be rewritten asand thus to form a matrix in which the combined pre-addition, multiplication and post-addition matrices have been precomputed. This procedure, called nesting, can be shown to afford a reduction of the arithmetic operation count compared to the row–column method (Morris, 1978).

Clearly, the nesting rearrangement need not be applied to all μ dimensions, but can be restricted to any desired subset of them.

#### 1.3.3.3.2.4. The Nussbaumer–Quandalle algorithm

| top | pdf |

Nussbaumer's approach views the DFT as the evaluation of certain polynomials constructed from the data (as in Section 1.3.3.2.4). For instance, putting , the 1D N-point DFTmay be writtenwhere the polynomial Q is defined by

Let us consider (Nussbaumer & Quandalle, 1979) a 2D transform of size :By introduction of the polynomialsthis may be rewritten:

Let us now suppose that is coprime to N. Then has a unique inverse modulo N (denoted by ), so that multiplication by simply permutes the elements of and hencefor any function f over . We may thus write:whereSince only the value of polynomial at is involved in the result, the computation of may be carried out modulo the unique cyclotomic polynomial such that . Thus, if we define:we may write:or equivalently

For N an odd prime p, all nonzero values of are coprime with p so that the -point DFT may be calculated as follows:

 (1) form the polynomialsfor ; (2) evaluate for ; (3) put ; (4) calculate the terms for separately by

Step (1) is a set of p polynomial transforms' involving no multiplications; step (2) consists of p DFTs on p points each since ifthenstep (3) is a permutation; and step (4) is a p-point DFT. Thus the 2D DFT on points, which takes 2p p-point DFTs by the row–column method, involves only p-point DFTs; the other DFTs have been replaced by polynomial transforms involving only additions.

This procedure can be extended to n dimensions, and reduces the number of 1D p-point DFTs from for the row–column method to , at the cost of introducing extra additions in the polynomial transforms.

A similar algorithm has been formulated by Auslander et al. (1983) in terms of Galois theory.

| top | pdf |

#### 1.3.3.3.3.1. From local pieces to global algorithms

| top | pdf |

The mathematical analysis of the structure of DFT computations has brought to light a broad variety of possibilities for reducing or reshaping their arithmetic complexity. All of them are analytic' in that they break down large transforms into a succession of smaller ones.

These results may now be considered from the converse synthetic' viewpoint as providing a list of procedures for assembling them:

 (i) the building blocks are one-dimensional p-point algorithms for p a small prime; (ii) the low-level connectors are the multiplicative reindexing methods of Rader and Winograd, or the polynomial transform reindexing method of Nussbaumer and Quandalle, which allow the construction of efficient algorithms for larger primes p, for prime powers , and for p-primary pieces of shape ; (iii) the high-level connectors are the additive reindexing scheme of Cooley–Tukey, the Chinese remainder theorem reindexing, and the tensor product construction; (iv) nesting may be viewed as the glue' which seals all elements.

The simplest DFT may then be carried out into a global algorithm in many different ways. The diagrams in Fig. 1.3.3.1 illustrate a few of the options available to compute a 400-point DFT. They may differ greatly in their arithmetic operation counts.

 Figure 1.3.3.1 | top | pdf |A few global algorithms for computing a 400-point DFT. CT: Cooley–Tukey factorization. PF: prime factor (or Good) factorization. W: Winograd algorithm.

#### 1.3.3.3.3.2. Computer architecture considerations

| top | pdf |

To obtain a truly useful measure of the computational complexity of a DFT algorithm, its arithmetic operation count must be tempered by computer architecture considerations. Three main types of trade-offs must be borne in mind:

 (i) reductions in floating-point (f.p.) arithmetic count are obtained by reindexing, hence at the cost of an increase in integer arithmetic on addresses, although some shortcuts may be found (Uhrich, 1969; Burrus & Eschenbacher, 1981); (ii) reduction in the f.p. multiplication count usually leads to a large increase in the f.p. addition count (Morris, 1978); (iii) nesting can increase execution speed, but causes a loss of modularity and hence complicates program development (Silverman, 1977; Kolba & Parks, 1977).

Many of the mathematical developments above took place in the context of single-processor serial computers, where f.p. addition is substantially cheaper than f.p. multiplication but where integer address arithmetic has to compete with f.p. arithmetic for processor cycles. As a result, the alternatives to the Cooley–Tukey algorithm hardly ever led to particularly favourable trade-offs, thus creating the impression that there was little to gain by switching to more exotic algorithms.

The advent of new machine architectures with vector and/or parallel processing features has greatly altered this picture (Pease, 1968; Korn & Lambiotte, 1979; Fornberg, 1981; Swartzrauber, 1984):

 (i) pipelining equalizes the cost of f.p. addition and f.p. multiplication, and the ideal blend' of the two types of operations depends solely on the number of adder and multiplier units available in each machine; (ii) integer address arithmetic is delegated to specialized arithmetic and logical units (ALUs) operating concurrently with the f.p. units, so that complex reindexing schemes may be used without loss of overall efficiency.

Another major consideration is that of data flow [see e.g. Nawab & McClellan (1979)]. Serial machines only have few registers and few paths connecting them, and allow little or no overlap between computation and data movement. New architectures, on the other hand, comprise banks of vector registers (or `cache memory') besides the usual internal registers, and dedicated ALUs can service data transfers between several of them simultaneously and concurrently with computation.

In this new context, the devices described in Sections 1.3.3.2 and 1.3.3.3 for altering the balance between the various types of arithmetic operations, and reshaping the data flow during the computation, are invaluable. The field of machine-dependent DFT algorithm design is thriving on them [see e.g. Temperton (1983a,b,c, 1985); Agarwal & Cooley (1986, 1987)].

#### 1.3.3.3.3.3. The Johnson–Burrus family of algorithms

| top | pdf |

In order to explore systematically all possible algorithms for carrying out a given DFT computation, and to pick the one best suited to a given machine, attempts have been made to develop:

 (i) a high-level notation of describing all the ingredients of a DFT computation, including data permutation and data flow; (ii) a formal calculus capable of operating on these descriptions so as to represent all possible reorganizations of the computation; (iii) an automatic procedure for evaluating the performance of a given algorithm on a specific architecture.

Task (i) can be accomplished by systematic use of a tensor product notation to represent the various stages into which the DFT can be factored (reindexing, small transforms on subsets of indices, twiddle factors, digit-reversal permutations).

Task (ii) may for instance use the Winograd CBA normal form for each small transform, then apply the rules governing the rearrangement of tensor product and ordinary product × operations on matrices. The matching of these rearrangements to the architecture of a vector and/or parallel computer can be formalized algebraically [see e.g. Chapter 2 of Tolimieri et al. (1989)].

Task (iii) is a complex search which requires techniques such as dynamic programming (Bellman, 1958).

Johnson & Burrus (1983) have proposed and tested such a scheme to identify the optimal trade-offs between prime factor nesting and Winograd nesting of small Winograd transforms. In step (ii), they further decomposed the pre-addition matrix A and post-addition matrix C into several factors, so that the number of design options available becomes very large: the N-point DFT when N has four factors can be calculated in over 1012 distinct ways.

This large family of nested algorithms contains the prime factor algorithm and the Winograd algorithms as particular cases, but usually achieves greater efficiency than either by reducing the f.p. multiplication count while keeping the number of f.p. additions small.

There is little doubt that this systematic approach will be extended so as to incorporate all available methods of restructuring the DFT.

### References

Agarwal, R. C. & Cooley, J. W. (1986). Fourier transform and convolution subroutines for the IBM 3090 Vector facility. IBM J. Res. Dev. 30, 145–162.
Agarwal, R. C. & Cooley, J. W. (1987). Vectorized mixed radix discrete Fourier transform algorithms. Proc. IEEE, 75, 1283–1292.
Auslander, L., Feig, E. & Winograd, S. (1983). New algorithms for the multidimensional discrete Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 31, 388–403.
Auslander, L., Tolimieri, R. & Winograd, S. (1982). Hecke's theorem in quadratic reciprocity, finite nilpotent groups and the Cooley–Tukey algorithm. Adv. Math. 43, 122–172.
Bellman, R. (1958). Dynamic programming and stochastic control processes. Inf. Control, 1, 228–239.
Burrus, C. S. & Eschenbacher, P. W. (1981). An in-place, in-order prime factor FFT algorithm. IEEE Trans. Acoust. Speech Signal Process. 29, 806–817.
Fornberg, A. (1981). A vector implementation of the fast Fourier transform algorithm. Math. Comput. 36, 189–191.
Guessoum, A. & Mersereau, R. M. (1986). Fast algorithms for the multidimensional discrete Fourier transform. IEEE Trans. Acoust. Speech Signal Process. 34, 937–943.
Harris, D. B., McClellan, J. H., Chan, D. S. K. & Schuessler, H. W. (1977). Vector radix fast Fourier transform. Rec. 1977 IEEE Internat. Conf. Acoust. Speech Signal Process. pp. 548–551.
Johnson, H. W. & Burrus, C. S. (1983). The design of optimal DFT algorithms using dynamic programming. IEEE Trans. Acoust. Speech Signal Process. 31, 378–387.
Kolba, D. P. & Parks, T. W. (1977). A prime factor FFT algorithm using high-speed convolution. IEEE Trans. Acoust. Speech Signal Process. 25, 281–294.
Korn, D. G. & Lambiotte, J. J. Jr (1979). Computing the fast Fourier transform on a vector computer. Math. Comput. 33, 977–992.
Mersereau, R. & Speake, T. C. (1981). A unified treatment of Cooley–Tukey algorithms for the evaluation of the multidimensional DFT. IEEE Trans. Acoust. Speech Signal Process. 29, 1011–1018.
Mersereau, R. M. (1979). The processing of hexagonally sampled two-dimensional signals. Proc. IEEE, 67, 930–949.
Morris, R. L. (1978). A comparative study of time efficient FFT and WFTA programs for general purpose computers. IEEE Trans. Acoust. Speech Signal Process. 26, 141–150.
Nawab, H. & McClellan, J. H. (1979). Bounds on the minimum number of data transfers in WFTA and FFT programs. IEEE Trans. Acoust. Speech Signal Process. 27, 393–398.
Nussbaumer, H. J. & Quandalle, P. (1979). Fast computation of discrete Fourier transforms using polynomial transforms. IEEE Trans. Acoust. Speech Signal Process. 27, 169–181.
Pease, M. C. (1968). An adaptation of the fast Fourier transform for parallel processing. J. Assoc. Comput. Mach. 15, 252–264.
Rivard, G. E. (1977). Direct fast Fourier transform of bivariate functions. IEEE Trans. Acoust. Speech Signal Process. 25, 250–252.
Silverman, H. F. (1977). An introduction to programming the Winograd Fourier transform algorithm (WFTA). IEEE Trans. Acoust. Speech Signal Process. 25, 152–165.
Swartzrauber, P. N. (1984). FFT algorithms for vector computers. Parallel Comput. 1, 45–63.
Temperton, C. (1983a). Self-sorting mixed-radix fast Fourier transforms. J. Comput. Phys. 52, 1–23.
Temperton, C. (1983b). A note on the prime factor FFT algorithm. J. Comput. Phys. 52, 198–204.
Temperton, C. (1983c). Fast mixed-radix real Fourier transforms. J. Comput. Phys. 52, 340–350.
Temperton, C. (1985). Implementation of a self-sorting in place prime factor FFT algorithm. J. Comput. Phys. 58, 283–299.
Tolimieri, R., An, M. & Lu, C. (1989). Algorithms for Discrete Fourier Transform and Convolution. New York: Springer-Verlag.
Uhrich, M. L. (1969). Fast Fourier transforms without sorting. IEEE Trans. AU, 17, 170–172.