Tables for
Volume C
Mathematical, physical and chemical tables
Edited by E. Prince

International Tables for Crystallography (2006). Vol. C, ch. 8.1, pp. 680-681

Section 8.1.2. Principles of least squares

E. Princea and P. T. Boggsb

aNIST Center for Neutron Research, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA, and bScientific Computing Division, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA

8.1.2. Principles of least squares

| top | pdf |

The method of least squares may be formulated as follows: Given a set of n observations, [y_i\,(i=1,2,\ldots, n)], that are measurements of quantities that can be described by differentiable model functions, [M_i({\bf x})], where x is a vector of parameters, [x_j (\,j=1,2,\ldots, p)], find the values of the parameters for which the sum [S=\textstyle\sum\limits _{i=1}^nw_i[y_i-M_i({\bf x})]^2 \eqno (]is minimum. Here, [w_i] represents a weight assigned to the ith observation. The values of the parameters that give the minimum value of S are called estimates of the parameters, and a function of the data that locates the minimum is an estimator. A necessary condition for S to be a minimum is for the gradient to vanish, which gives a set of simultaneous equations, the normal equations, of the form [{\partial S \over \partial x_j}=-2\sum _{i=1}^nw_i[y_i-M_i({\bf x})] \displaystyle{\partial M_i({\bf x}) \over \partial x_j}=0. \eqno (]The model functions, [M_i({\bf x})], are, in general, nonlinear, and there are no direct ways to solve these systems of equations. Iterative methods for solving them are discussed in Section 8.1.4[link]. Much of the analysis of results, however, is based on the assumption that linear approximations to the model functions are good approximations in the vicinity of the minimum, and we shall therefore begin with a discussion of linear least squares.

To express linear relationships, it is convenient to use matrix notation. Let M(x) and y be column vectors whose ith elements are Mi(x) and [y_{i}]. Similarly, let b be a vector and A be a matrix such that a linear approximation to the ith model function can be written [M_{i}({\bf x})=b_{i}+\textstyle\sum\limits_{j=1}^{p}A_{ij}x_{j}. \eqno (]Equations ([link] can be written, in matrix form, [{\bi M}({\bf x})={\bf b}+{\bi A}{\bf x}, \eqno (]and, for this linear model, ([link] becomes [S=[({\bf y}-{\bf b})-{\bi A}{\bf x}]^{T}{\bi W}[({\bf y}-{\bf b})-{\bi A}{\bf x}], \eqno (]where W is a diagonal matrix whose diagonal elements are [W_{ii}=w_{i}]. In this notation, the normal equations ([link] can be written [{\bi A}^{T}{\bi WA}{\bf x}={\bi A}^{T}{\bi W}({\bf y}-{\bf b}), \eqno (]and their solution is [\widehat {{\bf x}}=({\bi A}^{T}{\bi WA})^{-1}{\bi A}^{T}{\bi W}({\bf y}-{\bf b}). \eqno (]If [W_{ii}\, \gt \, 0] for all i, and A has full column rank, then ATWA will be positive definite, and S will have a unique minimum at [{\bf x} =\widehat {{\bf x}}]. The matrix [{\bi H}=({\bi A}^{T}{\bi WA})^{-1}{\bi A}^{ T}{\bi W}] is a p × n matrix that relates the n-dimensional observation space to the p-dimensional parameter space and is known as the least-squares estimator; because each element of [\widehat {{\bf x}}] is a linear function of the observations, it is a linear estimator. [Note that, in actual practice, the matrix H is not actually evaluated, except, possibly, in very small problems. Rather, the linear system ATWAx = ATW(yb) is solved using the methods of Section 8.1.3[link].]

The least-squares estimator has some special properties in statistical analysis. Suppose that the elements of y are experimental observations drawn at random from populations whose means are given by the model, M(x), for some unknown x, which we wish to estimate. This may be written [\left \langle {\bf y}-{\bf b}\right \rangle ={\bi A}{\bf x}. \eqno (]The expected value of the least-squares estimate is [\eqalignno{\left \langle \widehat {{\bf x}}\right \rangle &=\left \langle ({\bi A}^T{\bi WA})^{-1}{\bi A}^T{\bi W}({\bf y}-{\bf b})\right \rangle \cr &=({\bi A}^T{\bi WA})^{-1}{\bi A}^T{\bi W}\left \langle {\bf y}-{\bf b}\right \rangle \cr &=({\bi A}^T{\bi WA})^{-1}{\bi A}^T{\bi WA}{\bf x} \cr &={\bf x}. & (}]

If the expected value of an estimate is equal to the variable to be estimated, the estimator is said to be unbiased. Equation ([link] shows that the least-squares estimator is an unbiased estimator for x, independent of W, provided only that y is an unbiased estimate of M(x), the matrix ATWA is nonsingular, and the elements of W are constants independent of y and M(x). Let [{\bi V}_{{\bf x}}] and [{\bi V}_{{\bf y}}] be the variance–covariance matrices for the joint p.d.f.s of the elements of x and y, respectively. Then, [{\bi V}_{{\bf x}}={\bi HV}_{{\bf y}}{\bi H}^T]. Let G be the matrix [({\bi A}^T{\bi V}_{{\bf y}}^{-1}{\bi A})^{-1}{\bi A}^T{\bi V}^{-1}], so that [\widehat {{\bf x}}={\bi G}({\bf y}-{\bf b})] is the particular least-squares estimate for which [{\bi W}={\bi V}_{{\bf y}}^{-1}]. Then, [{\bi V}_{{\bf x}}={\bi GV}_{{\bf y}}{\bi G}^T]. If [{\bi V}_{{\bf y}}] is positive definite, its lower triangular Cholesky factor, L, exists, so that [{\bi LL}^T={\bi V}_{{\bf y}}]. [If V is diagonal, L is also diagonal, with [L_{ii}=({\bi V}_{{\bf y}})_{ii}^{1/2}.]] It is readily verified that the matrix product [[({\bi H}-{\bi G}){\bi L}][({\bi H}-{\bi G}){\bi L}]^T={\bi HV}_{{\bf y}}{\bi H}^T-{\bi GV}_{{\bf y}}{\bi G}^T], but the diagonal elements of this product are the sums of squares of the elements of rows of (HG)L, and are therefore greater than or equal to zero. Therefore, the diagonal elements of Vx, which are the variances of the marginal p.d.f.s of the elements of [\widehat {{\bf x}}], are minimum when [{\bi W} = {\bi V}_{\bf y}^{-1}].

Thus, the least-squares estimator is unbiased for any positive-definite weight matrix, W, but the variances of the elements of the vector of estimated parameters are minimized if [{\bi W} = {\bi V}_{\bf y}^{-1}]. [Note also that [{\bi V}_{{\bf x}}=({\bi A}^T{\bi WA})^{-1}] if, and only if, [{\bi W} = {\bi V}_{\bf y}^{-1}].] For this reason, the least-squares estimator with weights proportional to the reciprocals of the variances of the observations is referred to as the best linear unbiased estimator for the parameters of a model describing those observations. (These specific results are included in a more general result known as the Gauss–Markov theorem.)

The analysis up to this point has assumed that the model is linear, that is that the expected values of the observations can be expressed by [\langle {\bf y}\rangle = {\bf b} +{\bi A}{\bf x}], where A is some matrix. In crystallography, of course, the model is highly nonlinear, and this assumption is not valid. The principles of linear least squares can be extended to nonlinear model functions by first finding, by numerical methods, a point in parameter space, [{\bf x}_0], at which the gradient vanishes and then expanding the model functions about that point in Taylor's series, retaining only the linear terms. Equation ([link] then becomes [{\bi M}({\bf x})\approx {\bi M}({\bf x}_0)+{\bi A}({\bf x}-{\bf x}_0), \eqno (]where [A_{ij}=\partial M_i({\bf x})/\partial x_j] evaluated at x = x0. Because we have already found the least-squares solution, the estimate [\eqalignno{\widehat {{\bf x}} &={\bf x}_0+({\bi A}^T{\bi WA})^{-1}{\bi A}^T{\bi W}[{\bf y}-{\bi M}({\bf x}_0)] \cr &={\bf x}_0+{\bi H}[{\bf y}-{\bi M}({\bf x}_0)] &(}]reduces to [\widehat {{\bf x}}={\bf x}_0]. It is important, however, not to confuse [{\bf x}_0], which is a convenient origin, with [\widehat {{\bf x}}], which is a random variable describable by a joint p.d.f. with mean [{\bf x}_0] and a variance–covariance matrix [{\bi V}_{{\bf x}}={\bi HV}_{{\bf y}}{\bi H}^T], reducing to [({\bi A}^T{\bi V}_{{\bf y}}^{-1}{\bi A})^{-1}] when [{\bi W}={\bi V}_{{\bf y}}^{-1}].

This variance–covariance matrix is the one appropriate to the linear approximation given in ([link], and it is valid (and the estimate is unbiased) only to the extent that the approximation is a good one. A useful criterion for an adequate approximation (Fedorov, 1972[link]) is, for each j and k, [\eqalignno{ \left | \sum _{i=1}^{n}w_{i}\sigma _{i} \displaystyle{\partial ^{2}M_{i} ({\bf x}_{0}) \over \partial x_{j}\partial x_{k}}\right | &\ll \left (\left \{\sum _{i=1}^{n}w_{i}\left [\displaystyle{\partial M_{i}({\bf x}_{0}) \over \partial x_{j}}\right] ^{2}\right \} \right. \cr&\quad\ \times \left. \left \{ \sum _{i=1}^{n}w_{i}\left [\displaystyle {\partial M_{i}({\bf x}_{0}) \over \partial x_{k}}\right] ^{2}\right \} \right) ^{1/2}, & (}]where σi is the estimated standard deviation or standard uncertainty (Schwarzenbach, Abrahams, Flack, Prince & Wilson, 1995[link]) of [y_{i}]. This criterion states that the curvature of S(y, x) in a region whose size is of order σ in observation space is small; it ensures that the effect of second-derivative terms in the normal-equations matrix on the eigenvalues and eigenvectors of the matrix is negligible. [For a further discussion and some numerical tests of alternatives, see Donaldson & Schnabel (1986[link]).]

The process of refinement can be viewed as the construction of a conditional p.d.f. of a set of model parameters, x, given a set of observations, y. An important expression for this p.d.f. is derived from two equivalent expressions for the joint p.d.f. of x and y: [\Phi _{J}({\bf x},{\bf y})=\Phi _{C}({\bf x}|{\bf y})\Phi _{M}({\bf y})=\Phi _{C}({\bf y}|{\bf x})\Phi _{M}({\bf x}). \eqno (]Provided [\Phi _{M}({\bf y})\, \gt \, 0], the conditional p.d.f. we seek can be written [\Phi _{C}({\bf x}|{\bf y})=\Phi _{C}({\bf y}|{\bf x})\Phi _{M}({\bf x})/\Phi _{M}({\bf y}). \eqno (]Here, the factor [[1/\Phi _{M}({\bf y})]] is the factor that is required to normalize the p.d.f. [\Phi _{C}({\bf y}|{\bf x})] is the conditional probability of observing a set of values of y as a function of x. When the observations have already been made, however, this can also be considered a density function for x that measures the likelihood that those particular values of y would have been observed for various values of x. It is therefore frequently written [\ell ({\bf x}|{\bf y})], and ([link] becomes [\Phi _{C}({\bf x}|{\bf y})=c\ell ({\bf x}|{\bf y})\Phi _{M}({\bf x}), \eqno (]where [c=[1/\Phi _{M}({\bf y})]] is the normalizing constant. [\Phi _{M}({\bf x})], the marginal p.d.f. of x in the absence of any additional information, incorporates all previously available information concerning x, and is known as the prior p.d.f., or, frequently, simply as the prior of x. Similarly, ΦC(x|y) is the posterior p.d.f., or the posterior, of x. The relation in ([link] and ([link] was first stated in the eighteenth century by Thomas Bayes, and it is therefore known as Bayes's theorem (Box & Tiao, 1973[link]). Although its validity has never been in serious question, its application has divided statisticians into two vehemently disputing camps, one of which, the frequentists, considers that Bayesian methods give nonobjective results, while the other, the Bayesians, considers that only by careful construction of a `noninformative' prior can true objectivity be achieved (Berger & Wolpert, 1984[link]).

Diffraction data, in general, contain no phase information, so the likelihood function for the structure factor, F, given a value of observed intensity, will have a value significantly different from zero in an annular region of the complex plane with a mean radius equal to |F|. Because this is insufficient information with which to determine a crystal structure, a prior p.d.f. is constructed in one (or some combination) of two ways. Either the prior knowledge that electron density is non negative is used to construct a joint p.d.f. of amplitudes and phases, given amplitudes for all reflections and phases for a few of them (direct methods), or chemical knowledge and intuition are used to construct a trial structure from which structure factors can be calculated, and the phase of Fcalc is assigned to Fobs. Both of these procedures can be considered to be applications of Bayes's theorem. In fact, Fcalc for a refined structure can be considered a Bayesian estimate of F.


Berger, J. O. & Wolpert, R. L. (1984). The likelihood principle. Hayward, CA: Institute of Mathematical Statistics.
Box, G. E. P. & Tiao, G. C. (1973). Bayesian inference in statistical analysis. Reading, MA: Addison-Wesley.
Donaldson, J. R. & Schnabel, R. B. (1986). Computational experience with confidence regions and confidence intervals for nonlinear least squares. Computer science and statistics. Proceedings of the Seventeenth Symposium on the Interface, edited by D. M. Allen, pp. 83–91. New York: North-Holland.
Fedorov, V. V. (1972). Theory of optimal experiments, translated by W. J. Studden & E. M. Klimko. New York: Academic Press.
Schwarzenbach, D., Abrahams, S. C., Flack, H. D., Prince, E. & Wilson, A. J. C. (1995). Statistical descriptors in crystallography. II. Report of a Working Group on Expression of Uncertainty in Measurement. Acta Cryst. A51, 565–569.

to end of page
to top of page