
Section 8.5.3. Influential data points

E. Prince^a and C. H. Spiegelman^b

^a NIST Center for Neutron Research, National Institute of Standards and Technology, Gaithersburg, MD 20899, USA, and ^b Department of Statistics, Texas A&M University, College Station, TX 77843, USA


Section 8.4.4 discusses the influence of individual data points on the estimation of parameters, and how to identify the data points that should be measured with particular care in order to make the most precise estimates of particular parameters. The same properties that make these influential data points most effective in reducing the uncertainty of a parameter estimate when the model is a correct predictor for the observations also give them the greatest potential for introducing bias if there is a flaw in the model or, correspondingly, if they are subject to systematic error. Reviews of procedures for studying the effects of influential data points and outliers have been given by Beckman & Cook (1983), by Chatterjee & Hadi (1986), and by Belsley (1991).

The effects of possible systematic error can be studied by identifying influential data points and then observing the effects of deleting them, one by one, from the refinement. The deletion of a data point should affect the standard uncertainty of an estimate, but should not cause a shift in its mean that is more than a small multiple of the resulting standard uncertainty. As in Section 8.4.4, we define the design matrix, {\bi A}, by

A_{ij} = \partial M_i({\bf x})/\partial x_j, \eqno(8.5.3.1)

where M_i({\bf x}) is the model function for the ith data point, and {\bf x} is the vector of adjustable parameters. Let {\bi R} be the upper triangular Cholesky factor of the weight matrix, so that {\bi W} = {\bi R}^T{\bi R}, and define the weighted design matrix by {\bi Z} = {\bi R}{\bi A} and the weighted vector of observations by {\bf y}' = {\bi R}{\bf y}. The least-squares estimate of {\bf x} is then

\widehat{{\bf x}} = ({\bi Z}^T{\bi Z})^{-1}{\bi Z}^T{\bf y}', \eqno(8.5.3.2)

and the vector of predicted values is

\widehat{{\bf y}}' = {\bi Z}({\bi Z}^T{\bi Z})^{-1}{\bi Z}^T{\bf y}' = {\bi P}{\bf y}', \eqno(8.5.3.3)

where {\bi P} is the projection, or hat, matrix. A diagonal element, P_{ii}, of {\bi P} is a measure of the leverage, that is, of the relative influence, of the ith data point, and therefore of the sensitivity of the estimates of the elements of {\bf x} to an error in the measurement of that data point. P_{ii} lies in the range 0 \leq P_{ii} \leq 1 and has average value p/n, where p is the number of parameters and n the number of data points, so that data points with values of P_{ii} greater than 2p/n can be considered particularly influential.
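As a minimal sketch of the leverage computation just described, the following Python/NumPy fragment forms the weighted design matrix {\bi Z} = {\bi R}{\bi A} from a Cholesky factorization of the weight matrix and extracts the diagonal of the hat matrix. The names (leverages, A, W) are illustrative assumptions, not part of the original text; the design matrix A and weight matrix W are taken as given.

    import numpy as np

    def leverages(A, W):
        """Return the hat-matrix diagonal P_ii for a weighted least-squares design.

        A : (n, p) design matrix, A[i, j] = dM_i(x)/dx_j  [eq. (8.5.3.1)]
        W : (n, n) positive-definite weight matrix
        """
        # Upper triangular Cholesky factor R, so that W = R^T R
        R = np.linalg.cholesky(W).T
        Z = R @ A                            # weighted design matrix Z = R A
        V = np.linalg.inv(Z.T @ Z)           # V = (Z^T Z)^{-1}
        # Diagonal of P = Z (Z^T Z)^{-1} Z^T, i.e. P_ii = z_i V z_i^T
        return np.einsum('ij,jk,ik->i', Z, V, Z)

    # Usage: flag points with leverage above twice the average value p/n
    # P_diag = leverages(A, W)
    # n, p = A.shape
    # influential = np.where(P_diag > 2 * p / n)[0]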

Let {\bi H} = {\bi Z}^T{\bi Z} be the normal-equations matrix, let {\bi V} = {\bi H}^{-1} be the estimated variance–covariance matrix, and let {\bf q} = {\bi Z}^T{\bf y}', so that \widehat{{\bf x}} = {\bi V}{\bf q}. Let {\bf z}_i be the ith row of {\bi Z}, and denote by {\bi Z}^{(i)}, {\bi H}^{(i)}, {\bi V}^{(i)}, {\bf q}^{(i)} and \widehat{{\bf x}}^{(i)} the corresponding matrices and vectors computed with the ith data point deleted from the data set. We wish to find large values of |\widehat{x}_j - \widehat{x}_j^{(i)}| / [V_{jj}^{(i)}]^{1/2}, so we need to compute {\bi V}^{(i)} and \widehat{{\bf x}}^{(i)}. By a derivation similar to that for (8.4.4.7), it can be shown (Fedorov, 1972; Prince & Nicholson, 1985) that

{\bi V}^{(i)} = {\bi V} + {{\bi V}{\bf z}_i^T{\bf z}_i{\bi V} \over 1 - {\bf z}_i{\bi V}{\bf z}_i^T} = {\bi V} + {{\bi V}{\bf z}_i^T{\bf z}_i{\bi V} \over 1 - P_{ii}}. \eqno(8.5.3.4)

Note that, if P_{ii} = 1, all elements of {\bi V}^{(i)} become infinite, implying that {\bi H}^{(i)} is singular; if such a data point is deleted, the solution is no longer determinate. Now

\widehat{{\bf x}}^{(i)} = {\bi V}^{(i)}{\bf q}^{(i)} \eqno(8.5.3.5)

and

{\bf q}^{(i)} = {\bf q} - y_i'{\bf z}_i^T, \eqno(8.5.3.6)

so that, once {\bi V} and \widehat{{\bf x}} have been computed, it is a straightforward and inexpensive additional computation to determine whether any parameter has been strongly influenced, and therefore potentially biased, by the inclusion of any particular data point in the refinement. If there is any reason to be concerned about possible systematic error, the leverage of every data point included in the refinement should be computed, and the effect of deleting each point with leverage greater than 2p/n should be observed.
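Once {\bi V}, {\bf q} and \widehat{{\bf x}} are available from the full refinement, equations (8.5.3.4)–(8.5.3.6) make each deletion a cheap rank-one update. The following sketch, under the same illustrative assumptions as above, screens every point with leverage greater than 2p/n and reports the scaled parameter shifts |\widehat{x}_j - \widehat{x}_j^{(i)}| / [V_{jj}^{(i)}]^{1/2}.

    import numpy as np

    def deletion_diagnostics(Z, y_prime):
        """Scaled parameter shifts from deleting each high-leverage point.

        Z       : (n, p) weighted design matrix
        y_prime : (n,) weighted observations
        Returns {i: array of |x_j - x_j^(i)| / sqrt(V_jj^(i))}.
        """
        n, p = Z.shape
        V = np.linalg.inv(Z.T @ Z)                    # V = H^{-1}
        q = Z.T @ y_prime
        x_hat = V @ q                                 # full-data estimate
        P_diag = np.einsum('ij,jk,ik->i', Z, V, Z)    # leverages P_ii
        shifts = {}
        for i in np.where(P_diag > 2.0 * p / n)[0]:
            if np.isclose(P_diag[i], 1.0):
                continue          # P_ii = 1: H^(i) is singular, skip
            Vz = V @ Z[i]         # V z_i^T as a 1-D vector
            V_i = V + np.outer(Vz, Vz) / (1.0 - P_diag[i])   # eq. (8.5.3.4)
            q_i = q - y_prime[i] * Z[i]                      # eq. (8.5.3.6)
            x_i = V_i @ q_i                                  # eq. (8.5.3.5)
            shifts[i] = np.abs(x_hat - x_i) / np.sqrt(np.diag(V_i))
        return shifts

By the criterion stated above, a shift of more than a small multiple of unity in any component would suggest that the corresponding data point deserves scrutiny.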

References

Beckman, R. J. & Cook, R. D. (1983). Outlier..........s. Technometrics, 25, 119–149.
Belsley, D. A. (1991). Conditioning diagnostics. New York: John Wiley & Sons.
Chatterjee, S. & Hadi, A. S. (1986). Influential observations, high leverage points, and outliers in linear regression. Stat. Sci. 1, 379–393.
Fedorov, V. V. (1972). Theory of optimal experiments, translated by W. J. Studden & E. M. Klimko. New York: Academic Press.
Prince, E. & Nicholson, W. L. (1985). Influence of individual reflections on the precision of parameter estimates in least squares refinement. Structure and statistics in crystallography, edited by A. J. C. Wilson, pp. 183–195. Guilderland, NY: Adenine Press.