International Tables for Crystallography, Volume F: Crystallography of biological macromolecules. Edited by E. Arnold, D. M. Himmel and M. G. Rossmann. © International Union of Crystallography 2012
International Tables for Crystallography (2012). Vol. F, ch. 15.2, pp. 401–406
https://doi.org/10.1107/97809553602060000848
Chapter 15.2. Model phases: probabilities, bias and maps
^{a}Department of Haematology, University of Cambridge, CIMR, Wellcome Trust/MRC Building, Hills Road, Cambridge CB2 0XY, England
The optimal use of model phase information requires an estimate of its reliability, specifically the probability that various values of the phase angle are true. This chapter covers the importance of phase in model bias; structure-factor probability relationships; figure-of-merit weighting for model phases; map coefficients to reduce model bias; difference-map coefficients; refinement bias; and maximum-likelihood structure refinement.
The intensities of X-ray diffraction spots measured from a crystal give us only the amplitudes of the diffracted waves. To reconstruct a map of the electron density in the crystal, the unmeasured phase information is also required. In fact, the phases are much more important to the appearance of the map than the measured amplitudes. When phases are supplied by an atomic model, therefore, some degree of model bias is inevitable.
The optimal use of model phase information requires an estimate of its reliability, specifically the probability that various values of the phase angle are true. Such a probability distribution can be derived, starting first with the relationship between the structure factor (amplitude and phase) of the model and that of the true crystal structure. The phase probability distribution can then be obtained from this and used, for instance, to provide a figure-of-merit weighting that minimizes the r.m.s. error from the true electron density.
Even with figure-of-merit weighting, model-phased electron density is biased towards the model. The systematic bias component of model-phased map coefficients can be predicted, allowing the derivation of map coefficients that give electron-density maps with reduced model bias. With the help of a few simple assumptions, a correction for bias can also be made when different sources of phase information are combined.
Finally, the refinement of a model against the observed amplitudes allows a certain amount of overfitting of the data, which leads to an extra 'refinement bias'. Fortunately, the use of appropriate refinement strategies, including maximum-likelihood targets, can reduce the severity of this problem.
Dramatic illustrations of the importance of the phase have been published. For instance, Ramachandran & Srinivasan (1961) calculated an electron-density map using phases from one structure and amplitudes from another. In this map there were peaks at the positions of the atoms in the structure that contributed the phase information, but not in the structure that contributed the amplitudes. Similar calculations with two-dimensional Fourier transforms of photographs (Oppenheim & Lim, 1981; Read, 1997) showed that the phases of one completely overwhelmed the amplitudes of the other.
These examples, though dramatic, are not completely representative of the normal situation, where the structure contributing the phases is partially or even nearly correct. Nonetheless, model phases always contribute bias, so that the resulting map tends to bear too close a resemblance to the model.
The importance of the phase can be understood most easily in terms of Parseval's theorem, a result that is important to the understanding of many aspects of the Fourier transform and its use in crystallography. Parseval's theorem states that the mean-square value of the variable on one side of a Fourier transform is proportional to the mean-square value of the variable on the other side. Since the Fourier transform is additive, Parseval's theorem also applies to sums or differences.
If ρ_{1} and ρ_{2} are, for instance, the true electron density and the electron density of the model, respectively, Parseval's theorem tells us that the r.m.s. error in the electron density is proportional to the r.m.s. error in the structure factor: ⟨(ρ_{1} − ρ_{2})^{2}⟩ = (1/V^{2}) ∑_{h} |F_{1}(h) − F_{2}(h)|^{2}, where V is the unit-cell volume. (The structure-factor error is a vector error in the complex plane.)
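The proportionality can be checked numerically on a toy one-dimensional "unit cell". The following is a minimal sketch, assuming a density sampled on n grid points and a naive discrete Fourier transform; the function names are illustrative and the normalization (1/n^2 instead of 1/V^2) is the discrete analogue of the continuous result.

```python
import cmath
import math


def dft(values):
    """Naive discrete transform F(h) = sum_x rho(x) exp(2*pi*i*h*x/n)."""
    n = len(values)
    return [
        sum(v * cmath.exp(2j * math.pi * h * x / n) for x, v in enumerate(values))
        for h in range(n)
    ]


def mean_square_density_error(rho_true, rho_model):
    """Mean-square difference between two sampled densities."""
    n = len(rho_true)
    return sum((a - b) ** 2 for a, b in zip(rho_true, rho_model)) / n


def mean_square_structure_factor_error(rho_true, rho_model):
    """Sum of squared vector differences |F1 - F2|^2, scaled by 1/n^2.

    Discrete Parseval: sum_x |rho(x)|^2 = (1/n) * sum_h |F(h)|^2, so the
    mean-square density error equals (1/n^2) * sum_h |F1(h) - F2(h)|^2.
    """
    n = len(rho_true)
    f_true, f_model = dft(rho_true), dft(rho_model)
    return sum(abs(a - b) ** 2 for a, b in zip(f_true, f_model)) / n ** 2


if __name__ == "__main__":
    rho_true = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 2.0, 0.0]
    rho_model = [0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 0.0, 1.0]
    print(mean_square_density_error(rho_true, rho_model))
    print(mean_square_structure_factor_error(rho_true, rho_model))
```

The two printed values agree to rounding error, because the difference of two densities transforms to the vector difference of their structure factors.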
This understanding of error in electron-density maps explains why the phase is much more important than the amplitude in determining the appearance of an electron-density map. As illustrated in Fig. 15.2.2.1, a random choice of phase (from a uniform distribution of all possible phases) will generally give a larger error in the complex plane than a random choice of amplitude [from a Wilson (1949) distribution of amplitudes].
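This comparison can be mimicked with a small Monte Carlo experiment. The sketch below assumes acentric structure factors drawn from a complex normal (Wilson) distribution with unit variance; the function names and the number of trials are arbitrary choices for illustration. For each simulated reflection it replaces either the phase (by a uniform random phase) or the amplitude (by an independent Wilson amplitude) and accumulates the squared vector error.

```python
import cmath
import math
import random


def wilson_structure_factor(rng, sigma_n=1.0):
    """Draw an acentric structure factor: complex normal with <|F|^2> = sigma_n."""
    s = math.sqrt(sigma_n / 2.0)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def mean_square_errors(n_reflections=20000, seed=0):
    """Return (error from random phases, error from random amplitudes)."""
    rng = random.Random(seed)
    phase_error = 0.0
    amplitude_error = 0.0
    for _ in range(n_reflections):
        f_true = wilson_structure_factor(rng)
        amplitude, phase = abs(f_true), cmath.phase(f_true)
        # Correct amplitude, phase drawn uniformly from 0 to 2*pi.
        f_random_phase = cmath.rect(amplitude, rng.uniform(0.0, 2.0 * math.pi))
        # Correct phase, amplitude drawn independently from the Wilson distribution.
        f_random_amplitude = cmath.rect(abs(wilson_structure_factor(rng)), phase)
        phase_error += abs(f_true - f_random_phase) ** 2
        amplitude_error += abs(f_true - f_random_amplitude) ** 2
    return phase_error / n_reflections, amplitude_error / n_reflections


if __name__ == "__main__":
    by_phase, by_amplitude = mean_square_errors()
    print("random phase:", by_phase, "random amplitude:", by_amplitude)
```

Under these assumptions the expected mean-square error is 2 for random phases and 2 − π/2 for random amplitudes (in units of the mean intensity), so the random phases are more than four times as damaging.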
To use model phase information optimally, the probability distribution for the true phase (or, equivalently, the distribution of the error in the model phase) needs to be known. Such a distribution can be derived by first working out the probability distribution for the true structure factor (or the distribution of the vector difference between the model and true structure factors). Then the phase probability distribution is obtained by fixing the known value of the structure-factor amplitude and renormalizing.
A number of related structure-factor distributions have been derived, differing in the amount of information available about the structure and in the assumed form of errors in the model. These range from the Wilson distribution, which applies when none of the atomic positions is known, to a distribution that applies when there are a variety of sources of error in an atomic model.
For the Wilson distribution (Wilson, 1949), it is assumed that the atoms in a crystal structure in space group P1 are scattered randomly and independently through the unit cell. In fact, it is sufficient to make the much less restrictive assumption that the atoms are placed randomly with respect to the Bragg planes defined by the Miller indices. The assumption of independence is somewhat more problematic, since there are restrictions on the distances between atoms, large volumes of protein crystals are occupied by disordered solvent and many protein crystals display noncrystallographic symmetry; as discussed elsewhere (Vellieux & Read, 1997; Kleywegt & Read, 1997), the resulting relationships among structure factors are exploited implicitly in averaging and solvent-flattening procedures. The higher-order relationships among structure factors are used explicitly in direct methods for solving small-molecule structures and are being developed for use in protein structures (Bricogne, 1993). For the purposes of simpler relationships between the calculated and true structure factors for a single hkl, however, the lack of complete independence does not seem to create serious problems.
When atoms are placed randomly relative to the Bragg planes, the contribution of each atom to the structure factor will have a phase varying randomly from 0 to 2π. The overall structure factor can then be considered to be the result of a random walk in the complex plane, which can be treated as an application of the central limit theorem. The structure factor is the sum of the independent atomic scattering contributions, each of which has a probability distribution defined as a circle in the complex plane centred on the origin, with a radius of f_{j}. The centroid of this atomic distribution is at the origin, and the variance for each of the real and imaginary parts is f_{j}^{2}/2. The probability distribution of the structure factor that is the sum of these contributions is a two-dimensional Gaussian, the product of the one-dimensional Gaussians for the real and imaginary parts. Because the variances are equal in the real and imaginary directions, it can be simplified and expressed in terms of a single distribution parameter, Σ_{N}: p(F) = [1/(πΣ_{N})] exp(−|F|^{2}/Σ_{N}), where Σ_{N} = ∑_{j} f_{j}^{2}. Alternatively, and more simply, the structure-factor probability distribution can be considered as a complex normal distribution, arising from application of the central limit theorem to complex variables (Read, 2003).
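The random-walk picture is easy to simulate. In the following sketch each atom contributes a vector of fixed length (its scattering factor) with a uniformly random phase; the scattering-factor values, trial count and function names are illustrative assumptions. If the argument above holds, the variance of each of the real and imaginary parts should approach half the sum of the squared scattering factors, and the mean intensity should approach the full sum.

```python
import cmath
import math
import random


def random_walk_structure_factor(scattering_factors, rng):
    """Sum atomic contributions, each with a phase uniform on 0..2*pi."""
    return sum(
        f * cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi))
        for f in scattering_factors
    )


def wilson_statistics(scattering_factors, n_trials=5000, seed=1):
    """Return (variance of real part, variance of imaginary part, mean intensity).

    The centroid of the distribution is at the origin, so the variances are
    computed about zero.
    """
    rng = random.Random(seed)
    samples = [
        random_walk_structure_factor(scattering_factors, rng)
        for _ in range(n_trials)
    ]
    var_real = sum(s.real ** 2 for s in samples) / n_trials
    var_imag = sum(s.imag ** 2 for s in samples) / n_trials
    mean_intensity = sum(abs(s) ** 2 for s in samples) / n_trials
    return var_real, var_imag, mean_intensity


if __name__ == "__main__":
    # Hypothetical composition: 60 C-like, 20 N-like and 20 O-like scatterers.
    atoms = [6.0] * 60 + [7.0] * 20 + [8.0] * 20
    distribution_parameter = sum(f * f for f in atoms)
    print("sum of f^2:", distribution_parameter)
    print("simulated:", wilson_statistics(atoms))
```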
The Sim distribution (Sim, 1959), which is relevant when the positions of some of the atoms are known, has a very similar basis, except that the structure factor is now considered to arise from a random walk starting from the position of the structure factor corresponding to the known part, F_{P}. Atoms with known positions do not contribute to the variance, while each of the atoms with unknown positions (the 'Q' atoms) contributes f_{j}^{2}/2 to each of the real and imaginary parts, as in the Wilson distribution. The distribution parameter in this case is referred to as Σ_{Q}. The Sim distribution is a conditional probability distribution, depending on the value of F_{P}: p(F; F_{P}) = [1/(πΣ_{Q})] exp(−|F − F_{P}|^{2}/Σ_{Q}).
The Wilson (1949) and Woolfson (1956) distributions for space group P\bar{1} are obtained similarly, except that the random walks are along a line and the resulting Gaussian distributions are one-dimensional. (The Woolfson distribution is the centric equivalent of the Sim distribution.) For more complicated space groups, it is reasonable to assume that acentric reflections follow the P1 distribution and that centric reflections follow the P\bar{1} distribution. However, for any zone of the reciprocal lattice in which symmetry-related atoms are constrained to scatter in phase, the variances must be multiplied by the expected intensity factor, ε, for the zone, because the symmetry-related contributions are no longer independent.
In the Sim distribution, an atom is considered to be either exactly known or completely unknown in its position. These are extreme cases, since there will normally be varying degrees of uncertainty in the positions of various atoms in a model. The treatment can be generalized by allowing a probability distribution of coordinate errors for each atom. In this case, the centroid for the individual atomic contribution to the structure factor will no longer be obtained by multiplying by either zero or one. Averaged over the circle corresponding to possible phase errors, the centroid will generally be reduced in magnitude, as illustrated in Fig. 15.2.3.1. In fact, averaging to obtain the centroid is equivalent to weighting the atomic scattering contribution by the Fourier transform of the coordinate-error probability distribution, d_{j}. By the convolution theorem, this in turn is equivalent to convoluting the atomic density with the coordinate-error distribution. Intuitively, the atom is smeared over all of its possible positions. The weighting factor, d_{j}, is thus analogous to the thermal-motion term in the structure-factor expression.

Figure 15.2.3.1. Centroid of the structure-factor contribution from a single atom. The probability of a phase for the contribution is indicated by the thickness of the line.
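The shrinkage of the centroid can be reproduced numerically. The sketch below assumes, purely for illustration, a Gaussian coordinate error with standard deviation `sigma_x` (in Å) along the direction of the diffraction vector, and a reflection at resolution d = 1/s. The Monte Carlo average of the unit phase vector is compared with the Fourier transform of the Gaussian, exp(−2π²s²σ_x²); the function names are hypothetical.

```python
import cmath
import math
import random


def centroid_weight_monte_carlo(sigma_x, s, n_samples=50000, seed=2):
    """Average exp(2*pi*i*s*dx) over Gaussian coordinate errors dx."""
    rng = random.Random(seed)
    total = 0j
    for _ in range(n_samples):
        dx = rng.gauss(0.0, sigma_x)
        total += cmath.exp(2j * math.pi * s * dx)
    return total / n_samples


def centroid_weight_analytic(sigma_x, s):
    """Fourier transform of a one-dimensional Gaussian error distribution."""
    return math.exp(-2.0 * math.pi ** 2 * s ** 2 * sigma_x ** 2)


if __name__ == "__main__":
    sigma_x, s = 0.3, 1.0 / 3.0  # 0.3 A error, 3 A resolution (assumed values)
    print(centroid_weight_monte_carlo(sigma_x, s))
    print(centroid_weight_analytic(sigma_x, s))
```

The simulated centroid is (nearly) real and shorter than 1: the phase of the atomic contribution is unchanged on average, but its expected magnitude is scaled down, more strongly at higher resolution.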
The variances for the individual atomic contributions will differ in magnitude, but if there are a sufficient number of independent sources of error, we can invoke the central limit theorem again and assume that the probability distribution for the structure factor will be a Gaussian centred on ∑_{j} d_{j} f_{j} exp(2πi h·x_{j}). If the coordinate-error distribution is Gaussian, and if each atom in the model is subject to the same errors, the resulting structure-factor probability distribution is the Luzzati (1952) distribution. In this special case, d_{j} = D for all atoms, where D is the Fourier transform of a Gaussian and behaves like the application of an overall B factor.
The Wilson, Sim, Luzzati and variable-error distributions have very similar forms, because they are all Gaussians arising from the application of the central limit theorem. The central limit theorem is valid under many circumstances; even when there are errors in position, scattering factor and B factor, as well as missing atoms, a similar distribution still applies. As long as these sources of error are independent, the true structure factor will have a Gaussian distribution centred on DF_{C} (Fig. 15.2.3.2), where D now includes effects of all sources of error, as well as compensating for errors in the overall scale and B factor (Read, 1990): p(F; F_{C}) = [1/(πεσ_{Δ}^{2})] exp[−|F − DF_{C}|^{2}/(εσ_{Δ}^{2})] in the acentric case, where σ_{Δ}^{2} = Σ_{N} − D^{2}Σ_{P}, ε is the expected intensity factor and Σ_{P} is the Wilson distribution parameter for the model.

Figure 15.2.3.2. Schematic illustration of the general structure-factor distribution, relevant in the case of any set of independent random errors in the atomic model.
For centric reflections, the scattering differences are distributed along a line, so the probability distribution is a one-dimensional Gaussian.
Srinivasan (1966) showed that the Sim and Luzzati distributions could be combined into a single distribution that had a particularly elegant form when expressed in terms of normalized structure factors, or E values. This functional form still applies to the general distribution that reflects a variety of sources of error; the only difference is the interpretation placed on the parameters (Read, 1990). If F and F_{C} are replaced by the corresponding E values, a parameter σ_{A} plays the role of D, and σ_{Δ}^{2} reduces to (1 − σ_{A}^{2}). [The parameter σ_{A} is equivalent to D after correction for model completeness; σ_{A} = D(Σ_{P}/Σ_{N})^{1/2}.] When the structure factors are normalized, overall scale and B-factor effects are also eliminated. The parameter σ_{A} that characterizes this probability distribution varies as a function of resolution. It must be deduced from the amplitudes |F_{O}| and |F_{C}|, since the phase (thus the phase difference) is unknown.
A general approach to estimating parameters for probability distributions is to maximize a likelihood function. The likelihood function is the overall joint probability of making the entire set of observations, which is a function of the desired parameters. The parameters that maximize the probability of making the set of observations are the most consistent with the data. The idea of using maximum likelihood to estimate model phase errors was introduced by Lunin & Urzhumtsev (1984), who gave a treatment that was valid for space group P1. In a more general treatment that applies to higher-symmetry space groups, allowance is made for the statistical effects of crystal symmetry (centric zones and differing expected intensity factors) (Read, 1986).
The σ_{A} values are estimated by maximizing the joint probability of making the set of observations of |F_{O}|. If the structure factors are all assumed to be independent, the joint probability distribution is the product of all the individual distributions. The assumption of independence is not completely justified in theory, but the results are fairly accurate in practice. The required probability distribution, p(|F_{O}|; |F_{C}|), is derived from p(F; F_{C}) by integrating over all possible phase differences. The form of this distribution, which is given in other publications (Read, 1986, 1990), differs for centric and acentric reflections. (It is important to note that although the distributions for structure factors are Gaussian, the distributions for amplitudes obtained by integrating out the phase are not.) It is more convenient to deal with a sum than a product, so the log likelihood function is maximized instead. In the program SIGMAA, reciprocal space is divided into spherical shells, and a value of the parameter σ_{A} is refined for each resolution shell. Details of the algorithm are given elsewhere (Read, 1986).
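As a rough illustration of the likelihood calculation for one resolution shell, the sketch below works with normalized amplitudes of acentric reflections only and ignores measurement error and expected intensity factors. It assumes the amplitude distribution obtained by integrating a complex normal distribution (centred on σ_A E_C, variance 1 − σ_A²) over the phase, which involves the modified Bessel function I0. A simple grid search stands in for a proper optimizer; this is not the algorithm used in SIGMAA, and all function names are hypothetical.

```python
import math
import random


def log_bessel_i0(x):
    """log I0(x) for x >= 0: power series for small x, asymptotic form otherwise."""
    if x < 15.0:
        term, total, k = 1.0, 1.0, 1
        quarter_x2 = 0.25 * x * x
        while term > 1e-15 * total:
            term *= quarter_x2 / (k * k)
            total += term
            k += 1
        return math.log(total)
    series = 1.0 + 1.0 / (8.0 * x) + 9.0 / (128.0 * x ** 2) + 225.0 / (3072.0 * x ** 3)
    return x - 0.5 * math.log(2.0 * math.pi * x) + math.log(series)


def acentric_log_likelihood(sigma_a, e_obs, e_calc):
    """Log likelihood of a shell, omitting terms that do not depend on sigma_a."""
    variance = 1.0 - sigma_a ** 2
    total = 0.0
    for eo, ec in zip(e_obs, e_calc):
        total += (
            -math.log(variance)
            - (eo * eo + sigma_a ** 2 * ec * ec) / variance
            + log_bessel_i0(2.0 * sigma_a * eo * ec / variance)
        )
    return total


def estimate_sigma_a(e_obs, e_calc, step=0.01):
    """Grid search over 0 < sigma_a < 1 for the maximum-likelihood value."""
    candidates = [step * i for i in range(1, int(round(1.0 / step)))]
    return max(candidates, key=lambda s: acentric_log_likelihood(s, e_obs, e_calc))


def simulate_shell(sigma_a, n_reflections, seed):
    """Simulate |E_obs| and |E_calc| for a shell with a known sigma_a."""
    rng = random.Random(seed)
    s = math.sqrt(0.5)
    e_obs, e_calc = [], []
    for _ in range(n_reflections):
        ec = complex(rng.gauss(0.0, s), rng.gauss(0.0, s))
        delta = complex(rng.gauss(0.0, s), rng.gauss(0.0, s))
        e_true = sigma_a * ec + math.sqrt(1.0 - sigma_a ** 2) * delta
        e_obs.append(abs(e_true))
        e_calc.append(abs(ec))
    return e_obs, e_calc


if __name__ == "__main__":
    observed, calculated = simulate_shell(0.8, 1000, seed=3)
    print("estimated sigma_A:", estimate_sigma_a(observed, calculated))
```

Working with the logarithm of I0 directly avoids overflow when σ_A approaches 1 and the Bessel argument becomes large.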
The resolution shells must be thick enough to contain several hundred to a thousand reflections each, in order to provide σ_{A} estimates with a sufficiently small statistical error. A larger number of shells (fewer reflections per shell) can be used for refined structures, since estimates of σ_{A} become more precise as the true value approaches 1. If there are sufficient reflections per shell, the estimates will vary smoothly with resolution. As discussed below, the smooth variation with resolution can also be exploited through a restraint that allows σ_{A} values to be estimated from fewer reflections.
For unrefined models, the disagreement between the true and calculated structure factors is dominated by the effect of defects in the model, so that the effect of errors in measuring the observed intensities or amplitudes can be neglected (Read, 1986). As the model improves, the effect of measurement error becomes relevant. Three approaches have been used to account for measurement error. In one approach, first proposed by Green (1979) in the context of isomorphous replacement, measurement error is treated as a complex error, and the variance term, σ_{Δ}^{2}, is increased accordingly (Murshudov et al., 1997; Bricogne & Irwin, 1996). This has the virtue of simplicity and it works well. Another approach is to use a Gaussian approximation to the likelihood function expressed in terms of amplitudes, and then increase the variance of this Gaussian (Pannu & Read, 1996). Finally, the convolution of the likelihood function (expressed in intensities) and a Gaussian measurement error can be evaluated by using a series approximation (Pannu & Read, 1996). This has the benefit that it allows for the presence of negative net intensity measurements, but it is more complicated to implement and is not widely used.
Blow & Crick (1959) and Sim (1959) showed that the electron-density map with the least r.m.s. error is calculated from centroid structure factors. This conclusion follows from Parseval's theorem, because the centroid structure factor (its probability-weighted average value or expected value) minimizes the r.m.s. error of the structure factor. Since the structure-factor distribution is symmetrical about F_{C}, the expected value of F will have the same phase as F_{C}, but the averaging around the phase circle will reduce its magnitude if there is any uncertainty in the phase value (Fig. 15.2.4.1). We treat the reduction in magnitude by applying a weighting factor called the figure of merit, m, which is equivalent to the expected value of the cosine of the phase error.
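The figure of merit can be computed directly from its definition as the expected cosine of the phase error. The sketch below assumes normalized amplitudes and the Gaussian structure-factor distribution described earlier: fixing the observed amplitude and renormalizing gives, for an acentric reflection, a phase-error probability proportional to exp(X cos Δα) with X = 2σ_A|E_O||E_C|/(1 − σ_A²). For a centric reflection only two phases are possible, and the same argument gives m = tanh(X/2). The function names and the numerical integration scheme are illustrative choices.

```python
import math


def acentric_figure_of_merit(e_obs, e_calc, sigma_a, n_steps=720):
    """Expected cosine of the phase error, by midpoint-rule integration."""
    x = 2.0 * sigma_a * e_obs * e_calc / (1.0 - sigma_a ** 2)
    numerator = 0.0
    denominator = 0.0
    for k in range(n_steps):
        delta = -math.pi + (k + 0.5) * 2.0 * math.pi / n_steps
        # Subtracting 1 in the exponent rescales the weights to avoid overflow.
        weight = math.exp(x * (math.cos(delta) - 1.0))
        numerator += math.cos(delta) * weight
        denominator += weight
    return numerator / denominator


def centric_figure_of_merit(e_obs, e_calc, sigma_a):
    """Probability of the model phase minus probability of its opposite."""
    return math.tanh(sigma_a * e_obs * e_calc / (1.0 - sigma_a ** 2))


if __name__ == "__main__":
    for sigma_a in (0.2, 0.6, 0.9):
        print(
            sigma_a,
            acentric_figure_of_merit(1.0, 1.0, sigma_a),
            centric_figure_of_merit(1.0, 1.0, sigma_a),
        )
```

Strong reflections with well-matched amplitudes and a high σ_A receive weights near 1, while reflections that are weak in either the observed or the calculated set are down-weighted.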
A figure-of-merit weighted map, calculated with coefficients m|F_{O}| exp(iα_{C}), has the least r.m.s. error from the true map. According to the normal statistical (minimum variance) criteria, then, it is the best map. However, such a map will suffer from model bias; if its purpose is to allow the detection and repair of errors in the model, this is a serious qualitative defect. Fortunately, it is possible to predict the systematic errors leading to model bias and to make some correction for them.
Main (1979) dealt with this problem in the case of a perfect partial structure. Since the relationships among structure factors are the same in the general case of a partial structure with various errors, once DF_{C} is substituted for F_{P}, all that is required to apply Main's results more generally is a change of variables (Read, 1986, 1990).
In Main's approach, the cosine law is used to introduce the cosine of the phase error, which is converted into a figure of merit by taking expected values. Some manipulations allow us to solve for the figure-of-merit weighted map coefficient, which is approximated as a linear combination of the true structure factor and the model structure factor (Main, 1979; Read, 1986): m|F_{O}| exp(iα_{C}) ≈ F/2 + DF_{C}/2. Finally, we can solve for an approximation to the true structure factor, giving map coefficients from which the systematic model bias component has been removed: F ≈ (2m|F_{O}| − D|F_{C}|) exp(iα_{C}).
A similar analysis for centric structure factors shows that there is no systematic model bias in figure-of-merit weighted map coefficients, so no bias correction is needed in the centric case.
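The two cases can be collected in a small helper. This is a sketch only: it assumes the acentric bias-reduced coefficient has amplitude 2m|F_O| − D|F_C| with the model phase, that the centric coefficient is the plain figure-of-merit weighted amplitude m|F_O|, and that m and D have already been estimated for the reflection. The function name and argument layout are hypothetical.

```python
import cmath


def bias_reduced_coefficient(f_obs, f_calc, m, d, centric=False):
    """Map coefficient with reduced model bias for one reflection.

    f_obs  : observed amplitude |F_O|
    f_calc : complex calculated structure factor F_C
    m, d   : figure of merit and D for this reflection
    """
    model_phase = cmath.phase(f_calc)
    if centric:
        # No systematic bias component is expected for centric reflections.
        amplitude = m * f_obs
    else:
        amplitude = 2.0 * m * f_obs - d * abs(f_calc)
    # A negative amplitude is equivalent to flipping the phase by pi.
    return cmath.rect(amplitude, model_phase)


if __name__ == "__main__":
    f_calc = cmath.rect(80.0, 0.5)
    print(bias_reduced_coefficient(100.0, f_calc, m=0.8, d=0.9))
    print(bias_reduced_coefficient(100.0, f_calc, m=0.8, d=0.9, centric=True))
```

For a perfect model (m = D = 1 and |F_O| = |F_C|) the acentric coefficient reduces to F_C itself, as expected.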
When model phase information is combined with, for instance, multiple isomorphous replacement (MIR) phase information, there will still be model bias in the acentric map coefficients, to the extent that the model influences the final phases. However, it is inappropriate to continue using the same map coefficients to reduce model bias, because some phases could be determined almost completely by the MIR phase information. It makes much more sense to have map coefficients that reduce to the coefficients appropriate for either model or MIR phases, in extreme cases where there is only one source of phase information, and that vary smoothly between those extremes.
Map coefficients that satisfy these criteria (even if they are not rigorously derived) are implemented in the program SIGMAA. The resulting maps are reasonably successful in reducing model bias. Two assumptions are made: (1) the model-bias component in the figure-of-merit weighted map coefficient, m_{com}|F_{O}| exp(iα_{com}), is proportional to the influence that the model phase has had on the combined phase; and (2) the relative influence of a source of phase information can be measured by the information content, H (Guiasu, 1977), of the phase probability distribution. The first assumption corresponds to the idea that the figure-of-merit weighted map coefficient is a linear combination of the MIR and model phase cases: m_{com}|F_{O}| exp(iα_{com}) ≈ (1 − w/2)F + (w/2)DF_{C}, where w = H_{C}/(H_{C} + H_{MIR}) and H = ∫ p(α) ln[p(α)/p_{0}(α)] dα, with p_{0}(α) = 1/(2π).
Solving for an approximation to the true F gives the following expression, which can be seen to reduce appropriately when w is 0 (no model influence) or 1 (no MIR influence): F ≈ [2m_{com}|F_{O}| exp(iα_{com}) − wDF_{C}]/(2 − w).
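A sketch of how such coefficients might be evaluated is given below. It assumes that the information content of a phase distribution is measured relative to a uniform prior, H = ∫ p(α) ln[2π p(α)] dα, that the model influence is w = H_C/(H_C + H_MIR), and that the bias-reduced coefficient is [2 m_com|F_O| exp(iα_com) − w D F_C]/(2 − w). The function names and the sampling of the phase distributions on a uniform grid are illustrative assumptions, not a description of the SIGMAA code.

```python
import cmath
import math


def phase_information(probabilities):
    """Information content of a phase distribution sampled on a uniform grid.

    The samples may be unnormalized; they are normalized here so that the
    distribution integrates to 1 over 0..2*pi.
    """
    step = 2.0 * math.pi / len(probabilities)
    norm = sum(probabilities) * step
    info = 0.0
    for p in probabilities:
        p /= norm
        if p > 0.0:
            info += p * math.log(p * 2.0 * math.pi) * step
    return info


def model_influence(h_model, h_mir):
    """Weight w: 0 means no model influence, 1 means no MIR influence."""
    total = h_model + h_mir
    return h_model / total if total > 0.0 else 0.0


def combined_map_coefficient(f_obs, m_com, alpha_com, d, f_calc, w):
    """Bias-reduced coefficient for combined model and experimental phases."""
    fom_weighted = cmath.rect(m_com * f_obs, alpha_com)
    return (2.0 * fom_weighted - w * d * f_calc) / (2.0 - w)


if __name__ == "__main__":
    grid = [2.0 * math.pi * k / 360 for k in range(360)]
    p_model = [math.exp(3.0 * math.cos(a - 1.0)) for a in grid]
    p_mir = [math.exp(1.0 * math.cos(a - 1.4)) for a in grid]
    w = model_influence(phase_information(p_model), phase_information(p_mir))
    print("w =", w)
    print(combined_map_coefficient(100.0, 0.7, 1.1, 0.9, cmath.rect(80.0, 1.0), w))
```

With w = 0 the function returns the figure-of-merit weighted combined coefficient unchanged, and with w = 1 it returns twice that coefficient minus DF_C, the model-phased limit.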
In principle, since the distribution of observed and calculated amplitudes is determined largely by the coordinate errors of the model, one can determine whether a particular coordinate-error distribution is consistent with the amplitudes. Unfortunately, it turns out that the coordinate errors cannot be deduced unambiguously, because many distributions of coordinate errors are consistent with a particular distribution of amplitudes (Read, 1990).
If the simplifying assumption is made that all the atoms are subject to a single error distribution, then the parameter D (and thus the related parameter σ_{A}) varies with resolution as the Fourier transform of the error distribution, as discussed above. Two related methods to estimate overall coordinate error are based on the even more specific assumption that the coordinate-error distribution is Gaussian: the Luzzati plot (Luzzati, 1952) and the σ_{A} plot (Read, 1986). Unfortunately, the central assumption is not justified; atoms that scatter more strongly (heavier atoms or atoms with lower B factors) tend to have smaller coordinate errors than weakly scattering atoms. The proportion of the structure factor contributed by well ordered atoms increases at high resolution, so that the structure factors agree better at high resolution than if there were a single error distribution.
It is often stated, optimistically, that the Luzzati plot provides an upper bound to the coordinate error, because the observation errors in |F_{O}| have been ignored. This is misleading, because there are other effects that cause the Luzzati and σ_{A} plots to give underestimates (Read, 1990). Chief among these are the correlation of errors and scattering power and the overfitting of the amplitudes in structure refinement (discussed below). These estimates of overall coordinate error should not be interpreted too literally; at best, they provide a comparative measure.
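With these caveats in mind, the idea behind a plot of this kind can be sketched as a straight-line fit. The code below assumes a single isotropic Gaussian error distribution with r.m.s. coordinate error r, for which the Fourier transform is D = exp[−(2π²/3) r²/d²] at resolution d, and assumes σ_A = D(Σ_P/Σ_N)^{1/2}, so that ln σ_A is linear in 1/d². The parameterization and function name are illustrative and need not match the conventions of the cited papers.

```python
import math


def fit_sigma_a_plot(resolutions, sigma_a_values):
    """Least-squares line through ln(sigma_A) versus 1/d^2.

    Returns (r.m.s. coordinate error in the units of d, Sigma_P/Sigma_N).
    """
    xs = [1.0 / d ** 2 for d in resolutions]
    ys = [math.log(s) for s in sigma_a_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # slope = -(2*pi^2/3) * r^2 under the Gaussian assumption.
    rms_error = math.sqrt(max(0.0, -3.0 * slope / (2.0 * math.pi ** 2)))
    # intercept = 0.5 * ln(Sigma_P / Sigma_N).
    completeness = math.exp(2.0 * intercept)
    return rms_error, completeness


if __name__ == "__main__":
    shells = [6.0, 4.0, 3.0, 2.5, 2.0]  # shell resolutions in A (made-up values)
    values = [0.93, 0.90, 0.85, 0.80, 0.72]  # made-up sigma_A estimates
    print(fit_sigma_a_plot(shells, values))
```

Because the single-error-distribution assumption is not justified, the fitted value is best treated as a comparative indicator rather than as a measured coordinate error.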
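The computer program SIGMAA (Read, 1986) has been developed to implement the results described here. Apart from the two types of map coefficient discussed above, two types of difference-map coefficient can also be produced:
(1) model-phased difference map: (m|F_{O}| − D|F_{C}|) exp(iα_{C});
(2) general difference map: m_{com}|F_{O}| exp(iα_{com}) − DF_{C}.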
The general difference map, it should be noted, uses a vector difference between the figure-of-merit weighted combined phase coefficient (the 'best' estimate of the true structure factor) and the calculated structure factor. When additional phase information is available, it should provide a clearer picture of the errors in the model.
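The distinction between an amplitude difference and a vector difference can be made concrete as follows. The sketch assumes a model-phased difference coefficient (m|F_O| − D|F_C|) exp(iα_C) and a general difference coefficient m_com|F_O| exp(iα_com) − DF_C; the function names are hypothetical, and m, D, m_com and α_com are taken as already known.

```python
import cmath


def model_phased_difference(f_obs, f_calc, m, d):
    """Amplitude difference, carried on the model phase."""
    return cmath.rect(m * f_obs - d * abs(f_calc), cmath.phase(f_calc))


def general_difference(f_obs, m_com, alpha_com, d, f_calc):
    """Vector difference between the best structure-factor estimate and D*F_C."""
    return cmath.rect(m_com * f_obs, alpha_com) - d * f_calc


if __name__ == "__main__":
    f_calc = cmath.rect(80.0, 1.0)
    print(model_phased_difference(100.0, f_calc, m=0.8, d=0.9))
    # Experimental phases have shifted the combined phase away from the model.
    print(general_difference(100.0, m_com=0.85, alpha_com=1.3, d=0.9, f_calc=f_calc))
```

When the combined phase equals the model phase and m_com equals m, the two coefficients coincide; otherwise the general coefficient acquires a component perpendicular to F_C, which carries the information that the model phase itself is in error.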
Similar algorithms have been implemented as part of modern refinement programs such as Refmac (Murshudov et al., 1997) and phenix.refine (Afonine et al., 2005).
The structure-factor probabilities discussed above depend on the atoms having independent errors (or at least a sufficient number of groups of atoms having independent errors). Unfortunately, this assumption breaks down when a structure is refined against the observed diffraction data. Few protein crystals diffract to sufficiently high resolution to provide a large number of observations for every refinable parameter. The refinement problem is, therefore, not sufficiently overdetermined, so it is possible to overfit the data. If there is an error in the model that is outside the range of convergence of the refinement method, it is possible to introduce compensating errors in the rest of the structure to give a better, and misleading, agreement in the amplitudes. As a result, the phase accuracy (hence the weighting factors m and D) is overestimated, and model bias is poorly removed.
There is another interpretation to the problem of refinement bias. As Silva & Rossmann (1985) point out, minimizing the r.m.s. difference between the amplitudes |F_{O}| and |F_{C}| is equivalent (by Parseval's theorem) to minimizing the difference between the model electron density and the density corresponding to the map coefficients |F_{O}| exp(iα_{C}); a lower residual is obtained either by making the model look more like the true structure, or by making the model-phased map look more like the model through the introduction of systematic phase errors.
A number of strategies are available to reduce the degree or impact of refinement bias. The overestimation of phase accuracy can be avoided by obtaining unbiased estimates of the σ_{A} values from the cross-validation data, which were originally introduced to compute R_{free} as an unbiased indicator of refinement progress (Brünger, 1992). Because of the high statistical error of σ_{A} estimates computed from small numbers of reflections, reliable values can only be obtained by exploiting the smoothness of the σ_{A} curve as a function of resolution. This can be achieved either by fitting a functional form (as done in Refmac; Murshudov et al., 1997) or by adding a penalty to points that deviate from the line connecting their neighbours (as done in CNS; Brünger et al., 1998).
If errors are suspected in certain parts of the structure, 'omit refinement' (in which the questionable parts are omitted from the model) can be a very effective way to eliminate refinement bias in those regions (James et al., 1980; Hodel et al., 1992).
If MIR or MAD (multiwavelength anomalous dispersion) phases are available, combined phase maps tend to suffer less from refinement bias, depending on the extent to which the experimental phases influence the combined phases. Finally, it is always a good idea to refer occasionally to the original MIR or MAD map, which cannot suffer at all from model bias or refinement bias.
Acknowledgements
This chapter is a revised version of a contribution to Methods in Enzymology (Read, 1997).
References
Afonine, P. V., Grosse-Kunstleve, R. W. & Adams, P. D. (2005). CCP4 Newsletter, No. 42, contribution 8.
Blow, D. M. & Crick, F. H. C. (1959). The treatment of errors in the isomorphous replacement method. Acta Cryst. 12, 794–802.
Bricogne, G. (1993). Direct phase determination by entropy maximization and likelihood ranking: status report and perspectives. Acta Cryst. D49, 37–60.
Bricogne, G. & Irwin, J. (1996). In Proceedings of the CCP4 Study Weekend. Macromolecular Refinement, edited by E. Dodson, M. Moore, A. Ralph & S. Bailey, pp. 85–92. Warrington: Daresbury Laboratory.
Brünger, A. T. (1992). Free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature (London), 355, 472–474.
Brünger, A. T., Adams, P. D., Clore, G. M., DeLano, W. L., Gros, P., Grosse-Kunstleve, R. W., Jiang, J.-S., Kuszewski, J., Nilges, M., Pannu, N. S., Read, R. J., Rice, L. M., Simonson, T. & Warren, G. L. (1998). Crystallography & NMR System: a new software suite for macromolecular structure determination. Acta Cryst. D54, 905–921.
Green, E. A. (1979). A new statistical model for describing errors in isomorphous replacement data: the case of one derivative. Acta Cryst. A35, 351–359.
Guiasu, S. (1977). Information Theory with Applications. London: McGraw-Hill.
Hodel, A., Kim, S.-H. & Brünger, A. T. (1992). Model bias in macromolecular crystal structures. Acta Cryst. A48, 851–858.
James, M. N. G., Sielecki, A. R., Brayer, G. D., Delbaere, L. T. J. & Bauer, C.-A. (1980). Structures of product and inhibitor complexes of Streptomyces griseus protease A at 1.8 Å resolution – a model for serine protease catalysis. J. Mol. Biol. 144, 43–88.
Kleywegt, G. J. & Read, R. J. (1997). Ways & means: not your average density. Structure, 5, 1557–1569.
Lunin, V. Yu. & Urzhumtsev, A. G. (1984). Improvement of protein phases by coarse model modification. Acta Cryst. A40, 269–277.
Luzzati, V. (1952). Traitement statistique des erreurs dans la détermination des structures cristallines. Acta Cryst. 5, 802–810.
Main, P. (1979). A theoretical comparison of the β, γ′ and 2F_{o} − F_{c} syntheses. Acta Cryst. A35, 779–785.
Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Refinement of macromolecular structures by the maximum-likelihood method. Acta Cryst. D53, 240–255.
Oppenheim, A. V. & Lim, J. S. (1981). The importance of phase in signals. Proc. IEEE, 69, 529–541.
Pannu, N. S. & Read, R. J. (1996). Improved structure refinement through maximum likelihood. Acta Cryst. A52, 659–668.
Ramachandran, G. N. & Srinivasan, R. (1961). An apparent paradox in crystal structure analysis. Nature (London), 190, 159–161.
Read, R. J. (1986). Improved Fourier coefficients for maps using phases from partial structures with errors. Acta Cryst. A42, 140–149.
Read, R. J. (1990). Structure-factor probabilities for related structures. Acta Cryst. A46, 900–912.
Read, R. J. (1997). Model phases: probabilities and bias. Methods Enzymol. 277, 110–128.
Read, R. J. (2003). New ways of looking at experimental phasing. Acta Cryst. D59, 1891–1902.
Silva, A. M. & Rossmann, M. G. (1985). The refinement of southern bean mosaic virus in reciprocal space. Acta Cryst. B41, 147–157.
Sim, G. A. (1959). The distribution of phase angles for structures containing heavy atoms. II. A modification of the normal heavy-atom method for non-centrosymmetrical structures. Acta Cryst. 12, 813–815.
Srinivasan, R. (1966). Weighting functions for use in the early stages of structure analysis when a part of the structure is known. Acta Cryst. 20, 143–144.
Vellieux, F. M. D. & Read, R. J. (1997). Noncrystallographic symmetry averaging in phase refinement and extension. Methods Enzymol. 277, 18–53.
Wilson, A. J. C. (1949). The probability distribution of X-ray intensities. Acta Cryst. 2, 318–321.
Woolfson, M. M. (1956). An improvement of the 'heavy-atom' method of solving crystal structures. Acta Cryst. 9, 804–810.