International Tables for Crystallography, Volume F: Crystallography of biological macromolecules. Edited by M. G. Rossmann and E. Arnold. © International Union of Crystallography 2006

International Tables for Crystallography (2006). Vol. F, ch. 11.2, pp. 212–217

https://doi.org/10.1107/97809553602060000675

## Chapter 11.2. Integration of macromolecular diffraction data

In this chapter the integration of macromolecular diffraction data from two-dimensional area detectors is described. Data integration refers to the process of obtaining estimates of diffracted intensities (and their standard deviations) from the raw images recorded by an X-ray detector. When collecting data, a decision has to be taken about the magnitude of the angular rotation of the crystal during the recording of each image: the rotation per image can be comparable to, or greater than, the angular reflection range of a typical reflection (coarse ϕ slicing), or it can be much less than the reflection width (fine ϕ slicing). The latter approach allows the use of three-dimensional profile fitting and, providing that the detector is relatively noise-free, improves the quality of the resulting data by minimizing the contribution of the X-ray background to the total measured intensity. Methods of integration are described and integration by simple summation and by profile fitting is discussed.

Data integration refers to the process of obtaining estimates of diffracted intensities (and their standard deviations) from the raw images recorded by an X-ray detector. As two-dimensional (2D) area detectors are almost universally used to collect macromolecular diffraction data, only this type of detector will be considered in the following analysis.

When collecting data with a 2D area detector, a decision has to be taken about the magnitude of the angular rotation of the crystal during the recording of each image. Two distinct modes of operation are possible: the rotation per image can be comparable to, or greater than, the angular reflection range of a typical reflection (coarse ϕ slicing), or it can be much less than the reflection width (fine ϕ slicing). The latter approach allows the use of three-dimensional profile fitting and, providing that the detector is relatively noise-free, improves the quality of the resulting data by minimizing the contribution of the X-ray background to the total measured intensity. However, there are significant overheads associated with recording, storing and processing the relatively large number of images that are required. Three-dimensional profile fitting is described in Chapter 11.3 and will not be discussed here.

Only the integration procedure itself will be described in detail in this chapter. However, in order to obtain the highest quality data possible from a given set of images, there are a number of parameters that need to be determined in advance of, or during, the integration. The most important of these are the unit-cell parameters, which should be determined to an accuracy of a few parts in a thousand (or better). Post-refinement procedures (Winkler *et al.*, 1979; Rossmann *et al.*, 1979), which make use of the estimated ϕ centroids of observed spots rather than their detector coordinates, generally provide more accurate estimates than methods based on the spot positions. This is because spot positions are affected by residual spatial distortions (after applying appropriate corrections) and the cell parameters are correlated with the crystal-to-detector distance, which is not always accurately known. For either method, it is necessary to include data from widely separated regions of reciprocal space (ideally ϕ values 90° apart) in order to determine all unit-cell parameters accurately. This is particularly important for lower-symmetry space groups.

The crystal orientation also needs to be known to an accuracy that corresponds to a few per cent of the reflection width. For crystals with low mosaicity (*e.g.* 0.1°) this corresponds to a hundredth of a degree or better. Fortunately, it is a feature of post refinement that the error in determining the orientation is typically a few per cent of the reflection width, and so this condition can generally be met. It is important to allow for movement of the crystal by continuously updating the crystal orientation during integration. This is even true when using cryo-cooled crystals, as the magnetic couplings that attach the pin (holding the crystal) to the goniometer head are not strong enough to prevent small movements, particularly with the high angular rotation rates employed on intense synchrotron beamlines. Non-orthogonality of the incident X-ray beam and the rotation axis (if not allowed for) or an off-centre crystal will also give rise to apparent changes in crystal orientation with spindle rotation.

The crystal mosaicity can be estimated by visual inspection and refined by post refinement. Refined values are quite reliable when the mosaic spread is less than about 0.5°, but become more dependent on the rocking-curve model for the high mosaicities that are often associated with frozen crystals. The presence of diffuse scatter, which appears as haloes around the Bragg diffraction spots, presents further difficulties in determining the correct mosaic spread. When processing coarse-sliced images it is preferable to overestimate the mosaic spread slightly (rather than underestimate it). This will result in an increase in random errors (by adding in the X-ray background from an image on which the spot is not actually present), whereas using too small a value can give systematic errors (by underestimating the number of images on which the spot lies).

Detector calibration is essential for high data quality. Both the spatial distortion and the non-uniformity of response of the detector must be accurately known, and it is equally important that these corrections are stable over the timescale of the experiment (and preferably for much longer).

Finally, the crystal-to-detector distance, the detector orientation and the direct-beam position must be refined and continuously updated during integration, using observed spot positions. The crystal-to-detector distance can vary during data collection if the crystal is not exactly centred on the rotation axis, and the direct-beam position can move after a beam refill at a synchrotron. For image-plate detectors with two (or more) plates, the direct-beam position and detector distance often differ slightly for different plates.

With appropriate care, it is normally possible to predict reflection positions on the detector to an accuracy of 20–30 µm, or a fraction of the pixel size, particularly for highly collimated X-ray beams available at synchrotron sources. This level of accuracy is necessary to minimize possible systematic errors, particularly in the case of profile fitting.

There are two quite distinct procedures available for determining the integrated intensities: summation integration and profile fitting. Summation integration involves simply adding the pixel values for all pixels lying within the area of a spot, and then subtracting the estimated background contribution to the same pixels. Profile fitting (Diamond, 1969; Ford, 1974; Rossmann, 1979) assumes that the actual spot shape or profile is known (in two or three dimensions) and the intensity is derived by finding the scale factor that, when applied to the known (or standard) profile, gives the best fit to the observed spot profile. In practice, profile fitting requires two separate steps: the determination of the standard profiles and the evaluation of the profile-fitting intensities. As will be shown later, profile fitting results in a reduction in the random error associated with weak intensities, but offers no improvement for very high intensities.

X-ray scattering from air, the sample holder and the specimen itself gives rise to a general background in the images which has to be subtracted in order to obtain the Bragg intensities. Ideally, the background should be measured for the same pixels used to record the Bragg diffraction spot, but this is not usually practical and the background is determined using pixels immediately adjacent to the spot. In practice, the pixels to be used for the determination of the background (background pixels) and those to be used for evaluating the intensity (peak pixels) are defined using a 'measurement box'. This is a rectangular box of pixels centred on the predicted spot position. Each pixel within the box is classified as being a background or a peak pixel (or neither). This mask can either be defined by the user, or the classification can be made automatically by the program. An example of a possible measurement-box definition is given in Fig. 11.2.4.1. The background parameters NRX, NRY and NC can be optimized automatically by maximizing the ratio of the intensity divided by its standard deviation, in a manner analogous to that described by Lehmann & Larsen (1974). It is generally assumed that the background can be adequately modelled as a plane, and the plane constants are determined using the background pixels. This allows the background to be estimated for the peak pixels, so that the background-corrected intensity can be calculated.

The background plane constants *a*, *b*, *c* are determined by minimizing

$$R_1 = \sum_{i=1}^{n} w_i\,(\rho_i - a p_i - b q_i - c)^2,$$

where $\rho_i$ is the total counts at the pixel with coordinates $(p_i, q_i)$ with respect to the centre of the measurement box, and the summation is over the *n* background pixels. $w_i$ is a weight which should ideally be the inverse of the variance of $\rho_i$. Assuming that the variance is determined by counting statistics, this gives

$$w_i = 1/[G\,E(\rho_i)],$$

where *G* is the gain of the detector, which converts pixel counts to equivalent X-ray photons, and $E(\rho_i)$ is the expectation value of the background counts $\rho_i$. In practice, the variation in background across the measurement box is usually sufficiently small that all weights can be considered to be equal.

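This gives the following equations for *a*, *b* and *c*, as given in Rossmann (1979),

$$
\begin{pmatrix}
\sum p^2 & \sum pq & \sum p \\
\sum pq & \sum q^2 & \sum q \\
\sum p & \sum q & n
\end{pmatrix}
\begin{pmatrix} a \\ b \\ c \end{pmatrix}
=
\begin{pmatrix} \sum p\rho \\ \sum q\rho \\ \sum \rho \end{pmatrix},
$$

where all summations are over the *n* background pixels.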
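As an illustration, the equal-weight normal equations can be solved directly for a small measurement box. The Python sketch below is not taken from any integration program: the function names `det3` and `fit_background_plane` and the `(p, q, rho)` tuple layout are assumptions made for this example, and the test data are a synthetic, noise-free plane.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_background_plane(pixels):
    """Equal-weight least-squares plane rho = a*p + b*q + c.

    pixels: list of (p, q, rho) tuples; p and q are measured from the
    centre of the measurement box (layout assumed for this sketch).
    """
    n = len(pixels)
    sp = sum(p for p, q, rho in pixels)
    sq = sum(q for p, q, rho in pixels)
    spp = sum(p * p for p, q, rho in pixels)
    sqq = sum(q * q for p, q, rho in pixels)
    spq = sum(p * q for p, q, rho in pixels)
    normal = [[spp, spq, sp], [spq, sqq, sq], [sp, sq, n]]
    rhs = [sum(p * rho for p, q, rho in pixels),
           sum(q * rho for p, q, rho in pixels),
           sum(rho for p, q, rho in pixels)]
    d = det3(normal)
    if abs(d) < 1e-12:
        raise ValueError("background pixels do not define a plane")
    solution = []
    for k in range(3):  # Cramer's rule: replace column k by the right-hand side
        mk = [row[:] for row in normal]
        for r in range(3):
            mk[r][k] = rhs[r]
        solution.append(det3(mk) / d)
    return tuple(solution)  # (a, b, c)


# Background frame of a 7 x 7 box carrying a synthetic, noise-free plane.
frame = [(p, q, 0.5 * p - 0.25 * q + 10.0)
         for p in range(-3, 4) for q in range(-3, 4)
         if abs(p) == 3 or abs(q) == 3]
a, b, c = fit_background_plane(frame)
```

For this noise-free input the fit recovers the plane used to generate the data, which is a convenient sanity check before applying the routine to real pixel values.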

It is not unusual for the diffraction pattern to display features other than the Bragg diffraction spots from the crystal of interest. Possible causes are the presence of a satellite crystal or twin component, white-radiation streaks, cosmic rays or zingers. In order to minimize their effect on the determination of the background plane constants, the following outlier rejection algorithm is employed:

The rationale for using a subset of the pixels with the lowest pixel values in step (1) is that the presence of zingers or cosmic rays, or a strongly diffracting satellite crystal, can distort the initial calculation of the background plane so much that it becomes difficult to identify the true outliers. Such features will normally only affect a small percentage of the background pixels and will invariably give higher than expected pixel counts. Selecting a subset with the lowest pixel values will facilitate identification of the true outliers. The initial bias in the resulting plane constant *c* due to this procedure will be corrected in step (3). Poisson statistics are used to evaluate the standard deviations used in outlier rejection, and the standard deviation used in step (2) is increased to allow for the choice of background pixels in step (1).
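The three steps referred to above can be sketched as follows. This is a minimal illustration, not the published algorithm: the fraction of pixels kept in step (1) (`fraction=0.8`), the rejection threshold (`cutoff=3.0`), the floor of one count on the expected value and all function names are assumptions. The sketch also does not inflate the standard deviation in step (2) to compensate for the biased choice of pixels in step (1).

```python
import math


def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_plane(pixels):
    """Equal-weight least-squares plane rho = a*p + b*q + c."""
    s = lambda f: sum(f(p, q, r) for p, q, r in pixels)
    normal = [[s(lambda p, q, r: p * p), s(lambda p, q, r: p * q), s(lambda p, q, r: p)],
              [s(lambda p, q, r: p * q), s(lambda p, q, r: q * q), s(lambda p, q, r: q)],
              [s(lambda p, q, r: p), s(lambda p, q, r: q), len(pixels)]]
    rhs = [s(lambda p, q, r: p * r), s(lambda p, q, r: q * r), s(lambda p, q, r: r)]
    d = det3(normal)
    out = []
    for k in range(3):  # Cramer's rule
        mk = [row[:] for row in normal]
        for i in range(3):
            mk[i][k] = rhs[i]
        out.append(det3(mk) / d)
    return tuple(out)


def background_with_rejection(pixels, gain=1.0, fraction=0.8, cutoff=3.0):
    # Step 1: provisional plane from the lowest-valued background pixels.
    ranked = sorted(pixels, key=lambda t: t[2])
    subset = ranked[:max(3, int(fraction * len(ranked)))]
    a, b, c = fit_plane(subset)
    # Step 2: test every background pixel against the provisional plane,
    # using a Poisson standard deviation sqrt(gain * expected counts).
    accepted = []
    for p, q, rho in pixels:
        expected = a * p + b * q + c
        sigma = math.sqrt(gain * max(expected, 1.0))  # floor is an assumption
        if abs(rho - expected) <= cutoff * sigma:
            accepted.append((p, q, rho))
    # Step 3: final plane from all accepted pixels.
    return fit_plane(accepted), len(pixels) - len(accepted)


# Synthetic frame with one 'zinger' at the corner pixel (3, 3).
frame = [(p, q, 0.5 * p + 10.0)
         for p in range(-3, 4) for q in range(-3, 4)
         if abs(p) == 3 or abs(q) == 3]
frame[-1] = (3, 3, 500.0)
(a, b, c), n_rejected = background_with_rejection(frame)
```

In this example the single contaminated pixel is excluded in step (1) because of its high value, flagged in step (2) and omitted from the final fit.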

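The summation integration intensity $I_s$ is given by

$$I_s = \sum_{i=1}^{m} (\rho_i - a p_i - b q_i - c),$$

where the summation is over the *m* pixels in the peak region of the measurement box. If the peak region has *mm* symmetry, this simplifies to

$$I_s = \sum_{i=1}^{m} (\rho_i - c).$$

To evaluate the standard deviation, this can be written as

$$I_s = \sum_{i=1}^{m} \rho_i - (m/n)\sum_{j=1}^{n} \rho_j,$$

where the second summation is over the *n* background pixels.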

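The variance in $I_s$ is

$$\sigma_{I_s}^2 = \sum_{i=1}^{m} \sigma_i^2 + (m/n)^2 \sum_{j=1}^{n} \sigma_j^2.$$

From Poisson statistics this becomes

$$\sigma_{I_s}^2 = \sum_{i=1}^{m} G\rho_i + (m/n)^2 \sum_{j=1}^{n} G\rho_j = G\Big[I_s + I_{bg} + (m/n)^2 \sum_{j=1}^{n} \rho_j\Big],$$

where $I_{bg}$ is the background summed over all peak pixels. We can also write

$$\sum_{j=1}^{n} \rho_j \simeq (n/m)\,I_{bg}$$

(this is only strictly true if the background region has *mm* symmetry). Then

$$\sigma_{I_s}^2 = G\,[I_s + I_{bg} + (m/n)\,I_{bg}]. \tag{11.2.5.11}$$

This expression shows the importance of the background in determining the standard deviation of the intensity. For weak reflections, the Bragg intensity $I_s$ is often much smaller than the background $I_{bg}$, and the error in the intensity is determined entirely by the background contribution.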
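A short numerical sketch may help to show how the background dominates the error for a weak spot. It assumes that both the peak and background regions have *mm* symmetry, so that the background plane reduces to the mean background count, and that the Poisson variance takes the form $G[I_s + I_{bg} + (m/n)I_{bg}]$. The function name and the pixel values are invented for illustration.

```python
import math


def summation_integration(peak_counts, background_counts, gain=1.0):
    """Summation intensity and its Poisson standard deviation.

    Assumes mm symmetry, so the background under each peak pixel is
    simply the mean of the background pixels.
    """
    m = len(peak_counts)
    n = len(background_counts)
    mean_background = sum(background_counts) / n
    i_bg = m * mean_background            # background under the peak
    i_s = sum(peak_counts) - i_bg         # background-corrected intensity
    variance = gain * (i_s + i_bg + (m / n) * i_bg)
    return i_s, math.sqrt(variance)


# A weak spot: 5 peak pixels on a background of 10 counts per pixel,
# with the background estimated from 20 pixels.
i_s, sigma = summation_integration([12, 15, 30, 15, 12], [10] * 20)
```

Here the Bragg intensity is 34 counts, but the variance (96.5) is dominated by the two background terms, which together contribute 62.5.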

Standard-deviation estimates calculated using (11.2.5.11) are generally in quite good agreement with observed differences between the intensities of symmetry-related reflections for weak or medium intensities. This is particularly true if other sources of systematic error are minimized by measuring the *same* reflections five or more times, by doing multiple exposures of the same small oscillation range and then processing the data in space group *P*1. However, even in this latter case, the agreement between strong intensities is significantly worse than that predicted using equation (11.2.5.11). This is consistent with the observation that it is very unusual to obtain merging *R* factors lower than 0.01, even for very strong reflections where Poisson statistics would suggest merging *R* factors should be in the range 0.002–0.003.

An experiment in which a diffraction spot recorded on photographic film was scanned many times on an optical microdensitometer showed that the r.m.s. variation in individual pixel values between the scans was greatest for those pixels immediately surrounding the centre of the spot, where the gradient of the optical density was greatest. One explanation for this observation is that these optical densities will be most sensitive to small errors in positioning the reading head, due to vibration or mechanical defects. A simple model for the instrumental contribution to the standard deviation of the spot intensity is obtained by introducing an additional term for each pixel in the spot peak: where is the average gradient and *K* is a proportionality constant. Taking a triangular reflection profile, the gradient and integrated intensity are related by where *x* is the half-width of the reflection (in pixels).

Writing gives where the factor *A* allows for differences in spot size and *K* is, ideally, a constant for a given instrument.

The total variance in the integrated intensity is then A value for *K* can be determined by comparing the goodness-of-fit of the standard profiles to individual reflection profiles (of fully recorded reflections) with that calculated from combined Poisson statistics and the instrument error term. Standard deviations estimated using (11.2.5.17) give much more realistic estimates than those based on (11.2.5.11), even for data collected with charge-coupled-device (CCD) detectors where the physical model for the source of the error is clearly not appropriate.

Providing the background and peak regions are correctly defined, summation integration provides a method for evaluating integrated intensities that is both robust and free from systematic error. For weak reflections, however, many of the pixels in the peak region will contain very little signal (Bragg intensity) but will contribute significantly to the noise because of the Poissonian variation in the background [as shown by the background term $I_{bg}$ in equation (11.2.5.11)]. Profile fitting provides a means of improving the signal-to-noise ratio for this class of reflection (but will provide no improvement for reflections where the background level is negligible).

In order to apply profile-fitting methods, the first requirement is to derive a 'standard' profile that accurately represents the true reflection profile. Although analytical functions can be used, it is difficult to define a simple function that will cope adequately with the wide variation in spot shapes that can arise in practice. Most programs therefore rely on an empirical profile derived by summing many different spots. The optimum profile is that which provides the best fit to all the contributing reflections, *i.e.* that which minimizes

$$R_2 = \sum_h w_j(h)\,[K_h P_j - \rho_j(h)]^2, \tag{11.2.6.1}$$

where $P_j$ is the profile value for the *j*th pixel, $\rho_j(h)$ is the observed background-corrected count at that pixel for reflection *h*, $K_h$ is a scale factor and $w_j(h)$ is a weight for the *j*th pixel of reflection *h.* The summation extends over all reflections contributing to the profile. The weight is given by

$$w_j(h) = 1/\sigma_j^2 \tag{11.2.6.2}$$

and from Poisson statistics $\sigma_j^2$ is the expectation value of the counts at pixel *j*, and is given by

$$\sigma_j^2 = K_h P_j + a p_j + b q_j + c. \tag{11.2.6.3}$$

After Rossmann (1979), the summation integration intensity $I_s$ can be used to derive a value for $K_h$:

$$K_h = I_s \Big/ \sum_j P_j. \tag{11.2.6.4}$$

In equations (11.2.6.3) and (11.2.6.4), as the profile values $P_j$ are not yet determined, a preliminary profile derived, for example, from simple summation of strong reflections used in the detector-parameter refinement can be used, which will give acceptable weights for use in equation (11.2.6.1).
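One way to make the construction of a standard profile concrete is to note that setting the derivative of the weighted residual with respect to each profile value to zero gives $P_j = \sum_h w_j(h) K_h \rho_j(h) / \sum_h w_j(h) K_h^2$. The sketch below applies this closed form with weights taken from a preliminary profile. The data layout (one list of background-corrected counts and one list of background values per reflection) and the function name are assumptions, and this is not presented as the procedure of any particular program.

```python
def standard_profile(reflections, preliminary):
    """Weighted least-squares standard profile.

    reflections: list of (counts, background) pairs, where counts are the
        background-corrected pixel values and background the plane values
        at the same pixels (layout assumed for this sketch).
    preliminary: rough profile used only to set scale factors and weights.
    """
    npix = len(preliminary)
    psum = sum(preliminary)
    numerator = [0.0] * npix
    denominator = [0.0] * npix
    for counts, background in reflections:
        k_h = sum(counts) / psum          # scale from the summation intensity
        for j in range(npix):
            variance = k_h * preliminary[j] + background[j]
            w = 1.0 / max(variance, 1e-6)  # guard against non-positive variance
            numerator[j] += w * k_h * counts[j]
            denominator[j] += w * k_h * k_h
    if min(denominator) <= 0.0:
        raise ValueError("no usable reflections for at least one pixel")
    return [numerator[j] / denominator[j] for j in range(npix)]


# Two noise-free reflections sharing the shape [1, 4, 1]; flat preliminary profile.
shape = [1.0, 4.0, 1.0]
reflections = [([k * s for s in shape], [5.0, 5.0, 5.0]) for k in (10.0, 20.0)]
profile = standard_profile(reflections, [1.0, 1.0, 1.0])
```

The result is on the same scale as the preliminary profile, so only its shape is meaningful; for the noise-free input above the shape of the contributing spots is recovered exactly.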

This method of deriving the standard profile is only appropriate for fully recorded reflections. However, in many cases there will be very few or no fully recorded reflections on each image. In such cases the profile is determined by simply adding together the background-corrected pixel counts from all contributing reflections. In the program *MOSFLM* (Leslie, 1992), the profiles are determined using reflections on, typically, ten or more successive images, so that partials will be summed to give the correct fully recorded profile for the majority of the contributing reflections. Tests carried out using standard profiles derived using only fully recorded reflections and equation (11.2.6.1), or using both fully recorded and partially recorded reflections and simple summation, give data of the same quality as judged by the merging statistics.

The reflection profile changes across the face of the detector, due to obliquity of incidence, changes in the projected diffracting volume and geometric factors. In the *MOSFLM* program, this variation is accommodated by determining several standard profiles (typically nine or 25) for different regions of the detector. When evaluating the profile-fitted intensity for a given reflection, a weighted sum of the nearest standard profiles is calculated to provide the best estimate of the true profile at that position on the detector. For the central regions of the detector there will be four contributing profiles, while at the edges there will be between one and three. The weights assigned to each profile vary linearly with the distance from the reflection to the centres of the regions used in determining the standard profiles. An alternative procedure used in *DENZO* (Otwinowski & Minor, 1997) is to evaluate a new profile for each reflection based on spots lying within a pre-specified radius.
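The linear weighting of neighbouring standard profiles can be illustrated with a bilinear blend over a regular grid of region centres. This is a simplified sketch under stated assumptions: the region centres form a rectangular grid, positions outside the outermost centres are clamped to the nearest centre (so edge positions use one or two profiles here), and the function names are hypothetical. It is not a description of the actual *MOSFLM* implementation.

```python
def axis_weights(t, centres):
    """Indices and linear weights of the region centres bracketing t."""
    if t <= centres[0]:
        return [(0, 1.0)]
    if t >= centres[-1]:
        return [(len(centres) - 1, 1.0)]
    for k in range(len(centres) - 1):
        lo, hi = centres[k], centres[k + 1]
        if lo <= t <= hi:
            f = (t - lo) / (hi - lo)
            return [(k, 1.0 - f), (k + 1, f)]


def blend_profiles(x, y, centres_x, centres_y, profiles):
    """Weighted sum of the nearest standard profiles at detector position (x, y).

    profiles[iy][ix] is the standard profile (a flat list of pixel values)
    for the region centred at (centres_x[ix], centres_y[iy]).
    """
    npix = len(profiles[0][0])
    blended = [0.0] * npix
    for iy, wy in axis_weights(y, centres_y):
        for ix, wx in axis_weights(x, centres_x):
            for j in range(npix):
                blended[j] += wx * wy * profiles[iy][ix][j]
    return blended


# 3 x 3 grid of one-pixel 'profiles' whose value encodes the region index.
centres = [100.0, 300.0, 500.0]
grid = [[[float(3 * iy + ix)] for ix in range(3)] for iy in range(3)]
central = blend_profiles(200.0, 200.0, centres, centres, grid)
corner = blend_profiles(50.0, 50.0, centres, centres, grid)
```

A position midway between four region centres receives equal contributions from all four profiles, while a position in the corner of the detector uses the single nearest profile.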

Given an appropriate standard profile, the reflection intensity for fully recorded reflections is evaluated by determining the scale factor *K* and background plane constants *a*, *b*, *c* which minimize

$$R_3 = \sum_i w_i\,(\rho_i - a p_i - b q_i - c - K P_i)^2, \tag{11.2.6.5}$$

where the summation is over all valid pixels in the measurement box. As before,

$$w_i = 1/\sigma_i^2 \tag{11.2.6.6}$$

and

$$\sigma_i^2 = a p_i + b q_i + c + J P_i. \tag{11.2.6.7}$$

In order to calculate the weights, the background plane constants and summation integration intensity are evaluated as described in Section 11.2.5, at the same time identifying any outliers in the background. The summation integration intensity $I_s$ is used to evaluate the scale factor *J* in equation (11.2.6.7) using

$$J = I_s \Big/ \sum_i P_i. \tag{11.2.6.8}$$

In equation (11.2.6.5), the summation is over all valid pixels within the measurement box. This excludes pixels that are overlapped by neighbouring spots (if any) and any outliers identified in the background region.

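Minimizing the residual $R_3$ of equation (11.2.6.5) with respect to *K*, *a*, *b* and *c* leads to four linear equations from which *K*, *a*, *b* and *c* can be determined:

$$
\begin{pmatrix}
\sum w_i P_i^2 & \sum w_i p_i P_i & \sum w_i q_i P_i & \sum w_i P_i \\
\sum w_i p_i P_i & \sum w_i p_i^2 & \sum w_i p_i q_i & \sum w_i p_i \\
\sum w_i q_i P_i & \sum w_i p_i q_i & \sum w_i q_i^2 & \sum w_i q_i \\
\sum w_i P_i & \sum w_i p_i & \sum w_i q_i & \sum w_i
\end{pmatrix}
\begin{pmatrix} K \\ a \\ b \\ c \end{pmatrix}
=
\begin{pmatrix} \sum w_i P_i \rho_i \\ \sum w_i p_i \rho_i \\ \sum w_i q_i \rho_i \\ \sum w_i \rho_i \end{pmatrix}.
$$

The profile-fitted intensity is then given by

$$I_p = K \sum_i P_i.$$

The standard deviation in the profile-fitted intensity is given by

$$\sigma_{I_p}^2 = \frac{R_3}{N - 4}\,(A^{-1})_{KK}\,\Big(\sum_i P_i\Big)^2,$$

where *N* is the number of pixels in the summation and $(A^{-1})_{KK}$ is the diagonal element for the scale factor *K* of the inverse normal matrix (used to minimize $R_3$).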

In the case of partially recorded reflections, it is no longer valid to fit the sum of the scaled standard profile and a background plane to all pixels in the measurement box. Partially recorded reflections can have a profile that differs significantly from the standard profile, with the result that the background plane constants take on physically unreasonable values in an attempt to compensate for this difference. Therefore, for partially recorded reflections, the summation in equation (11.2.6.5) is restricted to pixels in the peak region of the measurement box, with the background plane constants held at the values determined from the background pixels. Minimizing with respect to the scale factor *K* then gives

$$K = \frac{\sum_i w_i P_i\,(\rho_i - a p_i - b q_i - c)}{\sum_i w_i P_i^2}, \tag{11.2.6.15}$$

where all summations are over the peak region only.
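The single-parameter fit over the peak region can be sketched in a few lines. The code below assumes that the background value under each peak pixel has already been evaluated from the plane constants, that the weights are the inverse of the expected counts (background plus scaled profile, with the scale taken from the summation intensity), and that the scale factor is the weighted least-squares solution $\sum wP(\rho - \text{background})/\sum wP^2$. The function name and the floor applied to the variance are assumptions.

```python
def profile_fit_intensity(counts, background, profile, i_sum):
    """Profile-fitted intensity with a fixed background.

    counts:     total counts at the peak pixels
    background: background plane value at each peak pixel
    profile:    standard profile values at the same pixels
    i_sum:      summation integration intensity, used only for the weights
    """
    j_scale = i_sum / sum(profile)
    numerator = 0.0
    denominator = 0.0
    for rho, bg, p in zip(counts, background, profile):
        variance = bg + j_scale * p            # expected counts at this pixel
        w = 1.0 / max(variance, 1e-6)          # floor guards weak or negative I_s
        numerator += w * p * (rho - bg)
        denominator += w * p * p
    k = numerator / denominator
    return k * sum(profile)


# Noise-free spot: counts = background + 7 * profile.
profile = [1.0, 2.0, 4.0, 2.0, 1.0]
background = [5.0] * 5
counts = [bg + 7.0 * p for bg, p in zip(background, profile)]
i_sum = sum(c - bg for c, bg in zip(counts, background))
i_p = profile_fit_intensity(counts, background, profile, i_sum)
```

For data that match the standard profile exactly, the profile-fitted and summation intensities agree (both are 70 in this example); they differ only when noise or profile mismatch is present.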

It is not possible to derive a standard deviation for partially recorded reflections based on the fit of the scaled standard profile (because partially recorded reflections have a different spot profile). For these reflections, the standard deviation can be calculated using equation (11.2.5.17).

In order to apply equation (11.2.6.5), it is necessary to exclude all pixels in the measurement box that are overlapped by a neighbouring spot. This applies not only to the pixels of the reflection being integrated, but also to the pixels of all the reflections used to form the standard profile. Consequently, a pixel should be excluded even if it is only overlapped by a neighbouring spot for one of the reflections used in forming the standard profile. When processing data from large unit cells, this can lead to a very high percentage of the background pixels being rejected and therefore a poor determination of the background plane parameters. In these circumstances, the background plane is determined using only background pixels and excluding only those pixels that are overlapped by neighbours for the reflection actually being integrated. The profile-fitted intensity for both fully recorded and partially recorded reflections is then evaluated in the way described for partially recorded reflections in Section 11.2.6.2, with the summation in equation (11.2.6.15) extending only over peak pixels. The standard deviation in the intensity for partially recorded reflections is derived from equation (11.2.5.17) as before. For fully recorded reflections, the standard deviation has two components: the first is based on the fit of the scaled standard profile to the reflection profile and the second on the contribution from the background: where *m* and *n* are the number of pixels in the peak and background, respectively.

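For very strong reflections, the background level is very small and equation (11.2.6.15) reduces to

$$K = \frac{\sum_i w_i P_i \rho_i}{\sum_i w_i P_i^2} \tag{11.2.6.18}$$

and the weights are given by

$$w_i = 1/(J P_i).$$

Substituting for $w_i$ in (11.2.6.18) gives

$$K = \sum_i \rho_i \Big/ \sum_i P_i, \qquad\text{so that}\qquad I_p = K \sum_i P_i = \sum_i \rho_i = I_s.$$

As pointed out by Z. Otwinowski (personal communication), this shows that for correctly weighted profile fitting, the profile-fitted intensity reduces to the summation integration intensity for very strong intensities.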

For very weak reflections, all pixels will have very similar counts and therefore all the weights will be the same. For simplicity, consider the case where the profile fit is evaluated only for the peak pixels, then equation (11.2.6.15) reduces to

$$I_p = \frac{\sum_i P_i\,(\rho_i - a p_i - b q_i - c)\;\sum_i P_i}{\sum_i P_i^2}. \tag{11.2.6.21}$$

The second and third summations in this equation depend only on the shape of the standard profile. This shows that the intensity is a weighted sum of the individual background-corrected pixel counts (rather than a simple unweighted sum, as is the case for summation integration). Because the values of $P_i$ are a maximum in the centre of the spot, this will place a higher weight on those pixels where the contribution of the Bragg diffraction is greatest, and a very low weight on the peripheral pixels where the Bragg diffraction is weakest. In this way, profile fitting improves the signal-to-noise ratio without the risk of introducing any systematic error that may result by simply reducing the size of the peak region for weak spots.

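For very weak reflections, where all the weights are approximately the same, the variance in $I_p$ using equation (11.2.6.21) is given by

$$\sigma_{I_p}^2 = \frac{\big(\sum_j P_j\big)^2\,\sum_i P_i^2 \sigma_i^2}{\big(\sum_j P_j^2\big)^2}.$$

Assuming a flat background and very weak intensity, then from Poisson statistics $\sigma_i^2 = G\rho_i$ and, as $\rho_i$ has approximately the same value for all pixels,

$$\sigma_{I_p}^2 = G\rho_i\,\frac{\big(\sum_j P_j\big)^2}{\sum_j P_j^2}.$$

The variance in the summation integration intensity is simply

$$\sigma_{I_s}^2 = \sum_{i=1}^{m} \sigma_i^2 = m\,G\rho_i.$$

The ratio of the variances is thus

$$\frac{\sigma_{I_s}^2}{\sigma_{I_p}^2} = \frac{m \sum_j P_j^2}{\big(\sum_j P_j\big)^2}.$$

For a typical spot profile, the right-hand side (which depends only on the shape of the standard profile) has a value of 2, showing that profile fitting can reduce the standard deviation in the integrated intensity by a factor of $\sqrt{2}$.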
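The dependence of this variance ratio on the profile shape is easy to explore numerically. The sketch below evaluates $m\sum P^2/(\sum P)^2$ for a flat profile and for a Gaussian spot on a 9 × 9 peak region. The Gaussian width of 1.8 pixels is an arbitrary choice made for illustration, not a value from the text.

```python
import math


def variance_ratio(profile):
    """Ratio of summation to profile-fitting variance for a very weak spot."""
    m = len(profile)
    return m * sum(p * p for p in profile) / sum(profile) ** 2


def gaussian_profile(half_size, width):
    """Flattened (2*half_size + 1)^2 Gaussian spot of the given width in pixels."""
    r = range(-half_size, half_size + 1)
    return [math.exp(-(x * x + y * y) / (2.0 * width * width))
            for y in r for x in r]


flat_ratio = variance_ratio([1.0] * 81)
gauss_ratio = variance_ratio(gaussian_profile(4, 1.8))
```

A flat profile gives a ratio of exactly 1, so profile fitting offers no gain, whereas the Gaussian example gives a ratio close to 2. The ratio grows if the peak region is made larger relative to the spot, which reflects the extra background pixels that summation integration would include.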

If adjacent spots are not fully resolved, there will be a systematic error in the integrated intensity which will be largest for weak spots that are adjacent to very strong spots. However, the profile-fitted intensity will be affected less than the summation integration intensity, because the peripheral pixels (where the influence of neighbouring spots is greatest) are down-weighted relative to the central pixels (where the neighbours will have least influence).

Further steps can be taken to minimize the errors caused by overlapping spots. Firstly, when forming the standard profiles, reflections are only included if they are significantly stronger than their nearest neighbours. This will minimize the errors in the standard profiles. Secondly, when evaluating the profile-fitted intensity of a particular reflection, pixels can be omitted if they are adjacent to a pixel that is part of a neighbouring spot (rather than having to be part of that spot).

In the same way that outliers in the background region can be identified and rejected (see Section 11.2.5.1.1), it is possible in principle to identify outliers in the peak region of fully recorded reflections as those pixels whose deviation from the scaled standard profile is significantly greater than that expected from counting statistics. This approach works well if the feature that gives rise to the outliers affects only a small fraction of the peak pixels and gives rise to large deviations, and this is the case for some zingers or dead pixels, and for diffraction from small ice crystals when collecting data from cryo-cooled samples.

Another source of outliers is the encroachment of a strong neighbouring spot into the peak region, as discussed in Section 11.2.6.7.1. When dealing with peripheral pixels, the outlier test can be applied to both fully recorded and partially recorded reflections, but a high σ cutoff (*e.g.* 10–20) must be used to avoid rejecting pixels that do not fit the profile simply because they correspond to a partially recorded spot.

Owing to the limited dynamic range of current detectors, it is common for many low-resolution spots to contain saturated pixels. Providing the saturation level of the detector is known, such pixels can simply be excluded from the profile fitting, allowing a reasonable estimate of the true intensity (except when the majority of the pixels are saturated). A knowledge of the strong intensities is essential for structure solution based on molecular replacement techniques, and so this is a very useful additional feature of profile fitting.
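The treatment of saturated pixels can be illustrated by fitting the scale factor to the unsaturated pixels only and then applying it to the whole standard profile. For brevity this sketch uses equal weights, which is an assumption made here rather than the weighting scheme described earlier; the function name, the saturation level and the pixel values are also invented.

```python
def fit_excluding_saturated(counts, background, profile, saturation):
    """Profile-fitted intensity using only pixels below the saturation level.

    Equal weights are used for simplicity (an assumption of this sketch).
    """
    numerator = 0.0
    denominator = 0.0
    for rho, bg, p in zip(counts, background, profile):
        if rho >= saturation:
            continue                      # saturated pixel: exclude from the fit
        numerator += p * (rho - bg)
        denominator += p * p
    if denominator == 0.0:
        raise ValueError("all pixels are saturated")
    k = numerator / denominator
    return k * sum(profile)               # scale applied to the full profile


# True spot is 1000 * profile on a background of 10; the central pixel is
# clipped at the assumed saturation level of 5000.
profile = [1.0, 2.0, 8.0, 2.0, 1.0]
background = [10.0] * 5
saturation = 5000.0
counts = [min(1000.0 * p + bg, saturation) for p, bg in zip(profile, background)]
i_p = fit_excluding_saturated(counts, background, profile, saturation)
i_sum = sum(c - bg for c, bg in zip(counts, background))
```

In this noise-free example the fit recovers the true intensity of 14000, whereas simple summation of the recorded counts gives 10990 because of the clipped central pixel.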

Greenhough & Suddath (1986) have shown that when profile fitting is applied to partially recorded reflections this leads to a systematic error in the individual intensities, but there is no systematic error in the total summed intensity. Although their analysis is strictly only applicable to the case of unweighted profile fitting, experience has shown that even when using weighted profile fitting there is no evidence of systematic errors in the summed profile-fitted intensities of partially recorded reflections. This is particularly important as many data sets collected from frozen crystals have few, if any, fully recorded reflections.

The fundamental assumption in profile fitting is that the standard profiles accurately reflect the true profile of the reflection being integrated. Errors in the standard profile will result in systematic errors in the profile-fitted intensities. While these errors will often be small compared to the random (Poissonian) error for weak reflections, this is not necessarily the case for strong reflections, as the systematic error is typically a small percentage of the total intensity. Because the standard profiles are derived from the summation of many contributing reflections, small positional errors in spot prediction will lead to a broadening of the standard profile relative to the profile of an individual spot. The same broadening can occur because of the finite sampling interval in the image, which means that a predicted spot position can lie up to half a pixel away from the centre of the measurement box. This error can be minimized by interpolating the pixel values in the image onto a grid which is centred exactly on the predicted position, but the interpolation step itself will inevitably distort the reflection profile. In spite of these difficulties, providing adequate care is taken to determine the crystal and detector parameters accurately (as mentioned in Section 11.2.2), so that the spot positions are predicted to within a small fraction of the overall spot width, there is no suggestion (from merging statistics at least) for significant systematic error, even in the stronger intensities.

### Acknowledgements

I would like to thank Dr A. J. Wonacott, Dr P. Brick and Dr P. R. Evans for many stimulating and critical discussions on all aspects of data integration.

### References

Diamond, R. (1969). *Profile analysis in single crystal diffractometry*. *Acta Cryst.* A**25**, 43–55.

Ford, G. C. (1974). *Intensity determination by profile fitting applied to precession photographs*. *J. Appl. Cryst.* **7**, 555–564.

Greenhough, T. J. & Suddath, F. L. (1986). *Oscillation camera data processing. 4. Results and recommendations for the processing of synchrotron radiation data in macromolecular crystallography*. *J. Appl. Cryst.* **19**, 400–409.

Lehmann, M. S. & Larsen, F. K. (1974). *A method for location of the peaks in step-scan-measured Bragg reflexions*. *Acta Cryst.* A**30**, 580–584.

Leslie, A. G. W. (1992). *Recent changes to the MOSFLM package for processing film and image plate data*. *CCP4 and ESF-EACMB Newsletter on Protein Crystallography*. Warrington: Daresbury Laboratory.

Otwinowski, Z. & Minor, W. (1997). *Processing of X-ray diffraction data collected in oscillation mode*. *Methods Enzymol.* **276**, 307–326.

Rossmann, M. G. (1979). *Processing oscillation diffraction data for very large unit cells with an automatic convolution technique and profile fitting*. *J. Appl. Cryst.* **12**, 225–238.

Rossmann, M. G., Leslie, A. G. W., Abdel-Meguid, S. S. & Tsukihara, T. (1979). *Processing and post-refinement of oscillation camera data*. *J. Appl. Cryst.* **12**, 570–581.

Winkler, F. K., Schutt, C. E. & Harrison, S. C. (1979). *The oscillation method for crystals with very large unit cells*. *Acta Cryst.* A**35**, 901–911.