International Tables for Crystallography, Volume F: Crystallography of biological macromolecules. Edited by E. Arnold, D. M. Himmel and M. G. Rossmann. © International Union of Crystallography 2012
International Tables for Crystallography (2012). Vol. F, ch. 14.3, pp. 379-383
https://doi.org/10.1107/97809553602060000846

## Chapter 14.3. Automated MAD and MIR structure solution

In favourable cases, structure solution by X-ray crystallography using the multiwavelength anomalous diffraction (MAD) or multiple isomorphous replacement (MIR) methods can be a straightforward, though often lengthy, process. The recently developed *Solve* software (Terwilliger & Berendzen, 1999*b*) is designed to fully automate this class of structure solution. The overall approach is to link together all the analysis steps that a crystallographer would normally carry out into a seamless procedure, and in the process to convert each decision-making step into an optimization problem.

In the case of both MAD and MIR data, a key element of the procedure is the scoring and ranking of possible solutions. This scoring procedure makes it possible to treat structure solution as an optimization procedure, rather than a decision-making one. In the case of MAD data, a second key element of the procedure is the conversion of MAD data to a pseudo-SIRAS form (Terwilliger, 1994*b*) that allows much more rapid analysis than one involving the full MAD data set.

The MAD and MIR approaches to structure solution are conceptually very similar and share several important steps. Two of these are the identification of possible locations of heavy or anomalously scattering atoms and an analysis of the quality of each of these potential heavy-atom solutions. In each method, trial partial structures for these heavy or anomalously scattering atoms are often obtained by inspection of difference Patterson functions or by semi-automated analysis (*e.g.* Terwilliger *et al.*, 1987; Chang & Lewis, 1994; Vagin & Teplyakov, 1998). In other cases, direct-methods approaches have been used to find heavy-atom sites (Sheldrick, 1990; Miller *et al.*, 1994). Potential heavy-atom solutions found in any of these approaches are often just a starting point for structure solution, with additional sites found by difference Fourier or other approaches.

The analysis of the quality of potential heavy-atom solutions is also very similar in the MIR and MAD methods. In both cases a partial structure is used to calculate native phases for the entire structure, and the electron density that results is examined to see if the expected features of the macromolecule are found. Additionally, the agreement of the heavy-atom model with the difference Patterson function and the figure of merit of phasing are commonly used to evaluate the quality of a solution. In many cases, an analysis of heavy-atom sites by sequential deletion of individual sites or derivatives is an important criterion of quality as well (Dickerson *et al.*, 1961).

The process of structure solution can be thought of largely as a decision-making process. In the early stages of solution, a crystallographer must choose which of several potential trial solutions may be worth pursuing. At a later stage, the crystallographer must choose which peaks in a heavy-atom difference Fourier are to be included in the heavy-atom model, and which hand of the solution is correct. At a final stage, the crystallographer must decide whether the solution process is complete and which of the possible heavy-atom models is the best. The most important feature of the *Solve* software is the use of a consistent scoring algorithm as the basis for making all these decisions.

In order to make automated structure solution practical, it was necessary to be able to evaluate heavy-atom solutions very rapidly. This is because the automated approach used by *Solve* requires analysis of many heavy-atom solutions (typically 300–1000). For each heavy-atom solution examined, the heavy-atom sites have to be refined and phases calculated. In implementing automated structure solution, it was important to recognize the need for a trade-off between the most accurate heavy-atom refinement and phasing at all stages of structure solution and the time required to carry it out. The balance chosen for *Solve* was to use the most accurate available methods for final phase calculations, and to use approximate but much faster methods for all intermediate refinements and phase calculations. The refinement method chosen on this basis was origin-removed Patterson refinement (Terwilliger & Eisenberg, 1983), which treats each derivative in an MIR data set independently and which is very fast because it does not require phase calculation. The phasing approach used for MIR data throughout *Solve* is Bayesian correlated phasing (Terwilliger & Berendzen, 1996; Terwilliger & Eisenberg, 1987), which takes into account the correlation of non-isomorphism among derivatives without substantially slowing down phase calculations.

For MAD data, Bayesian calculations of phase probabilities are very slow (*e.g.* Terwilliger & Berendzen, 1997; de La Fortelle & Bricogne, 1997). Consequently, we have used an alternative procedure for all MAD phase calculations except those done at the very final stage. This alternative is to convert the MAD data set into a form that is similar to one obtained in the single isomorphous replacement with anomalous scattering (SIRAS) method. In this way, a single data set with isomorphous and anomalous differences is obtained that can be used in heavy-atom refinement by the origin-removed Patterson refinement method and in phasing by conventional SIRAS phasing (Terwilliger & Eisenberg, 1987).

The conversion of MAD data to a pseudo-SIRAS form that has almost the same information content requires two important assumptions. The first assumption is that the structure factor corresponding to anomalously scattering atoms in a structure varies in magnitude but not in phase at various X-ray wavelengths. This assumption will hold when there is one dominant type of anomalously scattering atom. The second is that the structure factor corresponding to anomalously scattering atoms is small compared to the structure factor from all other atoms. As long as these two assumptions hold, the information in a MAD experiment is largely contained in just three quantities: a structure factor ($F_o$) corresponding to the scattering from non-anomalously scattering atoms, a dispersive or isomorphous difference at a standard wavelength $\lambda_o$ ($\Delta_{ISO}^{\lambda_o}$), and an anomalous difference ($\Delta_{ANO}^{\lambda_o}$) at the same standard wavelength (Terwilliger, 1994*b*). It is easy to see that these three quantities could be treated just like a SIRAS data set with the 'native' structure factor replaced by $F_o$, the derivative structure factor replaced by $F_o + \Delta_{ISO}^{\lambda_o}$, and the anomalous difference replaced by $\Delta_{ANO}^{\lambda_o}$ (Terwilliger, 1994*b*). This is the approach taken by *Solve*. In this section, it is briefly shown how these three quantities can be estimated from MAD data.

For a particular reflection and a particular wavelength $\lambda_j$, we can write the total normal (*i.e.*, non-anomalous) scattering from a structure ($F_{\mathrm{tot}}^{\lambda_j}$) as the sum of two components. One is the scattering from all non-anomalously scattering atoms ($F_o$). This scattering is wavelength-independent. The second is the normal scattering from anomalously scattering atoms ($F_H^{\lambda_j}$) at wavelength $\lambda_j$. This term includes wavelength-dependent dispersive shifts in atomic scattering due to the *f*′ term in the scattering factor, but not the anomalous part due to the *f*″ term. The magnitude of the total scattering factor can then be written in the form

$$F_{\mathrm{tot}}^{\lambda_j} = \left|\mathbf{F}_o + \mathbf{F}_H^{\lambda_j}\right|. \tag{14.3.5.1}$$

Here $F_o$ and $F_{\mathrm{tot}}^{\lambda_j}$ can be thought of as corresponding, respectively, to the native structure factor, $F_P$, and the derivative structure factor, $F_{PH}$, as used in the method of isomorphous replacement (Blundell & Johnson, 1976). If the scattering from anomalously scattering atoms is small compared to that from all other atoms, equation (14.3.5.1) can be rewritten in the approximate form

$$F_{\mathrm{tot}}^{\lambda_j} \approx F_o + F_H^{\lambda_j}\cos\alpha, \tag{14.3.5.2}$$

where α is the phase difference between the structure factors corresponding to non-anomalously and anomalously scattering atoms in the unit cell, $\mathbf{F}_o$ and $\mathbf{F}_H^{\lambda_j}$, respectively, at this X-ray wavelength.

The data in a MAD experiment consist of observations of structure-factor amplitudes for Bijvoet pairs, $F^{+}_{\lambda_j}$ and $F^{-}_{\lambda_j}$, for several X-ray wavelengths $\lambda_j$. These can be rewritten in terms of an average structure-factor amplitude $\bar{F}_{\lambda_j}$ and an anomalous difference $\Delta_{ANO}^{\lambda_j}$ (*cf.* Blundell & Johnson, 1976). We would like to convert these into estimates of the amplitude of the structure factor corresponding to the non-anomalously scattering atoms alone, the amplitude of the structure factor corresponding to the entire structure at a standard wavelength, and the anomalous difference at the standard wavelength.

The normal scattering due to anomalously scattering atoms ($F_H^{\lambda_j}$) changes in magnitude but not direction as a function of X-ray wavelength. We can therefore write (Terwilliger, 1994*b*)

$$F_H^{\lambda_j} = F_H^{\lambda_o}\,\frac{f_o + f'(\lambda_j)}{f_o + f'(\lambda_o)}, \tag{14.3.5.3}$$

where $\lambda_o$ is an X-ray wavelength arbitrarily defined as a standard, and the real part of the scattering factor for the anomalously scattering atoms at wavelength $\lambda_j$ is $f_o + f'(\lambda_j)$. A corresponding approximation for the anomalous differences at various wavelengths can also be written (Terwilliger & Eisenberg, 1987)

$$\Delta_{ANO}^{\lambda_j} \approx \Delta_{ANO}^{\lambda_o}\,\frac{f''(\lambda_j)}{f''(\lambda_o)}, \tag{14.3.5.4}$$

where $f''(\lambda_j)$ is the imaginary part of the scattering factor for the anomalously scattering atoms at wavelength $\lambda_j$. Based on equation (14.3.5.4), anomalous differences at any wavelength can be estimated using measurements at the standard wavelength.

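An estimate of the structure-factor amplitude ($F_o$) corresponding to the scattering from non-anomalously scattering atoms and of the dispersive difference at the standard wavelength ($\Delta_{ISO}^{\lambda_o}$) can be obtained from average structure-factor amplitudes ($\bar{F}_{\lambda_i}$, $\bar{F}_{\lambda_j}$) at any pair of wavelengths $\lambda_i$ and $\lambda_j$ by proceeding in two steps. Using equations (14.3.5.2) and (14.3.5.3), the component of $\mathbf{F}_H^{\lambda_o}$ along $\mathbf{F}_o$, which we term $\Delta_{ISO}^{\lambda_o}$, can be estimated as

$$\Delta_{ISO}^{\lambda_o} = F_H^{\lambda_o}\cos\alpha \tag{14.3.5.5}$$

or

$$\Delta_{ISO}^{\lambda_o} \approx \left[\bar{F}_{\lambda_i} - \bar{F}_{\lambda_j}\right]\frac{f_o + f'(\lambda_o)}{f'(\lambda_i) - f'(\lambda_j)}. \tag{14.3.5.6}$$

Then, in turn, this estimate of $\Delta_{ISO}^{\lambda_o}$ can be used to obtain $F_o$:

$$F_o \approx \bar{F}_{\lambda_j} - \Delta_{ISO}^{\lambda_o}\,\frac{f_o + f'(\lambda_j)}{f_o + f'(\lambda_o)}. \tag{14.3.5.7}$$

This set of $F_o$, $\Delta_{ISO}^{\lambda_o}$ and $\Delta_{ANO}^{\lambda_o}$ can then be used just as $F_P$, $F_{PH}$ and $\Delta_{ANO}$ are used in the SIRAS method, with $F_{PH}$ taken as $F_o + \Delta_{ISO}^{\lambda_o}$.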
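The two-step estimate described in this section can be illustrated for a single reflection and one pair of wavelengths. The Python sketch below is only an illustration under stated assumptions: the function and argument names are hypothetical and are not taken from *Solve*, `f0` stands for the wavelength-independent part of the real scattering factor of the anomalous scatterer, and the numbers in the example are made up.

```python
def pseudo_siras_from_pair(fbar_i, fbar_j, fp_i, fp_j, fp_std, f0):
    """Estimate F_o and the dispersive difference at the standard wavelength.

    fbar_i, fbar_j : average amplitudes (mean of the Bijvoet pair) at wavelengths i and j
    fp_i, fp_j     : f' of the anomalous scatterer at wavelengths i and j
    fp_std         : f' at the standard wavelength
    f0             : wavelength-independent part of the real scattering factor
    """
    denom = fp_i - fp_j
    if denom == 0:
        raise ValueError("the two wavelengths must differ in f'")
    # Dispersive difference: component of F_H (standard wavelength) along F_o.
    delta_iso = (fbar_i - fbar_j) * (f0 + fp_std) / denom
    # Remove the heavy-atom contribution at wavelength j to recover F_o.
    f_o = fbar_j - delta_iso * (f0 + fp_j) / (f0 + fp_std)
    return f_o, delta_iso


def anomalous_at_standard(delta_ano_j, fpp_j, fpp_std):
    """Rescale an anomalous difference measured at wavelength j by the ratio of f''."""
    return delta_ano_j * fpp_std / fpp_j


# Made-up example: wavelength i is also taken as the standard wavelength.
f_o, d_iso = pseudo_siras_from_pair(105.0, 106.25, -10.0, -4.0, -10.0, 34.0)
d_ano = anomalous_at_standard(8.0, 4.0, 2.0)
print(f_o, d_iso, d_ano)
```

The example amplitudes were constructed from a non-anomalous amplitude of 100 and a dispersive difference of 5, so the sketch simply recovers the values it was built from.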

The algorithm described above is implemented in the program segment *MADMRG* as part of *Solve* (Terwilliger, 1994*b*). In most cases, there is more than one pair of X-ray wavelengths corresponding to a particular reflection. The estimates from each pair of wavelengths are averaged, using weighting factors based on the uncertainties in each estimate. Data from various pairs of X-ray wavelengths and from various Bijvoet pairs can have very different weights in their contributions to the total. This can be understood by noting that pairs of wavelengths that yield a large value of the denominator in equation (14.3.5.6) (*i.e.*, those that differ considerably in dispersive contributions) would yield relatively accurate estimates of $\Delta_{ISO}^{\lambda_o}$. In the same way, Bijvoet differences measured at the wavelength with the largest value of *f*″ will contribute the most to estimates of $\Delta_{ANO}^{\lambda_o}$.
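Why some wavelength pairs dominate the average can be seen in a small sketch. It assumes inverse-variance weighting and simple first-order error propagation, neither of which is stated in the text as the scheme used by *MADMRG*; all names are hypothetical.

```python
from math import sqrt


def sigma_delta_iso(sig_i, sig_j, fp_i, fp_j, fp_std, f0):
    """Propagate amplitude uncertainties into the dispersive-difference estimate.

    The uncertainty scales inversely with the difference in f' between the two
    wavelengths, so pairs with a large difference give tighter estimates.
    """
    return sqrt(sig_i ** 2 + sig_j ** 2) * abs((f0 + fp_std) / (fp_i - fp_j))


def weighted_mean(estimates):
    """Combine (value, sigma) estimates with inverse-variance weights (an assumption)."""
    weights = [1.0 / (sigma ** 2) for _, sigma in estimates]
    total = sum(weights)
    mean = sum(w * value for w, (value, _) in zip(weights, estimates)) / total
    return mean, sqrt(1.0 / total)


# Two made-up estimates of the same quantity; the more precise one dominates.
print(weighted_mean([(5.0, 1.0), (7.0, 2.0)]))
```

With this weighting, an estimate whose uncertainty is twice as large contributes a quarter of the weight.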

The standard wavelength choice in this analysis is arbitrary, because values at any wavelength can be converted to values at any other wavelength. The standard wavelength does not even have to be one of the wavelengths in the experiment, though it is convenient to choose one of them.

Scoring of potential heavy-atom solutions is an essential part of the *Solve* algorithm because it allows ranking of solutions and appropriate decision making. *Solve* scores trial heavy-atom solutions (or anomalously scattering atom solutions) using four criteria: agreement with the Patterson function, cross-validation of heavy-atom sites, figure of merit, and non-randomness of the electron-density map. The scores for each criterion are normalized to those for a group of starting solutions (most of which are incorrect) to obtain *Z* scores. The total score for a solution is the sum of its *Z* scores after correction for anomalously high scores in any category.
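The normalization to *Z* scores can be sketched as follows. This is a minimal illustration, not the *Solve* implementation: the criterion names are hypothetical labels, the population standard deviation is an assumed choice, and the optional `cap` is only a stand-in for the correction for anomalously high scores, whose actual form is not given here.

```python
from statistics import mean, pstdev

# Hypothetical labels for the four criteria described in the text.
CRITERIA = ("patterson", "cross_validation", "figure_of_merit", "map_nonrandomness")


def z_scores(raw, reference):
    """Normalize raw scores of one solution against a group of starting solutions.

    raw       : dict mapping criterion -> raw score of the trial solution
    reference : list of such dicts for the starting solutions
    """
    z = {}
    for criterion in CRITERIA:
        values = [solution[criterion] for solution in reference]
        mu, sd = mean(values), pstdev(values)
        z[criterion] = (raw[criterion] - mu) / sd if sd > 0 else 0.0
    return z


def total_score(z, cap=None):
    """Sum the Z scores; `cap` (assumed) limits the influence of any single criterion."""
    return sum(min(value, cap) if cap is not None else value for value in z.values())


reference = [{c: 1.0 for c in CRITERIA}, {c: 3.0 for c in CRITERIA}]
trial = {c: 4.0 for c in CRITERIA}
print(total_score(z_scores(trial, reference)))
```

Because most starting solutions are incorrect, a correct solution is expected to stand out with Z scores well above zero in several categories at once.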

The first criterion used by *Solve* for evaluating a trial heavy-atom solution is the agreement between calculated and observed Patterson functions. Comparisons of this type have always been important in the MIR and MAD methods (Blundell & Johnson, 1976). The score for Patterson-function agreement is the average value of the Patterson function at predicted locations of peaks, after multiplication by a weighting factor based on the number of heavy-atom sites in the trial solution. The weighting factor (Terwilliger & Berendzen, 1999*b*) is adjusted so that if two solutions have the same mean value at predicted Patterson peaks, the one with the larger number of sites receives the higher score. Typically the weighting factor is approximately given by $N^{1/2}$, where there are *N* sites in the solution.

In some cases, predicted Patterson vectors fall on high peaks that are not related to the heavy-atom solution. To exclude these contributions, occupancies of each heavy-atom site are refined so that the predicted peak heights approximately match the observed peak heights at the predicted interatomic positions. Then all peaks with heights more than 1σ higher than their predicted values are truncated at this height. The average values are further corrected for instances where more than one predicted Patterson vector falls on the same location by scaling that peak height by the fraction of predicted vectors that are unique.
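A simplified version of this Patterson score is sketched below. It assumes that 'truncated at this height' means clipping an observed peak at its predicted value plus 1σ, that the weighting is roughly the square root of the number of sites, and it omits the correction for coincident vectors; the function name and arguments are hypothetical.

```python
from math import sqrt


def patterson_score(observed, predicted, n_sites, sigma=1.0):
    """Weighted mean Patterson value at predicted interatomic vectors.

    observed  : Patterson values at the predicted vector positions
    predicted : peak heights expected from the refined occupancies
    n_sites   : number of heavy-atom sites in the trial solution
    sigma     : r.m.s. of the Patterson map, used for truncation
    """
    # Clip peaks that are much higher than predicted; they probably belong to
    # unrelated features of the Patterson function.
    truncated = [min(obs, pred + sigma) for obs, pred in zip(observed, predicted)]
    return sqrt(n_sites) * sum(truncated) / len(truncated)


# The second peak is far above its prediction and is clipped from 10 to 5.
print(patterson_score([4.0, 10.0, 3.0], [4.0, 4.0, 4.0], n_sites=4))
```

Without the truncation, the single accidental overlap in this example would more than double the contribution of that vector to the mean.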

A 'cross-validation' difference Fourier analysis is the basis of the second criterion used to evaluate heavy-atom solutions. One at a time, each site in a solution (and any equivalent sites in other derivatives for MIR solutions) is omitted from the heavy-atom model and phases are recalculated. These phases are used in a difference Fourier analysis and the peak height at the location of the omitted site is noted. A similar analysis where a derivative is omitted from phasing and all other derivatives are used to phase a difference Fourier has been used for many years (Dickerson *et al.*, 1961). The score for cross-validation difference Fouriers is the average peak height, after weighting by the same factor used in the difference Patterson analysis.

The mean figure of merit of phasing (*m*) (Blundell & Johnson, 1976) can be a remarkably useful measure of the quality of phasing despite its susceptibility to systematic error (Terwilliger & Berendzen, 1999*b*). The overall figure of merit is essentially a measure of the internal consistency of the heavy-atom solution and the data, and is used as the third criterion for solution quality in *Solve*. As heavy-atom refinement in *Solve* is carried out using origin-removed Patterson refinement (Terwilliger & Eisenberg, 1983), occupancies of heavy-atom sites are relatively unbiased. This minimizes the problem of high occupancies leading to inflated figures of merit. Additionally, using a single procedure for phasing allows comparison between solutions. The score based on figure of merit is simply the unweighted mean for all reflections included in phasing.

The most important criterion used by a crystallographer in evaluating the quality of a heavy-atom solution is the interpretability of the resulting electron-density map. Although a full implementation of such a criterion is difficult, it is quite straightforward to evaluate instead whether the electron-density map has features that are expected for a crystal of a macromolecule. A number of features of electron-density maps could be used for this purpose, including the connectivity of electron density in the maps (Baker *et al.*, 1993), the presence of clearly defined regions of protein and solvent (Wang, 1985; Podjarny *et al.*, 1987; Zhang & Main, 1990; Xiang *et al.*, 1993; Abrahams *et al.*, 1994; Terwilliger & Berendzen, 1999*a*,*c*), and histogram matching of electron densities (Zhang & Main, 1990; Goldstein & Zhang, 1998). We have used the identification of solvent and protein regions as the measure of map quality in *Solve*. This requires that there be both solvent and protein regions in the electron-density map, but for most macromolecular structures the fraction of the unit cell that is occupied by the macromolecule is in the suitable range of 30–70%. The criterion used in scoring by *Solve* is based on the connectivity of the solvent and protein regions (Terwilliger & Berendzen, 1999*c*). The unit cell is divided into boxes approximately twice the resolution of the map on a side, and within each box the r.m.s. electron density is calculated, without including the $F_{000}$ term in the Fourier synthesis. For boxes within the protein region, this r.m.s. electron density will typically be high (as there are some points where atoms are located and other points between atoms), while for those in the solvent region it will be low (as the electron density is fairly uniform). The score based on the connectivity of the protein and solvent regions is simply the correlation coefficient of this r.m.s. electron density for adjacent boxes. If there is a large contiguous protein region and a large contiguous solvent region, then adjacent boxes will have highly correlated values of their r.m.s. electron densities. If the electron density is random, there will be little or no correlation. In practice, for a very good electron-density map, this correlation of local r.m.s. electron density may be as high as 0.5 or 0.6.
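The correlation of local r.m.s. density can be sketched on a gridded map in a few lines. This is an illustrative sketch only: the map is assumed to be a nested list sampled on a regular grid in a periodic cell, the box edge is given in grid points rather than derived from the resolution, and the function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def box_rms(rho, box):
    """r.m.s. density in cubic boxes of `box` grid points, after removing the map mean.

    rho is a nested list rho[x][y][z] whose dimensions are multiples of `box`.
    Subtracting the mean plays the role of omitting the F000 term.
    """
    nx, ny, nz = len(rho), len(rho[0]), len(rho[0][0])
    mean = sum(v for plane in rho for row in plane for v in row) / (nx * ny * nz)
    shape = (nx // box, ny // box, nz // box)
    rms = {}
    for i in range(shape[0]):
        for j in range(shape[1]):
            for k in range(shape[2]):
                sq = sum((rho[x][y][z] - mean) ** 2
                         for x in range(i * box, (i + 1) * box)
                         for y in range(j * box, (j + 1) * box)
                         for z in range(k * box, (k + 1) * box))
                rms[(i, j, k)] = sqrt(sq / box ** 3)
    return rms, shape


def local_rms_correlation(rho, box):
    """Correlation of box r.m.s. density between neighbouring boxes (periodic cell)."""
    rms, shape = box_rms(rho, box)
    here, there = [], []
    for (i, j, k), value in rms.items():
        for axis in range(3):
            neighbour = [i, j, k]
            neighbour[axis] = (neighbour[axis] + 1) % shape[axis]
            here.append(value)
            there.append(rms[tuple(neighbour)])
    return pearson(here, there)
```

A synthetic map with one strongly varying half and one flat half gives a high correlation with this function, whereas a map of uncorrelated random values gives a value near zero, which mirrors the behaviour described above.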

The four-point scoring scheme described above provides the foundation for automated structure solution. To make it practical, the conversion of MAD data to a pseudo-SIRAS form and the use of rapid origin-removed Patterson-based heavy-atom refinement have been nearly essential. The remainder of the *Solve* algorithm for automated structure solution is largely a standardized form of local scaling, an integrated set of routines to carry out all of the calculations required for heavy-atom searching, refinement and phasing, and routines to keep track of the lists of current solutions being examined and past solutions that have already been tested.

Scaling of data in the *Solve* algorithm is done by a local scaling procedure (Matthews & Czerwinski, 1975). Systematic errors are minimized by scaling $F^{+}$ and $F^{-}$, native and derivative, and wavelengths of MAD data in very similar ways and by keeping different data sets separate until the end of scaling. The scaling procedure is optimized for cases where the data are collected in a systematic fashion. For both MIR and MAD data, the overall procedure is to construct a reference data set that is as complete as possible and that contains information either from a native data set (for MIR) or for all wavelengths (for MAD data). This reference data set is constructed for just the asymmetric unit of data and is essentially the average of all measurements obtained for each reflection. The reference data set is then expanded to the entire reciprocal lattice and used as the basis for local scaling of each individual data set [see Terwilliger & Berendzen (1999*b*) for additional details].
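The idea of scaling each reflection against a reference using only its neighbours in reciprocal space can be sketched as follows. This is a rough illustration, not the procedure of Matthews & Czerwinski (1975) or of *Solve*: the cubic neighbourhood, the exclusion of the reflection being scaled, and the ratio-of-sums scale factor are all assumptions, and the names are hypothetical.

```python
def local_scale(ref, obs, half_width=1):
    """Scale each observed amplitude to the reference using nearby reflections.

    ref, obs   : dicts mapping (h, k, l) -> amplitude
    half_width : neighbours within +/- half_width in h, k and l are used (assumed)
    """
    scaled = {}
    offsets = range(-half_width, half_width + 1)
    for (h, k, l), f_obs in obs.items():
        sum_ref = sum_obs = 0.0
        for dh in offsets:
            for dk in offsets:
                for dl in offsets:
                    key = (h + dh, k + dk, l + dl)
                    if key == (h, k, l) or key not in ref or key not in obs:
                        continue  # skip the reflection itself and missing data
                    sum_ref += ref[key]
                    sum_obs += obs[key]
        if sum_obs > 0:
            # Local scale factor: ratio of summed reference to summed observed amplitudes.
            scaled[(h, k, l)] = f_obs * sum_ref / sum_obs
    return scaled


# Made-up data: the observed set is uniformly half the reference.
ref = {(h, k, l): 10.0 + h + k + l for h in range(3) for k in range(3) for l in range(3)}
obs = {key: 0.5 * value for key, value in ref.items()}
scaled = local_scale(ref, obs)
print(scaled[(1, 1, 1)], ref[(1, 1, 1)])
```

Because the scale factor is derived only from nearby reflections, slowly varying systematic differences between data sets are absorbed locally rather than by a single overall scale.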

Once MIR data have been scaled, or MAD data have been scaled and converted to a pseudo-SIRAS form, difference Patterson functions are used to identify plausible one-site or two-site heavy-atom solutions. For MIR data, difference Patterson functions are calculated for each derivative. For MAD data, anomalous and dispersive differences are combined to yield a Bayesian estimate of the Patterson function for the anomalously scattering atoms (Terwilliger, 1994*a*). An automated search of the Patterson function is then used to find a large number (typically 30) of potential single-site and two-site solutions. In principle, Patterson methods could be used to solve the complete heavy-atom substructure, but the approach used in *Solve* is to find just the first one or two heavy-atom sites in this way and to find all others by difference Fourier analysis. This initial set of one-site and two-site solutions becomes the initial list of potential solutions ('seeds') for automated structure solution. Once each of the potential seeds is scored and ranked, the top seeds (typically five) are selected as independent starting points for the search for heavy-atom solutions.

For each starting solution (seed), the main cycle in the automated structure-solution algorithm used by *Solve* consists of two basic steps. The first is to refine heavy-atom parameters and rank all existing solutions generated so far from this seed based on the four criteria discussed above. The second is to take the highest-ranking solution that has not yet been exhaustively analysed and use it in an attempt to generate a more complete solution. Generation of new solutions is carried out in three ways: by deletion of sites, by addition of sites from difference Fouriers and by inversion. A partial solution is considered to have been exhaustively analysed when all single-site deletions have been considered, when no more peaks in a difference Fourier can be found that improve upon it, and when inversion does not improve it, or when the maximum number of sites input by the user has been reached. In each case, new solutions generated in these three ways are refined, scored and ranked, and the cycle is continued until all the top solutions have been fully analysed and no new solutions are found. Throughout this process, a tally of the solutions that have already been considered is kept, and any time a solution is a duplicate of a previously examined solution it is dropped.
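The cycle of ranking, extending and de-duplicating solutions can be sketched as a best-first search. The sketch below is a simplified illustration rather than the *Solve* algorithm: a solution is represented as a set of site labels plus a hand, `score` and `propose_sites` are hypothetical callables standing in for the four-criterion scoring and for difference-Fourier peak picking, heavy-atom refinement is omitted, and only candidates that improve on their parent are kept.

```python
def search_from_seed(seed, score, propose_sites, max_sites):
    """Best-first search over heavy-atom models grown from one seed.

    seed          : (frozenset of sites, hand), with hand equal to +1 or -1
    score         : callable(solution) -> float, higher is better (hypothetical)
    propose_sites : callable(solution) -> iterable of candidate sites (hypothetical)
    max_sites     : maximum number of sites allowed in a solution
    """
    seen = {seed}                  # tally of every solution considered so far
    scores = {seed: score(seed)}   # solutions kept for ranking
    analysed = set()               # solutions that have been exhaustively extended
    while True:
        pending = [sol for sol in scores if sol not in analysed]
        if not pending:
            break
        best = max(pending, key=scores.get)   # highest-ranking unanalysed solution
        analysed.add(best)
        sites, hand = best
        candidates = [(sites, -hand)]                               # inversion
        if len(sites) > 1:
            candidates += [(sites - {s}, hand) for s in sites]      # single-site deletions
        if len(sites) < max_sites:
            candidates += [(sites | {s}, hand)
                           for s in propose_sites(best) if s not in sites]  # additions
        for candidate in candidates:
            if candidate in seen:
                continue                      # duplicate of an earlier solution
            seen.add(candidate)
            value = score(candidate)
            if value > scores[best]:          # simplification: keep improvements only
                scores[candidate] = value
    return max(scores, key=scores.get)
```

With a toy scoring function that rewards correct sites and the correct hand, the search climbs from a partly wrong two-site seed to the full solution, and the `seen` set guarantees that no model is scored twice.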

In some cases, one very clear solution appears early in the structure-solution process, while in others, there are several solutions that have similar scores at early (and sometimes even late) stages of structure solution. In cases where no one solution is much better than the others, all the seeds are exhaustively analysed. On the other hand, if a very promising solution emerges from one seed, then the search is narrowed to focus on that seed, deletions are not carried out until the end of the analysis, and many peaks from the difference Fourier analysis are added at a time so as to build up the solution as quickly as possible. Once the expected number of heavy-atom sites has been found, each site is deleted in turn to see if the solution can be further improved. If this occurs, then the new solutions are analysed in the same way by addition and deletion of sites and by inversion until no improvement is obtained.

At the conclusion of the *Solve* algorithm, an electron-density map and phases for the top solution are reported in a form that is compatible with the *CCP4* suite (Collaborative Computational Project, Number 4, 1994). Additionally, command files that can be modified to look for additional heavy-atom sites or to construct other electron-density maps are produced. If more than one possible solution is found, the heavy-atom sites and phasing statistics for all of them are reported.

An important feature of *Solve* is the inclusion of modules for the generation of model data. *Solve* can construct model raw X-ray data for either MIR or MAD cases. The macromolecular structure can be defined by a file in PDB format (Bernstein *et al.*, 1977) with heavy-atom parameters defined by the user. Any degree of 'experimental' uncertainty in measurement of intensities can be included, and limited non-isomorphism for MIR data in which cell dimensions differ for native and any of the derivative data sets (but in which the macromolecular structure is identical) can be included. This automatic generation of model data is very useful in evaluating what can and what cannot be solved. Once a data set has been generated, the *Solve* algorithm can be used to attempt to solve it. *Solve* generates a model electron-density map based on the input coordinates, and during the structure-solution process all maps calculated with trial solutions can be compared to the model map. In many cases, heavy-atom solutions can be related to different origins (and to different handedness as well). The origin shift is identified by *Solve* by finding the shift that best maps the trial solution onto the (known) correct solution.

The *Solve* algorithm is very useful for solving macromolecular structures by the MIR and MAD methods. It has been used to solve MAD structures with as many as 56 selenium atoms in the asymmetric unit (W. Smith & C. Janson, personal communication). From the user's point of view, the algorithm is very simple. Only a few input parameters are needed in most cases, and the *Solve* algorithm carries out the entire process automatically. In principle, the procedure can be very thorough as well, so that many trial starting solutions can be examined and difficult heavy-atom structures can be found. Additionally, for the most difficult structure-solution cases, the failure to find a solution can be useful in confirming that additional information is needed.

The *Solve* software and complete documentation can be obtained from the web site http://solve.lanl.gov/.

### Acknowledgements

TCT and JB gratefully acknowledge support from the National Institutes of Health and the US Department of Energy.

### References

Abrahams, J. P., Leslie, A. G. W., Lutter, R. & Walker, J. E. (1994). *Structure at 2.8 Å resolution of F1-ATPase from bovine heart mitochondria.* *Nature (London)*, **370**, 621–628.

Baker, D., Krukowski, A. E. & Agard, D. A. (1993). *Uniqueness and the ab initio phase problem in macromolecular crystallography.* *Acta Cryst.* D**49**, 186–192.

Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). *Protein data bank: computer-based archival file for macromolecular structures.* *J. Mol. Biol.* **112**, 535–542.

Blundell, T. L. & Johnson, L. N. (1976). *Protein Crystallography*, p. 368. New York: Academic Press.

Chang, G. & Lewis, M. (1994). *Using genetic algorithms for solving heavy-atom sites.* *Acta Cryst.* D**50**, 667–674.

Collaborative Computational Project, Number 4 (1994). *The CCP4 suite: programs for protein crystallography.* *Acta Cryst.* D**50**, 760–763.

Dickerson, R. E., Kendrew, J. C. & Strandberg, B. E. (1961). *The crystal structure of myoglobin: phase determination to a resolution of 2 Å by the method of isomorphous replacement.* *Acta Cryst.* **14**, 1188–1195.

Goldstein, A. & Zhang, K. Y. J. (1998). *The two-dimensional histogram as a constraint for protein phase improvement.* *Acta Cryst.* D**54**, 1230–1244.

La Fortelle, E. de & Bricogne, G. (1997). *Maximum-likelihood heavy-atom parameter refinement for multiple isomorphous replacement and multiwavelength anomalous diffraction methods.* *Methods Enzymol.* **276**, 472–494.

Matthews, B. W. & Czerwinski, E. W. (1975). *Local scaling: a method to reduce systematic errors in isomorphous replacement and anomalous scattering measurements.* *Acta Cryst.* A**31**, 480–487.

Miller, R., Gallo, S. M., Khalak, H. G. & Weeks, C. M. (1994). *SnB: crystal structure determination via shake-and-bake.* *J. Appl. Cryst.* **27**, 613–621.

Podjarny, A. D., Bhat, T. N. & Zwick, M. (1987). *Improving crystallographic macromolecular images: the real-space approach.* *Annu. Rev. Biophys. Biophys. Chem.* **16**, 351–373.

Sheldrick, G. M. (1990). *Phase annealing in SHELX-90: direct methods for larger structures.* *Acta Cryst.* A**46**, 467–473.

Terwilliger, T. C. (1994*a*). *MAD phasing: Bayesian estimates of $F_A$.* *Acta Cryst.* D**50**, 11–16.

Terwilliger, T. C. (1994*b*). *MAD phasing: treatment of dispersive differences as isomorphous replacement information.* *Acta Cryst.* D**50**, 17–23.

Terwilliger, T. C. & Berendzen, J. (1996). *Correlated phasing of multiple isomorphous replacement data.* *Acta Cryst.* D**52**, 749–757.

Terwilliger, T. C. & Berendzen, J. (1997). *Bayesian correlated MAD phasing.* *Acta Cryst.* D**53**, 571–579.

Terwilliger, T. C. & Berendzen, J. (1999*a*). *Discrimination of solvent from protein regions in native Fouriers as a means of evaluating heavy-atom solutions in the MIR and MAD methods.* *Acta Cryst.* D**55**, 501–505.

Terwilliger, T. C. & Berendzen, J. (1999*b*). *Automated MIR and MAD structure solution.* *Acta Cryst.* D**55**, 849–861.

Terwilliger, T. C. & Berendzen, J. (1999*c*). *Evaluation of macromolecular electron-density map quality using the correlation of local r.m.s. density.* *Acta Cryst.* D**55**, 1872–1877.

Terwilliger, T. C. & Eisenberg, D. (1983). *Unbiased three-dimensional refinement of heavy-atom parameters by correlation of origin-removed Patterson functions.* *Acta Cryst.* A**39**, 813–817.

Terwilliger, T. C. & Eisenberg, D. (1987). *Isomorphous replacement: effects of errors on the phase probability distribution.* *Acta Cryst.* A**43**, 6–13.

Terwilliger, T. C., Kim, S.-H. & Eisenberg, D. (1987). *Generalized method of determining heavy-atom positions using the difference Patterson function.* *Acta Cryst.* A**43**, 1–5.

Vagin, A. & Teplyakov, A. (1998). *A translation-function approach for heavy-atom location in macromolecular crystallography.* *Acta Cryst.* D**54**, 400–402.

Wang, B.-C. (1985). *Resolution of phase ambiguity in macromolecular crystallography.* *Methods Enzymol.* **115**, 90–112.

Xiang, S., Carter, C. W. Jr, Bricogne, G. & Gilmore, C. J. (1993). *Entropy maximization constrained by solvent flatness: a new method for macromolecular phase extension and map improvement.* *Acta Cryst.* D**49**, 193–212.

Zhang, K. Y. J. & Main, P. (1990). *The use of Sayre's equation with solvent flattening and histogram matching for phase extension and refinement of protein structures.* *Acta Cryst.* A**46**, 377–381.