International Tables for Crystallography, Volume F: Crystallography of biological macromolecules
Edited by M. G. Rossmann and E. Arnold

International Tables for Crystallography (2006). Vol. F, ch. 25.2, pp. 695-743
https://doi.org/10.1107/97809553602060000724

Chapter 25.2. Programs and program systems in wide use

W. Furey,a K. D. Cowtan,b K. Y. J. Zhang,c P. Main,d A. T. Brunger,v P. D. Adams,e W. L. DeLano,f P. Gros,g R. W. Grosse-Kunstleve,e J.-S. Jiang,h N. S. Pannu,i R. J. Read,j L. M. Rice,k T. Simonson,l D. E. Tronrud,m L. F. Ten Eyck,y V. S. Lamzin,n A. Perrakis,o K. S. Wilson,p R. A. Laskowski,w M. W. MacArthur,q J. M. Thornton,x P. J. Kraulis,r D. C. Richardson,s J. S. Richardson,s W. Kabscht and G. M. Sheldricku

aBiocrystallography Laboratory, VA Medical Center, PO Box 12055, University Drive C, Pittsburgh, PA 15240, USA, and Department of Pharmacology, University of Pittsburgh School of Medicine, 1340 BSTWR, Pittsburgh, PA 15261, USA,bDepartment of Chemistry, University of York, York YO1 5DD, England,cDivision of Basic Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N., Seattle, WA 90109, USA,dDepartment of Physics, University of York, York YO1 5DD, England,eThe Howard Hughes Medical Institute and Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06511, USA,fGraduate Group in Biophysics, Box 0448, University of California, San Francisco, CA 94143, USA,gCrystal and Structural Chemistry, Bijvoet Center for Biomolecular Research, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands,hProtein Data Bank, Biology Department, Brookhaven National Laboratory, Upton, NY 11973-5000, USA,iDepartment of Mathematical Sciences, University of Alberta, Edmonton, Alberta, Canada T6G 2G1,jDepartment of Haematology, University of Cambridge, Wellcome Trust Centre for Molecular Mechanisms in Disease, CIMR, Wellcome Trust/MRC Building, Hills Road, Cambridge CB2 2XY, England,kDepartment of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT 06511, USA,lLaboratoire de Biologie Structurale (CNRS), IGBMC, 1 rue Laurent Fries, 67404 Illkirch (CU de Strasbourg), France,mHoward Hughes Medical Institute, Institute of Molecular Biology, 1229 University of Oregon, Eugene, OR 97403-1229, USA,nEuropean Molecular Biology Laboratory (EMBL), Hamburg Outstation, c/o DESY, Notkestr. 
85, 22603 Hamburg, Germany,oEuropean Molecular Biology Laboratory (EMBL), Grenoble Outstation, c/o ILL, Avenue des Martyrs, BP 156, 38042 Grenoble CEDEX 9, France,pStructural Biology Laboratory, Department of Chemistry, University of York, Heslington, York YO10 5DD, England,qBiochemistry and Molecular Biology Department, University College London, Gower Street, London WC1E 6BT, England,rStockholm Bioinformatics Center, Department of Biochemistry, Stockholm University, SE-106 91 Stockholm, Sweden,sDepartment of Biochemistry, Duke University Medical Center, Durham, NC 27710-3711, USA,tMax-Planck-Institut für medizinische Forschung, Abteilung Biophysik, Jahnstrasse 29, 69120 Heidelberg, Germany,uLehrstuhl für Strukturchemie, Universität Göttingen, Tammannstrasse 4, D-37077 Göttingen, Germany,vHoward Hughes Medical Institute, and Departments of Molecular and Cellular Physiology, Neurology and Neurological Sciences, and Stanford Synchrotron Radiation Laboratory (SSRL), Stanford University, 1201 Welch Road, MSLS P210, Stanford, CA 94305, USA,wDepartment of Crystallography, Birkbeck College, University of London, Malet Street, London WC1E 7HX, England,xBiochemistry and Molecular Biology Department, University College London, Gower Street, London WC1E 6BT, England, and Department of Crystallography, Birkbeck College, University of London, Malet Street, London WC1E 7HX, England, and ySan Diego Supercomputer Center 0505, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0505, USA

Macromolecular programs and program systems in wide use are described. The chapter covers PHASES; DM/DMMULTI, software for phase improvement by density modification; the structure-determination language of the Crystallography & NMR System; the TNT refinement package; the ARP/wARP suite for automated construction and refinement of protein models; validation of protein-structure coordinates with PROCHECK; MolScript; MAGE, PROBE and kinemages; XDS; and macromolecular applications of SHELX.

25.2.1. PHASES

W. Fureya*

The program package PHASES came into being in the mid-to-late 1980s when it evolved largely from a series of independent computer programs written during the preceding decade for use in the Veterans Administration Medical Center's Biocrystallography Laboratory in Pittsburgh, PA, USA. The predecessor programs each carried out a particular task required in the processing, phasing and analysis of diffraction data from macromolecules, but these programs were usually computer, space-group and sometimes even protein specific. In addition, the programs were often poorly documented, if at all, and made use of incompatible data formats such that a battery of `conversion' programs were required for transmitting information. While this was pretty much the situation in most laboratories at the time, it nevertheless unnecessarily complicated protein structure determination, particularly by graduate students and new postdoctoral workers. To overcome these problems, the original programs were rewritten (frequently combining several programs into one), generalized for all symmetries, modified to use a simple, standardized format and extensively documented. As methodologies developed, new programs and procedures were added, graphics programs were included, and the resulting package was optimized for use on interactive graphics workstations, which were then becoming the main computing resource in most laboratories. The first `official' PHASES release was described at an American Crystallographic Association meeting (Furey & Swaminathan, 1990[link]), although versions of the package had been in local use within the Pittsburgh laboratories for the preceding four years. There have been several major releases since the first as new features and strategies were incorporated, and the package as it existed in 1996 was extensively described in a Methods in Enzymology article (Furey & Swaminathan, 1997[link]).

25.2.1.1. Overall scope of the package


The PHASES package was designed to deal with the major problem in macromolecular structure determination, i.e., phasing the diffraction data. The package is not completely comprehensive, as it excludes programs for initial data reduction (production of a unique set of structure-factor amplitudes from the raw measurements), molecular-replacement calculations (rotation and translation functions), model building (interactive graphic fitting) and complete structure refinement (restrained least squares, conjugate gradient, molecular dynamics etc.). There are already excellent programs and packages available to accomplish these tasks; instead, the PHASES package focuses on the initial phasing of diffraction data from macromolecules by heavy-atom- and anomalous-scattering-based methods. Also included are programs and procedures for phase improvement by noncrystallographic symmetry averaging, solvent flattening, phase extension and partial structure phase combination. The programs and additional procedure scripts allow one to start with unique structure-factor amplitudes for native and/or derivative data sets and generate electron-density maps and skeletons that can be utilized in popular graphics programs for chain tracing and model building. The major methods incorporated in the package are listed below and will be described in more detail later.

25.2.1.1.1. Isomorphous replacement, anomalous scattering and MAD phasing


Heavy-atom-based phasing by the methods of isomorphous replacement (Green et al., 1954[link]) and/or anomalous scattering (Pepinsky & Okaya, 1956[link]) is initiated by reading one or more `scaled' files into the program PHASIT, along with estimates of the heavy-atom or anomalous-scatterer positional, occupancy and thermal parameters. Each input file can contain either isomorphous-replacement, derivative anomalous-scattering or native anomalous-scattering data. MAD (multiple-wavelength anomalous diffraction) data are treated as both isomorphous and anomalous-scattering data, in which case one simply inputs the scattering-factor differences (real part) appropriate for the wavelengths comprising the `isomorphous' data sets and the actual scattering factors (imaginary part) appropriate for the `anomalous' data sets. All possible combinations of isomorphous-replacement data, conventional anomalous-scattering data and MAD data are allowed and can be used simultaneously during phasing.

25.2.1.1.2. Solvent flattening and negative-density truncation


Solvent flattening and negative-density truncation are carried out following the strategy developed by Wang (1985)[link]; however, a reciprocal-space equivalent of the automated solvent-masking procedure is used (Furey & Swaminathan, 1997[link]; Leslie, 1987[link]). In addition, during solvent-mask construction all density near heavy-atom sites is automatically ignored, leading to more accurate masks. The complete process is fully automated and carries out three solvent-mask iterations with at least 16 solvent flattening and phase combination cycles. Optionally, an arbitrary number of additional cycles can be carried out for phase extension. A program is provided to interactively examine or edit the solvent mask or to create the mask by hand if desired.

25.2.1.1.3. Noncrystallographic symmetry averaging


Noncrystallographic (NC) symmetry averaging cases (Rossmann & Blow, 1963[link]; Bricogne, 1974[link]) are treated in direct space by operating on `submaps', i.e. arbitrary regions in an electron-density map encompassing all of the molecules to be averaged that are unique by true crystal symmetry. Many of the averaging programs were derived from routines originally written by W. Hendrickson & J. Smith and have been described earlier (Bolin et al., 1993[link]), but they were substantially rewritten for incorporation into the PHASES package. Programs are supplied to: generate and examine the required submaps; refine the NC symmetry operators; interactively create averaging envelope masks; average density within the envelopes; convert the submaps to full-cell maps; invert the modified maps; and combine the phases with those from another source. An automated procedure is provided to carry out a specified number of averaging and phase combination cycles in addition to solvent flattening and negative-density truncation. This procedure allows for a gradual phase extension, if desired, extending by one reciprocal-lattice point in each direction for a given number of cycles.

25.2.1.1.4. Partial structure phase combination and phase extension


Several programs are included to carry out partial structure phase combination with a variety of weighting options as an aid to structure completion. If density modification (solvent flattening or negative-density truncation and/or NC symmetry averaging) is performed, then phase (and amplitude) extension is also possible by manual or automated procedures.

25.2.1.2. Design principles


The PHASES package was designed to be user-friendly with many of the programs being interactive, so that the user is prompted for all information needed. Other programs that are often run repeatedly as part of an iterative procedure are designed to execute as batch processes and are generally run from within command procedures or shell scripts. With the exception of atomic coordinate records, all user-supplied data can be input in free format. Space-group-symmetry information is given by explicitly providing a set of equivalent positions, which has the advantage of allowing non-standard space-group settings. The individual programs in the package can be run `stand alone', but are often chained together through command procedures or shell scripts. Template scripts are provided for common iterative procedures, but the package design also allows program and option sequences to be combined in many ways, facilitating methodology development by advanced users.

25.2.1.2.1. General program structure and data flow


The current package includes 44 Fortran programs and one C subroutine, with the C subroutine used only to provide an interface between the Fortran programs and standard X-Window graphics-library routines. All programs communicate only through files with a simple common format. For the major programs, memory is allocated from a single large one-dimensional array which gets partitioned as required for each problem at run time. This greatly simplifies redimensioning if needed for very large problems, since at most only two lines of code need to be changed. All source code is provided, along with compilation procedures or shell scripts appropriate for most workstations, including Silicon Graphics, Sun, IBM RS/6000, ESV and DEC Alpha AXP (both OSF and OpenVMS). A flow chart illustrating the major programs and data flow for common phasing procedures is given in Fig. 25.2.1.1[link].


Figure 25.2.1.1

Flow chart for the major phasing path encompassing native and derivative scaling, heavy-atom-based phasing, solvent flattening, negative-density truncation, and phase combination. Boxed entries represent programs while lines represent files. Optional paths for noncrystallographic symmetry averaging and phase extension are included by considering the additional programs offset from the main path by dashed lines.

25.2.1.2.2. Parameter and cumulative log files


Vital data common to nearly all calculations, such as the cell dimensions, lattice type and space-group symmetry, are entered only once in a single `parameter file'. All interactive programs prompt for the name of this file and for batch programs it is to be supplied on the first input line. The parameter file can also optionally contain the name of a `running' log file. If used, the running log file is opened in `append' mode by each program in the package, and a copy of all screen or printed output is added to the file along with a time and date stamp indicating what program was run and when. This allows the user to maintain a complete history of all calculations and results on a given problem in a single, chronologically accurate file.

25.2.1.3. Merging and scaling native and derivative data


The programs CMBISO and CMBANO (both interactive) are used to combine unique native and derivative data sets into a single file and place the derivative data set on the scale of the native. All common reflections are identified, paired together, scaled and output to a single `scaled' file. With CMBISO, only mean structure-factor amplitudes are used for both native and derivative data, i.e. Bijvoet mates are deemed equivalent and averaged. CMBANO functions similarly, except that for the derivative data the individual Bijvoet mates are not averaged, and both values are output to the scaled file. The overall merging R factor is reported both on F and [F^{2}], along with tables indicating the R factor as a function of resolution, F magnitude and [|F/\sigma (F)|]. A table is also output indicating the mean value of [F_{PH} - F_{P}] as a function of resolution, where [F_{PH}] and [F_{P}] are the derivative and native structure-factor amplitudes, respectively. By default, scaling is initially carried out by the relative Wilson method (Wilson, 1949[link]), with other optional procedures as outlined below to follow if desired.

25.2.1.3.1. Relative Wilson scaling


With this method, the derivative scattering, on average, is made equal to the native scattering by plotting [-\ln \left({\langle F_{PH}^{2}\rangle \over \langle F_{P}^{2}\rangle}\right) \ versus \ \left\langle {\sin^{2} (\theta) \over \lambda^{2}}\right\rangle, \eqno(25.2.1.1)] with the averages taken in corresponding resolution shells. A least-squares fit of a straight line to the plot yields a slope equal to [2(B_{PH}-B_{P})] (twice the difference between overall isotropic temperature parameters for derivative and native data sets) and an intercept of ln [K^{2}]. From these values, the derivative data are put on the scale of the native by multiplying each derivative amplitude by [K \exp \left[(B_{PH} - B_{P}) {\sin^{2} (\theta) \over \lambda^{2}}\right]. \eqno(25.2.1.2)]
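For illustration, the relative Wilson fit can be sketched in a few lines of NumPy. This is not PHASES code; the function name, shell-binning scheme and tolerances are hypothetical, but the fitted slope and intercept correspond to equations (25.2.1.1) and (25.2.1.2).

```python
import numpy as np

def relative_wilson_scale(f_nat, f_der, s2, n_shells=20):
    """Relative Wilson scaling: fit -ln(<F_PH^2>/<F_P^2>) against
    <sin^2(theta)/lambda^2> in resolution shells.  The fitted slope is
    2*(B_PH - B_P) and the intercept is ln(K^2)."""
    edges = np.quantile(s2, np.linspace(0.0, 1.0, n_shells + 1))
    x, y = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (s2 >= lo) & (s2 <= hi)
        if m.sum() < 3:
            continue
        x.append(s2[m].mean())
        y.append(-np.log((f_der[m] ** 2).mean() / (f_nat[m] ** 2).mean()))
    slope, intercept = np.polyfit(x, y, 1)
    d_b = slope / 2.0                # B_PH - B_P
    k = np.exp(intercept / 2.0)      # K
    # equation (25.2.1.2): put the derivative on the scale of the native
    return k * np.exp(d_b * s2) * f_der, k, d_b
```

Given synthetic derivative data generated with known K and ΔB, the fit recovers both parameters and the rescaled amplitudes match the native set.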

25.2.1.3.2. Global anisotropic scaling


With this option, applied after relative Wilson scaling, the unique parameters of a symmetric 3 × 3 scaling tensor S are determined by two cycles of least-squares minimization of [\textstyle\sum\limits_{hkl}\displaystyle W_{hkl} (F_{P} - SF_{PH})^{2} \eqno(25.2.1.3)] with respect to S, where [W_{hkl}] is a weighting factor, [{S = S_{11} O_{x}^{2} + S_{22} O_{y}^{2} + S_{33} O_{z}^{2} + 2(S_{12} O_{x} O_{y} + S_{13} O_{x} O_{z} + S_{23} O_{y} O_{z})} \eqno(25.2.1.4)] and [O_{x}], [O_{y}], [O_{z}] are direction cosines of the reciprocal-lattice vector expressed in an orthogonal system. The derivative data are then placed on the scale of the native by multiplying each derivative amplitude by the appropriate S.
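Because S in equation (25.2.1.4) is linear in its six unique components, the fit can be written as a single linear least-squares solve. The sketch below (not PHASES code; names are hypothetical, unit weights are used, and the two-cycle weighted scheme is reduced to one solve) illustrates the idea.

```python
import numpy as np

def fit_aniso_scale(f_nat, f_der, dircos):
    """Fit the six unique components of a symmetric 3x3 scaling tensor by
    least squares on sum (F_P - S*F_PH)^2, with
    S = S11 Ox^2 + S22 Oy^2 + S33 Oz^2 + 2(S12 OxOy + S13 OxOz + S23 OyOz),
    where (Ox, Oy, Oz) are direction cosines of each reciprocal-lattice vector."""
    ox, oy, oz = dircos.T
    A = np.column_stack([ox**2, oy**2, oz**2,
                         2*ox*oy, 2*ox*oz, 2*oy*oz]) * f_der[:, None]
    s, *_ = np.linalg.lstsq(A, f_nat, rcond=None)
    return s  # (S11, S22, S33, S12, S13, S23)

def apply_aniso_scale(f_der, dircos, s):
    """Multiply each derivative amplitude by its reflection-dependent S."""
    ox, oy, oz = dircos.T
    S = (s[0]*ox**2 + s[1]*oy**2 + s[2]*oz**2
         + 2*(s[3]*ox*oy + s[4]*ox*oz + s[5]*oy*oz))
    return S * f_der
```

On noise-free data generated from a known tensor, the fit recovers that tensor exactly.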

25.2.1.3.3. Local scaling


With this option, again applied after relative Wilson scaling, a scale factor for each reflection is also determined by minimizing equation (25.2.1.3)[link] with respect to S, but here S is a scalar and the summation is taken only over neighbouring reflections within a sphere centred on the reflection being scaled. The sphere radius is initially set to include roughly 125 neighbours, and the scale factor is accepted if at least 80 are actually present. If insufficient neighbours are available, then the sphere size is increased incrementally and the process repeated until a preset maximum radius is encountered. If the maximum is reached, the process terminates with the message that the data set is too sparse for local scaling. Scaling is achieved by multiplying each derivative amplitude by the appropriate S.
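The growing-sphere logic can be sketched as follows. This is a brute-force illustration, not PHASES code: names and the growth factor are hypothetical, unit weights are used, and a spatial index (e.g. a KD-tree) would be preferable at realistic data sizes.

```python
import numpy as np

def local_scale(f_nat, f_der, recip_xyz, r0, min_pts, r_max):
    """Per-reflection scalar S minimizing sum (F_P - S*F_PH)^2 over
    neighbours within a sphere in reciprocal space.  The sphere grows
    until min_pts neighbours are found; if r_max is exceeded, the data
    set is declared too sparse for local scaling."""
    scales = np.empty(len(f_nat))
    for i, x in enumerate(recip_xyz):
        r = r0
        while True:
            m = np.linalg.norm(recip_xyz - x, axis=1) <= r
            if m.sum() >= min_pts:
                break
            r *= 1.25
            if r > r_max:
                raise ValueError('data set too sparse for local scaling')
        # closed-form least squares for a scalar scale factor
        scales[i] = (f_nat[m] * f_der[m]).sum() / (f_der[m] ** 2).sum()
    return scales * f_der, scales
```

When the true scale is a single constant, every local fit returns that constant, which is a useful sanity check.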

25.2.1.3.4. Outlier rejection


Rejection of outliers is often desirable, as erroneously large isomorphous or anomalous differences can lead to streaks in difference-Patterson maps and complicate identification of heavy-atom or anomalous-scatterer sites. The interactive program TOPDEL facilitates identification and rejection of such outliers while selecting reflections for use in difference-Patterson calculations. An input `scaled' file is read in, and user-supplied resolution and [F/\sigma(F)] cutoffs are applied. The data are then sorted in descending order of magnitude of ΔF (either isomorphous or anomalous differences) and the largest differences are listed for examination. The user is then prompted to determine which, if any, of the large differences are to be rejected as outliers and to determine what percentage of the remaining largest differences are to be used in the Patterson-map synthesis. The appropriate Fourier-coefficient file is then created.

25.2.1.4. Fourier-map calculations


All Fourier maps, including native- and difference-Patterson maps, are computed by the program FSFOUR, which runs in batch mode and is a space-group-general variable-radix 3D fast Fourier transform program. Unique reflections are expanded to a hemisphere, and the calculation then proceeds in P1. The output map always spans one full unit cell.
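In P1 the synthesis reduces to a single FFT once Friedel mates have been filled in. The following is a minimal NumPy illustration of that idea, not FSFOUR's algorithm: names are hypothetical, the unit-cell volume factor and symmetry expansion are omitted, and a unique reflection list without F(000) is assumed.

```python
import numpy as np

def p1_map(h, k, l, amp, phase_deg, grid=(32, 32, 32)):
    """Fourier synthesis in P1: rho(x) = sum_h F(h) exp(-2 pi i h.x),
    evaluated on a grid by filling F(h) and the Friedel mate F*(-h)
    into a 3D array so that the resulting map is real."""
    G = np.zeros(grid, dtype=complex)
    F = np.asarray(amp, float) * np.exp(1j * np.deg2rad(phase_deg))
    for hh, kk, ll, f in zip(h, k, l, F):
        G[hh % grid[0], kk % grid[1], ll % grid[2]] += f
        G[-hh % grid[0], -kk % grid[1], -ll % grid[2]] += np.conj(f)
    # np.fft.fftn applies the exp(-2 pi i h.x) kernel required here
    return np.fft.fftn(G).real
```

A single reflection (1, 0, 0) with unit amplitude and zero phase produces the expected cosine wave along x.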

25.2.1.4.1. Submaps


Selected regions of an electron-density map that are useful for NC symmetry applications can be extracted from the full-cell maps produced by FSFOUR with the programs EXTRMAP (batch) or MAPVIEW (interactive). The `submap' regions can cover any arbitrary volume and cross multiple cell edges if desired.

25.2.1.4.2. Orthogonal and skewed maps


Programs MAPORTH and SKEW (both run in batch mode) are provided to modify submaps, as modification is sometimes useful or required with NC symmetry applications. MAPORTH simply converts the map to correspond to an orthogonal grid, which simplifies refinement of NC symmetry operators. SKEW also converts the map to an orthogonal grid, but changes the axis directions such that the new b axis can be arbitrarily oriented. This is useful in NC symmetry applications where one may want to examine maps looking directly down the NC symmetry rotation axis. Both programs compute density values at the new grid points by using a 64-point cubic spline interpolation and can also orthogonalize or skew masks to maintain correspondence with the modified submaps.

25.2.1.4.3. Graphics maps and skeletonization


Program GMAP (interactive) is used to extract any region from an FSFOUR map, possibly crossing multiple cell edges, and convert it to a form directly readable by the external interactive graphics programs TOM [SGI version of FRODO (Jones, 1978[link])], O (Jones et al., 1991[link]) or CHAIN (Sack, 1988[link]). In addition to the output map file, one may also output a corresponding skeleton (Greer, 1974[link]) file (for TOM) or skeleton data block (for O) to facilitate chain tracing.

25.2.1.4.4. Peak search


Program PSRCH (batch) lists the largest peaks in a Fourier map and is useful in identifying additional heavy-atom or anomalous-scatterer sites from a map phased by a tentative model. Either positive or negative peaks can be listed, with the latter sometimes useful in MAD phasing applications, depending on the assignment of `native' and `derivative' data sets. Only unique peaks are listed, and the peak positions are interpolated from the map.
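The essentials of a peak search on a periodic map can be sketched as below. This is not PSRCH itself: names are hypothetical, the search is restricted to P1 (no uniqueness reduction under symmetry), and the position refinement uses a three-point parabola per axis as a simpler stand-in for the interpolation the program performs.

```python
import numpy as np

def peak_search(rho, n_peaks=10):
    """List the strongest local maxima of a periodic 3D map, with each
    position refined along each axis by a parabola through the peak and
    its two neighbours; positions are returned as fractional coordinates."""
    shape = rho.shape
    neighbours = [np.roll(rho, (i, j, k), axis=(0, 1, 2))
                  for i in (-1, 0, 1) for j in (-1, 0, 1) for k in (-1, 0, 1)
                  if (i, j, k) != (0, 0, 0)]
    is_max = np.all([rho > nb for nb in neighbours], axis=0)
    peaks = []
    for idx in np.argwhere(is_max):
        pos = []
        for ax in range(3):
            step = np.zeros(3, dtype=int)
            step[ax] = 1
            lo = rho[tuple((idx - step) % shape)]
            hi = rho[tuple((idx + step) % shape)]
            c = rho[tuple(idx)]
            d = 0.5 * (lo - hi) / (lo - 2.0 * c + hi)  # parabola vertex offset
            pos.append((idx[ax] + d) / shape[ax])
        peaks.append((rho[tuple(idx)], pos))
    peaks.sort(key=lambda p: -p[0])
    return peaks[:n_peaks]
```

A map built as a sum of three cosines has a single local maximum at a known fractional position, which the search locates exactly.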

25.2.1.5. Structure-factor and phase calculations


Several methods are used for structure-factor and phasing calculations depending on the nature of the model and how the results will be used. The methods available in the package are described below.

25.2.1.5.1. By heavy-atom or anomalous-scattering methods


Phasing by heavy-atom-based methods (isomorphous replacement and/or anomalous scattering) begins when one or more `scaled' data sets are input to the program PHASIT (batch). User-specified rejection criteria are first applied to each data set, and structure factors corresponding to the heavy-atom or anomalous-scatterer substructure are computed from [F_{hkl} = Sc \textstyle\sum\limits_{j}\displaystyle O_{j}\ f_{j}\exp \left\{- B_{j} \left[\sin^{2} (\theta)/\lambda^{2}\right]\right\} \exp\left[2 \pi i (hx_{j} + ky_{j} + lz_{j})\right], \eqno(25.2.1.5)] where [O_{j}] is the occupancy, [f_{j}] is the (possibly complex) scattering factor, [B_{j}] is the isotropic temperature parameter and [x_{j}], [y_{j}], [z_{j}] are the fractional coordinates of the jth atom. Sc is a scale factor relating the calculated structure factor (absolute scale) to the scale of the observed data. The summation is taken over all heavy atoms or anomalous scatterers in the unit cell. Alternatively, anisotropic temperature parameters can be used for each atom if desired. A subset of reflections comprising all centric data (plus the largest 25% of the isomorphous or anomalous differences if there are insufficient centric data) is selected and used to estimate Sc by a least-squares fit to the observed differences. Initial estimates of the `standard error' E (expected lack of closure) are determined from this subset as a function of F magnitude, treating centric and acentric data separately. 
SIR (single isomorphous replacement) or SAS (single-wavelength anomalous scattering) phase probability distributions are given by [P(\varphi) = k\exp [-e (\varphi)^{2}/2E^{2}], \eqno(25.2.1.6)] where the lack of closure is defined by [e(\varphi) = F^{2}_{PH_{\rm (obs)}} - F^{2}_{PH_{\rm (calc)}} (\varphi) \eqno(25.2.1.7)] for isomorphous-replacement data and [{e(\varphi) = [(F_{PH}^{+})^{2} - (F_{PH}^{-})^{2}]_{\rm obs} - \{[F_{PH}^{+} (\varphi)]^{2} - [F_{PH}^{-}(\varphi)]^{2}\}_{\rm calc}} \eqno(25.2.1.8)] for anomalous-scattering data, with the [+] and − superscripts denoting members of a Bijvoet pair, and [F_{PH_{\rm (calc)}}^{2}(\varphi) = F_{P}^{2} + F_{H}^{2} + 2F_{P}F_{H} \cos(\varphi - \varphi_{H}), \eqno(25.2.1.9)] with ϕ denoting the protein phase, and [F_{H}] and [\varphi_{H}] denoting the heavy-atom structure-factor amplitude and phase, respectively. The distributions, however, are cast in the A, B, C, D form (Hendrickson & Lattman, 1970[link]). After all input data sets are processed in this manner, the individual phase probability distributions for common reflections are combined via [P(\varphi)_{\rm comb} = k \exp [\cos (\varphi) \textstyle\sum\limits_{j} A_{j}\displaystyle + \sin (\varphi) \textstyle\sum\limits_{j}\displaystyle B_{j} + \cos (2\varphi) \textstyle\sum\limits_{j}\displaystyle C_{j} + \sin (2\varphi) \textstyle\sum\limits_{j}\displaystyle D_{j}], \eqno(25.2.1.10)] with k as a normalization constant and the sums taken over all contributing data sets. The resulting combined distributions are then integrated to yield a centroid phase and figure of merit for each reflection. The standard error estimates, E, as a function of structure-factor magnitude are then updated for each data set, this time using all reflections and a probability-weighted average over all possible phase values for the contribution from each reflection (Terwilliger & Eisenberg, 1987a[link],b[link]). 
With these updated standard error estimates, the individual SIR and/or SAS phase probability distributions are recomputed for all reflections and combined again to yield an improved centroid phase and figure of merit for each reflection. The resulting phases, figures of merit and probability distribution information are then available for use in map calculations or for further parameter or phase refinement. This method is used to produce MIR (multiple isomorphous replacement), SIRAS (single isomorphous replacement with anomalous scattering), MIRAS (multiple isomorphous replacement with anomalous scattering) and MAD phases as well as other possible phase combinations.
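The core of this scheme can be illustrated numerically for a single reflection. The sketch below (not PHASES code; names, the phase grid and E values are hypothetical) evaluates the SIR distribution of equations (25.2.1.6), (25.2.1.7) and (25.2.1.9) on a 5° phase grid, combines two distributions by multiplication (equivalent to summing their Hendrickson-Lattman coefficients) and integrates for the centroid phase and figure of merit.

```python
import numpy as np

PHI = np.deg2rad(np.arange(0.0, 360.0, 5.0))   # protein phase grid

def sir_distribution(f_ph_obs, f_p, f_h, phi_h, E):
    """SIR phase probability: Gaussian in the lack of closure
    e = F_PH(obs)^2 - F_PH(calc)^2, with
    F_PH(calc)^2 = F_P^2 + F_H^2 + 2 F_P F_H cos(phi - phi_H)."""
    fph2_calc = f_p**2 + f_h**2 + 2.0 * f_p * f_h * np.cos(PHI - phi_h)
    e = f_ph_obs**2 - fph2_calc
    p = np.exp(-e**2 / (2.0 * E**2))
    return p / p.sum()

def combine_and_integrate(dists):
    """Multiply independent distributions and integrate: the centroid of
    P(phi) on the unit circle gives the phase and the figure of merit."""
    p = np.prod(dists, axis=0)
    p = p / p.sum()
    c = (p * np.exp(1j * PHI)).sum()
    return np.angle(c), np.abs(c)   # centroid phase (rad), figure of merit
```

A single SIR distribution is bimodal (the classic phase ambiguity); adding a second derivative with a different heavy-atom phase collapses it to one peak with a much higher figure of merit.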

25.2.1.5.2. Directly from atomic coordinates


Structure-factor amplitudes and phases for a macromolecular structure can be computed directly from atomic coordinates corresponding to a tentative model with the programs PHASIT and GREF (both run as batch processes). This allows one to obtain structure-factor information from an input model typically derived from a partial chain trace or from a molecular-replacement solution. Equation (25.2.1.5)[link] is used, but this time the sum is taken over all known atoms in the cell, and the scale factor is refined by least squares against the native amplitudes rather than against the magnitudes of isomorphous or anomalous differences. The computed structure factors may be used directly for map calculations, including `omit' maps, or for combination with other sources of phase information. One can output probability distribution information for the calculated phases, if desired, as well as coefficients for various Fourier syntheses, including those using [\sigma_{A}] weighting (Read, 1986[link]) for the generation of reduced-bias native or difference maps.
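The direct summation of equation (25.2.1.5) is compact enough to sketch for a single reflection. In this illustration (not PHASES code; names are hypothetical and the overall scale factor Sc is omitted) isotropic temperature factors and real scattering factors are assumed.

```python
import numpy as np

def structure_factor(hkl, xyz, occ, f0, b_iso, s2):
    """Direct summation over all atoms in the cell for one reflection:
    F = sum_j O_j f_j exp(-B_j sin^2(theta)/lambda^2) exp(2 pi i h.x_j),
    with xyz fractional coordinates and s2 = sin^2(theta)/lambda^2."""
    h = np.asarray(hkl, float)
    phase = np.exp(2j * np.pi * (np.asarray(xyz) @ h))
    return np.sum(occ * f0 * np.exp(-b_iso * s2) * phase)
```

Two equal atoms at x = 0 and x = 1/2 interfere destructively for h = 1 and constructively for h = 2, a standard hand-checkable case.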

25.2.1.5.3. By map inversion


For the purpose of improving phases by density-modification methods, such as solvent flattening, negative-density truncation and/or NC symmetry averaging, one must compute structure factors by Fourier inversion of an electron-density map rather than from atomic coordinates. The program MAPINV (batch) is a companion program to FSFOUR and carries out this inverse Fourier transform. It accepts a full-cell map in FSFOUR format and inverts it to produce amplitudes and phases for a selected set of reflections when given the target range of Miller indices. A variable-radix 3D fast Fourier transform algorithm is used. Optionally, the program can modify the density prior to inversion by truncation below a cutoff and/or by squaring the density values. Other types of density modification are handled by different programs in the package and are carried out prior to running MAPINV. The indices, calculated amplitude and phase are written to a file for each target reflection.
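In P1, map inversion can be sketched with a single inverse FFT. This is an illustration of the principle rather than MAPINV itself: names are hypothetical, the density-squaring option is omitted, and truncation is reduced to a simple floor.

```python
import numpy as np

def invert_map(rho, hkl_list, trunc=None):
    """Invert a full-cell map to amplitudes and phases for selected hkl.
    np.fft.ifftn computes (1/N) sum_x rho(x) exp(+2 pi i h.x), i.e. the
    inverse of a synthesis built with the exp(-2 pi i h.x) kernel.
    Optionally, density below `trunc` is raised to that floor first."""
    if trunc is not None:
        rho = np.where(rho < trunc, trunc, rho)
    F = np.fft.ifftn(rho)
    out = []
    for h, k, l in hkl_list:
        f = F[h % rho.shape[0], k % rho.shape[1], l % rho.shape[2]]
        out.append((h, k, l, abs(f), np.degrees(np.angle(f))))
    return out
```

Inverting a pure cosine wave recovers unit amplitude and zero phase for the (1, 0, 0) reflection, confirming the sign convention.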

25.2.1.6. Parameter refinement


Several methods are provided for refinement of heavy-atom or anomalous-scatterer parameters and scaling parameters, depending on the desired function to be minimized. In all cases, the structure factor [F_{H}] corresponding to the heavy atom or anomalous scatterer is given by equation (25.2.1.5)[link]. The options available are briefly described below.

25.2.1.6.1. Against amplitude differences


The simplest procedure is to refine against the magnitudes of isomorphous or anomalous structure-factor amplitude differences, which can be carried out with the program GREF (batch mode). In this case, one minimizes [\textstyle\sum\limits_{j} W_{j} \left(|F_{PH_{j}} - F_{P_{j}}| - F_{H_{j}}\right)^{2} \eqno(25.2.1.11)] for isomorphous-replacement data or [\textstyle\sum\limits_{j} W_{j} \left(|F_{PH_{j}}^{+} - F_{PH_{j}}^{-}| - 2F_{H_{j}}\right)^{2} \eqno(25.2.1.12)] for anomalous-scattering data with respect to the desired parameters contributing to [F_{H}], where [W_{j}] is a weighting factor. For anomalous-scattering data, only the imaginary component of the scattering factors is used during the [F_{H}] structure-factor calculation. For isomorphous-replacement data, the summation is taken only over centric reflections, plus the strongest 25% of differences for acentric reflections if insufficient centric data are present. For anomalous-scattering data, the summation is taken only over the strongest 25% of Bijvoet differences. An advantage of these methods is that only data from the derivative being refined are used (plus the native with isomorphous data), hence there is no possibility of feedback from other derivatives, which may not be truly independent. A disadvantage is that, apart from the centric reflections, the target value in the minimization is only an approximation to the true [F_{H}]. The accuracy of this approximation is improved by restricting the summations to the strongest differences.
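For the special case of a single overall occupancy with [F_{H}] proportional to it, the minimization of equation (25.2.1.11) has a closed form, which makes a compact illustration. This is not GREF code; the names are hypothetical and the selection of centric or strongest-difference reflections is assumed to have been done by the caller.

```python
import numpy as np

def fit_occupancy_from_differences(delta_f, f_h_unit, w=None):
    """Closed-form least squares for one overall occupancy when
    F_H = occ * f_h_unit: minimizing sum W (|dF| - occ*f_h_unit)^2
    gives occ = sum W |dF| f_h / sum W f_h^2."""
    w = np.ones_like(delta_f) if w is None else w
    return np.sum(w * np.abs(delta_f) * f_h_unit) / np.sum(w * f_h_unit**2)
```

When the observed differences are exactly proportional to the unit-occupancy heavy-atom amplitudes, the fitted occupancy equals the proportionality constant.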

25.2.1.6.2. By minimizing lack of closure


An alternative procedure available in the program PHASIT (batch) is to refine against the observed derivative amplitudes. In this case, one minimizes the `lack of closure' (now based on amplitudes instead of intensities) with respect to the desired parameters contributing to [F_{PH}], including the derivative-to-native scaling parameters. In all cases, the calculated derivative amplitudes [F_{PH_{\rm (calc)}}] are obtained from equation (25.2.1.9)[link]. To use this procedure, one must have an estimate of the protein phase ϕ. Several variations of this method, all available in PHASIT, are described below and are generally referred to as `phase refinement'.

25.2.1.6.2.1. `Classical' phase refinement


With this option, one minimizes [\textstyle\sum\limits_{j} W_{j} [F_{PH_{\rm (obs)}} - F_{PH_{\rm (calc)}} (\varphi)]^{2} \eqno(25.2.1.13)] for isomorphous-replacement data or [\textstyle\sum\limits_{j} W_{j} \left\{(F_{PH}^{+} - F_{PH}^{-})_{\rm obs} - [F_{PH}^{+} (\varphi) - F_{PH}^{-} (\varphi)]_{\rm calc}\right\}^{2} \eqno(25.2.1.14)] for anomalous-scattering data with respect to the desired parameters. Typically, the weights are taken as the reciprocal of the `standard error' (expected lack of closure) or its square. The summations are taken over all reflections for which the protein phase is thought to be reasonably valid, usually implied by a figure of merit of 0.4 or higher. The protein phase estimate usually comes from the centroid of the appropriate combined phase probability distribution given by equation (25.2.1.10)[link]; however, one has the option of including all data sets when combining the distributions, or including all except that for the derivative being refined. Once new heavy-atom and scaling parameters are obtained, new individual SIR or SAS phase probability distributions are computed and combined to provide new protein phases, and these phases are used to update the standard error estimates as described earlier. Then the individual distributions are recomputed once more using the new standard error estimates, and these distributions are combined again to give new protein phase estimates. The process is then iterated using the new phases and new heavy-atom parameters to start another round of refinement. After several iterations, the heavy-atom parameters, standard error estimates and protein phase estimates converge to their final values.

25.2.1.6.2.2. Approximate-likelihood method

This variation, also available in PHASIT, is similar to the classical phase refinement described above, except that instead of using only a single value for the protein phase ϕ during the calculation of [F_{PH}], all possible values are considered, with each contribution weighted by the corresponding protein phase probability (Otwinowski, 1991[link]). One minimizes [\textstyle\sum\limits_{j} W_{j} \textstyle\sum\limits_{i} P_{i} [F_{PH_{\rm (obs)}} - F_{PH_{\rm (calc)}} (\varphi_{i})]^{2} \eqno(25.2.1.15)] with respect to the desired parameters for isomorphous-replacement data, where [P_{i}] is the protein phase probability and the inner summation is over all allowed protein phase values, stepped in intervals of 5° (or 180° for centric reflections). For anomalous-scattering data, a similar modification is made to equation (25.2.1.14)[link]. The weights may be as in the classical phase refinement case or unity. Since each contribution is weighted by its phase probability regardless, there is no need to use a high figure-of-merit cutoff, as was done earlier. In fact, very good results are usually obtained using unit weights for [W_{j}] (that is, only the probability weighting) and a figure-of-merit cutoff of around 0.2 for inclusion of reflections in the summations. This variation has been found to increase stability in the refinement and works considerably better than conventional phase refinement when the phase probability distributions are strongly multimodal. Parameter refinement and phasing iterations proceed as described earlier. The combination of probability weighting during refinement with probability weighting during standard error estimation enables the key features of maximum-likelihood refinement to be carried out, although only approximately.
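The probability-weighted residual of equation (25.2.1.15) can be sketched for a single acentric reflection as follows. The 5° phase grid matches the text; the function name, the bimodal distribution and all numerical values are invented for illustration.

```python
import numpy as np

def approx_likelihood_residual(f_ph_obs, f_p, f_h, phi_h, phase_probs, w=1.0):
    """Probability-weighted lack of closure, eq. (25.2.1.15), one reflection.

    phase_probs holds the normalized P_i on a 5-degree grid of protein phases;
    every allowed phase contributes, weighted by its probability, so no
    figure-of-merit cutoff is needed.
    """
    phis = np.deg2rad(np.arange(0, 360, 5))      # allowed protein phase values
    f_calc = np.abs(f_p * np.exp(1j * phis) + f_h * np.exp(1j * phi_h))
    return float(w * np.sum(phase_probs * (f_ph_obs - f_calc) ** 2))

# A strongly bimodal (SIR-like) phase probability distribution
phis = np.deg2rad(np.arange(0, 360, 5))
p = (np.exp(5.0 * np.cos(phis - np.deg2rad(40.0))) +
     np.exp(5.0 * np.cos(phis - np.deg2rad(220.0))))
p /= p.sum()

r = approx_likelihood_residual(f_ph_obs=310.0, f_p=300.0, f_h=40.0,
                               phi_h=np.deg2rad(40.0), phase_probs=p)
print(round(r, 3))
```

Because both modes of the distribution contribute to the residual, the refinement is not misled by an arbitrary choice between them, which is the stated advantage over single-phase refinement for multimodal distributions.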

25.2.1.6.2.3. Using external phase information

When using either the conventional phase refinement or approximate-likelihood methods, protein phase estimates are required. In the former case, only a single value is used, whereas in the latter, information about all possibilities is provided by way of the phase probability distribution. Normally, this information comes from a prior phasing calculation; thus, the estimates are typically SIR, SAS, MIR etc. phases. However, in PHASIT, an option allows one to read in the protein phase information from an external source. This enables parameter refinement (by either conventional or approximate-likelihood methods) using protein phase estimates that are improvements over the initial ones. For example, one could get the best phases by one of the previously described methods, but then improve them by density-modification procedures, such as solvent flattening or negative-density truncation and/or NC symmetry averaging. Using these improved phases in the calculation of [F_{PH}] when refining should then lead to more accurate heavy-atom and scaling parameters, which in turn will produce still better protein phases. These new protein phases can either be treated as final and used to produce an electron-density map for interpretation, or be used to initiate another round of phase improvement by density modification. There are several cases where this type of refinement has been beneficial, and it is particularly useful for the refinement of derivative-to-native scaling parameters.

25.2.1.6.3. Rigid-group refinement

Although GREF can be used to refine individual heavy-atom or anomalous-scatterer parameters against isomorphous or anomalous structure-factor difference magnitudes, it is actually a group refinement program. Thus, all entities to be refined are treated as rigid bodies such that only group orientations, positions, scaling and temperature parameters can be refined. The groups, however, can be defined arbitrarily. For individual heavy-atom sites, they are simply defined as single atom `groups', and no orientation parameters are selected for refinement. This enables the program to serve two additional roles. In the case where the heavy-atom reagent is known to contain a rigid group, it can be properly treated. Also, if one chooses the target values to be native structure-factor amplitudes instead of difference magnitudes and inputs an entire protein molecule or domain, then conventional rigid-body or segmented rigid-body refinement can be carried out. The output consists of the refined parameters and a Fourier-coefficient file suitable for map or phase combination calculations.

25.2.1.7. Origin and hand correlation, and completing the heavy-atom substructure

Several programs are provided to enable the computation and analysis of various types of difference-Fourier maps as an aid to completing the heavy-atom structure by picking up additional sites. They are also used to correlate the origin and hand between derivatives and to determine the absolute configuration. During phasing calculations in PHASIT, files suitable for isomorphous or Bijvoet difference-Fourier calculations are automatically produced for each derivative or data set and can be used directly in program FSFOUR. The procedures used are described below.

25.2.1.7.1. Difference and cross-difference Fourier syntheses

The files produced by PHASIT for isomorphous data sets contain the information needed to produce the Fourier coefficients [[F_{H_{\rm (obs)}} - F_{H_{\rm (calc)}}] \exp [i\varphi_{H_{\rm (calc)}}], \eqno(25.2.1.16)] where [F_{H_{\rm (calc)}}] and [\varphi_{H_{\rm (calc)}}] are the calculated heavy-atom structure-factor amplitude and phase, respectively, and [F_{H_{\rm (obs)}}] is computed from [F_{H_{\rm (obs)}}^{2} = F_{PH}^{2} + F_{P}^{2} - 2F_{PH} F_{P} \cos (\varphi_{PH} - \varphi_{P}) \eqno(25.2.1.17)] where [\varphi_{PH}] and [\varphi_{P}] are the current derivative and native phases, respectively. These coefficients are more accurate than using simple isomorphous difference magnitudes to approximate [F_{H_{\rm (obs)}}] and can be computed once phasing has begun, since estimates of the required phase differences are then available. Alternatively, the program MRGDF (interactive) can be used to produce Fourier coefficients of the form [m(F_{PH} - F_{P}) \exp (i\varphi_{P}), \eqno(25.2.1.18)] where m is the current figure of merit. This method suffers somewhat as phase differences are ignored, but it has the advantage that the amplitude difference does not necessarily involve any derivative previously used in the computation of [\varphi_{P}]. If amplitudes from a new derivative and from the native are used, then peaks in the resulting `cross-difference' Fourier synthesis for the new derivative will automatically correspond to the same origin and hand as prior sites used in the phasing process, although the hand may still be incorrect. Finally, GREF can be used to generate the Fourier coefficients [{(| F_{PH} - F_{P}|) \exp [i\varphi_{H_{\rm (calc)}}]\quad\hbox{or}\quad [|F_{PH} - F_{P}| - F_{H_{\rm (calc)}}] \exp [i\varphi_{H_{\rm (calc)}}],} \eqno(25.2.1.19)] with the second set producing a map similar to that obtained using equation (25.2.1.16)[link]. 
Both coefficient sets in equation (25.2.1.19)[link] are lacking in that the phase difference is ignored, but the second set [and also those in equation (25.2.1.16)[link]] has the advantage that heavy-atom sites already in the model are subtracted away, allowing any remaining minor sites to stand out in the resulting map.
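Equation (25.2.1.17) is the cosine rule applied to the vector triangle F_PH = F_P + F_H. A small numpy check (function name and numerical values invented for illustration) confirms that it recovers the heavy-atom amplitude exactly when the phases are exact:

```python
import numpy as np

def f_h_obs(f_ph, f_p, phi_ph, phi_p):
    """Heavy-atom amplitude from eq. (25.2.1.17) (cosine rule)."""
    fh2 = f_ph ** 2 + f_p ** 2 - 2.0 * f_ph * f_p * np.cos(phi_ph - phi_p)
    return np.sqrt(np.maximum(fh2, 0.0))   # guard against rounding below zero

# Build F_PH = F_P exp(i*phi_P) + F_H exp(i*phi_H), then recover |F_H|
# from amplitudes and phases alone.
f_h_true, phi_h = 45.0, np.deg2rad(110.0)
f_p, phi_p = 300.0, np.deg2rad(25.0)
f_ph_vec = f_p * np.exp(1j * phi_p) + f_h_true * np.exp(1j * phi_h)
f_ph, phi_ph = np.abs(f_ph_vec), np.angle(f_ph_vec)
recovered = float(f_h_obs(f_ph, f_p, phi_ph, phi_p))
print(round(recovered, 6))
```

In practice the phases are only estimates, so the recovered amplitude carries their errors; it is nevertheless a better approximation to the true heavy-atom amplitude than the simple difference |F_PH − F_P|.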

25.2.1.7.2. Bijvoet difference and cross-Bijvoet difference Fourier syntheses

The files produced by PHASIT for anomalous-scattering data sets contain the information needed to produce the Fourier coefficients [(F_{PH}^{+} - F_{PH}^{-})_{\rm obs} \exp [i(\varphi_{P}^{+} - \pi/2)] \eqno(25.2.1.20)] or [{[(F_{PH}^{+} - F_{PH}^{-})_{\rm obs} - (F_{PH}^{+} - F_{PH}^{-})_{\rm calc}] \exp [i(\varphi_{P}^{+} - \pi/2)],} \eqno(25.2.1.21)] where [\varphi_{P}^{+}] is the protein phase used when computing [F_{PH}^{+}]. The coefficients in equation (25.2.1.20)[link] correspond to a conventional Bijvoet difference Fourier map, which should show large positive peaks at the locations of anomalous-scattering sites when the hand is correct. The coefficients in equation (25.2.1.21)[link] correspond to the case in which contributions from known anomalous scatterers are subtracted out. As in the isomorphous-replacement case, a program MRGBDF (interactive) is also provided to generate the Fourier coefficients [m(F_{PH}^{+} - F_{PH}^{-})_{\rm obs} \exp [i(\varphi_{P}^{+} - \pi/2)], \eqno(25.2.1.22)] where the Bijvoet difference does not necessarily have to come from a derivative used in the phasing. If it does not, a `cross-Bijvoet difference' Fourier map is obtained, which should produce large positive peaks at anomalous-scatterer locations in the new derivative when the original hand is correct. Additionally, GREF can be used to generate the Fourier coefficients [{(|F_{PH}^{+} - F_{PH}^{-}|_{\rm obs}) \exp (i\varphi_{H}^{+}) \hbox{ or } [|F_{PH}^{+} - F_{PH}^{-}|_{\rm obs} - F_{H_{\rm (calc)}}^{+}] \exp (i \varphi_{H}^{+}),} \eqno(25.2.1.23)] where [F_{H}^{+}] and [\varphi_{H}^{+}] are the heavy-atom structure-factor amplitude and phase used when computing [F_{PH}^{+}]. These coefficients can also be used to identify additional anomalous-scatterer sites, but they are insensitive to the hand.
As in equation (25.2.1.21)[link], if the second set in equation (25.2.1.23)[link] is used, then contributions from anomalous scatterers already included in the phasing will be subtracted out.
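A one-line illustration of the coefficient in equation (25.2.1.22), with invented values and a hypothetical function name: the −π/2 shift compensates for the 90° phase advance of the anomalous contribution, so that the resulting map shows positive real peaks at the anomalous-scatterer sites when the hand is correct.

```python
import numpy as np

def bijvoet_coeff(f_plus, f_minus, phi_p, m=1.0):
    """Bijvoet-difference Fourier coefficient, eq. (25.2.1.22):
    m (F+ - F-)_obs exp[i(phi_P - pi/2)]."""
    return m * (f_plus - f_minus) * np.exp(1j * (phi_p - np.pi / 2.0))

# For phi_P = 90 deg the shifted phase is zero, so the coefficient is
# purely real and positive for a positive Bijvoet difference.
c = bijvoet_coeff(310.0, 304.0, np.deg2rad(90.0), m=0.7)
print(round(c.real, 6))
```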

Finally, the program HNDCHK (interactive) is provided to determine the enantiomorph by examination of a Bijvoet difference Fourier map. One inputs the map along with the anomalous-scatterer positions used in the phasing. The program then uses a 64-point cubic spline interpolation algorithm to obtain the density precisely at the input coordinates and also at coordinates related to them by a centre of symmetry. If the input heavy-atom configuration had the correct hand, large positive peaks should occur exactly at the input locations. If the hand is incorrect, even larger negative peaks occur at the true positions, i.e. those related to the input positions by a centre of symmetry.
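The decision rule used by HNDCHK can be sketched as follows. This is a hypothetical simplification: trilinear interpolation stands in for the 64-point cubic spline, the function names are invented, and the toy maps contain single grid-point "peaks" rather than real Bijvoet-difference density.

```python
import numpy as np

def density_at(rho, frac):
    """Trilinear interpolation of map rho at a fractional coordinate
    (a simple stand-in for HNDCHK's 64-point cubic spline)."""
    g = np.array(rho.shape)
    x = (np.asarray(frac, dtype=float) % 1.0) * g
    i0 = np.floor(x).astype(int)
    f = x - i0
    val = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((f[0] if dx else 1 - f[0]) *
                     (f[1] if dy else 1 - f[1]) *
                     (f[2] if dz else 1 - f[2]))
                val += w * rho[tuple((i0 + [dx, dy, dz]) % g)]
    return val

def hand_is_correct(rho, sites):
    """Positive density at the input sites that outweighs the (negative)
    density at the centrosymmetrically related sites implies a correct hand."""
    d_in = sum(density_at(rho, s) for s in sites)
    d_inv = sum(density_at(rho, [-c for c in s]) for s in sites)
    return bool(d_in > 0 and d_in > -d_inv)

# Toy Bijvoet-difference maps on an 8x8x8 grid (values invented):
rho_ok = np.zeros((8, 8, 8))
rho_ok[2, 2, 2] = 5.0                  # +peak at the input site (0.25)^3
rho_bad = rho_ok.copy()
rho_bad[6, 6, 6] = -8.0                # larger -peak at the inverse site
site = [(0.25, 0.25, 0.25)]
print(hand_is_correct(rho_ok, site), hand_is_correct(rho_bad, site))
```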

25.2.1.8. Solvent flattening and negative-density truncation

Solvent flattening with negative-density truncation is efficiently carried out by the programs BNDRY, FSFOUR, MAPINV and RMHEAVY, all of which are run in batch mode with multiple iterations under the control of a command procedure or shell script. The various aspects of the process as implemented are described below.

25.2.1.8.1. Mask construction

Solvent-mask construction follows the procedure suggested by Wang (1985)[link], with the exception that electron density in the vicinity of heavy-atom sites is temporarily ignored during the mask-building process. This allows one to use a tight solvent mask, which maximizes the phasing power of the method while preventing artificial extension of the protein envelope into the solvent region in the vicinity of surface-bound heavy-atom sites. Failure to do this has occasionally been found to deplete the protein region elsewhere to compensate for the incorrectly extended region.

25.2.1.8.1.1. Automated mask construction

An electron-density map produced by FSFOUR is passed to the program RMHEAVY along with a set of heavy-atom locations and a blanking radius. A copy of the map is then made that is identical to the original except that density values within the blanking radius of any heavy-atom site are set to zero. The modified map is then passed to program MAPINV, which sets to zero all density values that were negative (note that the [F_{000}] coefficient is not included in program FSFOUR) and then computes the corresponding set of structure factors by Fourier inversion. These structure factors are then passed to program BNDRY along with a resolution-dependent averaging radius R to compute the Fourier transform of the direct-space weighting function, [{W(r) = 1 - r/R \hbox{ if } r \leq R\quad\hbox{and}\quad W(r) = 0 \hbox{ if } r\;\gt\;R,} \eqno(25.2.1.24)] where W(r) is the weighting function and r is the distance from the map grid point being evaluated. R is typically 2.5–3 times the minimum d spacing in the data set. Each unique structure factor obtained from map inversion is then multiplied by the transform of W(r), f(s), given by [f(s) = 4\pi R^{3} \{2 [1 - \cos (A)] - A \sin (A)\}/A^{4}, \eqno(25.2.1.25)] where [A = 4\pi R \sin (\theta)/\lambda. \eqno(25.2.1.26)] These weighted structure factors are then input to FSFOUR to compute a `smeared' map, which corresponds to convolution of all non-negative density in the original map with the weighting function W(r). The `smeared' map is then passed to BNDRY along with an estimate of the solvent fractional volume. The fractional volume is converted to the corresponding number of grid points occupied by solvent, and a histogram is constructed identifying the number of grid points associated with each density value. Starting with the lowest observed density, a threshold value is increased incrementally, and a running sum is maintained identifying the current number of grid points with density values below the threshold.
When the number of points accumulated reaches the expected number in the solvent region, the corresponding threshold indicates the density value for the protein–solvent boundary contour level. A mask map having a one-to-one correspondence with the map grid is then constructed such that if the density in the smeared map is less than the contour level, the grid point is deemed to be in the solvent region; otherwise, it is assigned to the protein region. The mask is then written to a file.
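The histogram accumulation described above is equivalent to taking a quantile of the smeared density at the solvent fraction, which makes the boundary determination easy to sketch. The function name and the random test map are assumptions for illustration:

```python
import numpy as np

def solvent_mask(smeared, solvent_fraction):
    """Protein/solvent mask from a smeared map (Wang, 1985).

    The boundary contour is the density value below which the expected
    number of solvent grid points falls; accumulating a density histogram
    until that count is reached is equivalent to taking this quantile.
    Returns True for protein grid points, False for solvent.
    """
    threshold = np.quantile(smeared, solvent_fraction)
    return smeared >= threshold

# Invented smeared map; 45% solvent content
rng = np.random.default_rng(1)
smeared = rng.normal(size=(16, 16, 16))
mask = solvent_mask(smeared, 0.45)
protein_frac = float(mask.mean())
print(round(protein_frac, 2))
```

The resulting boolean grid has a one-to-one correspondence with the map grid, as the mask file written by BNDRY does.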

25.2.1.8.1.2. Masks from atomic coordinates

In some instances, it may be desirable to create masks, either for solvent flattening or NC symmetry averaging, from a set of atomic coordinates. The interactive program MDLMSK can be used for this as it accepts a set of atomic coordinates along with a masking radius, mask number and map region. It then creates a mask file spanning the requested map region such that all grid points within the region that are also within the masking radius of any model atom are assigned the specified mask value, and all other points a solvent mask value. If multiple masks are required, the interactive program MRGMSK can be used to combine separate mask files created by MDLMSK into a single mask file. For NC symmetry averaging purposes, one generally creates mask files separately for each independent molecule, using an average van der Waals radius as the masking radius, and then combines them with MRGMSK. This mask is then edited in the program MAPVIEW (see below) to maintain the outer boundary, but to fill in holes within the molecular interior. This mask can be used directly for NC symmetry averaging. If it is to be used for solvent flattening, then it must first be expanded to correspond to a full unit cell by the program BLDCEL.
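The core of such a mask builder can be sketched in a few lines. This is a minimal illustration, not MDLMSK itself: the function name is invented, an orthogonal cell is assumed for brevity, and crystallographic symmetry is ignored.

```python
import numpy as np

def mask_from_coords(grid_shape, cell_edges, atoms, radius,
                     mask_value=1, solvent_value=0):
    """Flag every grid point within `radius` (in Angstroms) of any atom;
    all other points receive the solvent mask value."""
    mask = np.full(grid_shape, solvent_value, dtype=int)
    spacing = np.asarray(cell_edges, dtype=float) / np.asarray(grid_shape)
    pts = np.indices(grid_shape).reshape(3, -1).T * spacing  # coords in A
    flat = mask.reshape(-1)
    for atom in atoms:
        near = np.linalg.norm(pts - np.asarray(atom, dtype=float), axis=1) <= radius
        flat[near] = mask_value
    return mask

# One atom at the centre of a 20 A cube sampled on a 10x10x10 grid
mask = mask_from_coords((10, 10, 10), (20.0, 20.0, 20.0),
                        atoms=[(10.0, 10.0, 10.0)], radius=3.0)
print(int(mask.sum()))
```

Separate masks built this way for each independent molecule would then be merged (as MRGMSK does) and edited to fill interior holes before use.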

25.2.1.8.1.3. Mask verification and manual editing

Both solvent masks and NC symmetry averaging masks can be examined and edited interactively using the program MAPVIEW. The program reads an electron-density map and (possibly) the corresponding mask. It then displays map sections contoured at any desired level. It can also be used to view the mask superimposed on the contoured map. One can scroll through all sections of the map one at a time, examining the corresponding mask assignment. If desired, one can manually edit the mask by tracing out the protein boundary using a cursor tied to a mouse, or even create the entire mask from scratch in this manner. Other features of MAPVIEW will be described later.

25.2.1.8.2. The flattening and truncation procedure

Once a solvent mask is constructed, solvent flattening and negative-density truncation is carried out using the program BNDRY. An electron-density map and corresponding mask are input along with an empirical constant S, which is used to estimate the value of [F_{000}/V] on the scale of the input map. The estimation follows the procedure of Wang (1985)[link] and is based on the assumption that for typical solvent conditions and proteins not containing heavy metals, the ratio of mean solvent electron density to maximum protein electron density is constant, although for phasing purposes the optimum values are resolution-dependent. Typical values of S are supplied in the package. One simply couples the value of S taken from known structures with density values obtained from the experimental maps to estimate [F_{000}/V] on the appropriate (but unknown) map scale by solving the equation [{\langle \rho \rangle _{\rm solvent} + F_{000}/V \over \rho_{\max\!,\, {\rm protein}} + F_{000}/V} = S \eqno(25.2.1.27)] for [F_{000}/V]. Once the estimate of [F_{000}/V] is obtained, solvent flattening and negative-density truncation are carried out simultaneously by resetting all map values according to the relationships [\let\normalbaselines\relax\openup3pt\matrix{\rho = \langle \rho_{\rm solvent} \rangle + F_{000}/V\hfill &\hbox{if in the solvent region},\hfill\cr \rho = \max (\rho_{\rm input} + F_{000}/V, 0) \hfill &\hbox{if in the protein region}.\hfill\cr} \eqno(25.2.1.28)]
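Equation (25.2.1.27) is linear in [F_{000}/V] and solves directly, after which the reset rules of equation (25.2.1.28) are a single conditional. The sketch below (invented function names and values) shows both steps:

```python
import numpy as np

def f000_over_v(rho_solvent_mean, rho_protein_max, s):
    """Solve eq. (25.2.1.27) for F000/V on the map's arbitrary scale:
    (rho_solv + x)/(rho_max + x) = S  =>  x = (S*rho_max - rho_solv)/(1 - S)."""
    return (s * rho_protein_max - rho_solvent_mean) / (1.0 - s)

def flatten(rho, protein_mask, rho_solvent_mean, f000v):
    """Apply eq. (25.2.1.28): constant density in the solvent region,
    negative-density truncation in the protein region."""
    return np.where(protein_mask,
                    np.maximum(rho + f000v, 0.0),   # protein: truncate < 0
                    rho_solvent_mean + f000v)       # solvent: flatten

# Invented map statistics on an arbitrary scale, with S = 0.35
x = f000_over_v(rho_solvent_mean=-10.0, rho_protein_max=90.0, s=0.35)
print(round(x, 4))
```

On the shifted scale the flattened solvent takes a single positive value and no protein density is allowed below zero, which is the combined flattening/truncation constraint.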

25.2.1.9. Phase combination and extension procedures

Phase combination, either during density-modification procedures or to make use of partial structure information, is carried out by the BNDRY program (batch). For standard phase combination, two structure-factor files are input. The first file, called the `anchor' phase set, contains structure-factor information along with phase probability distributions in the form of A, B, C, D coefficients and usually corresponds to MIR, SIR, or MAD phases. The other file contains only `calculated' structure-factor amplitudes and phases and is usually obtained either from Fourier inversion of a modified electron-density map or from a structure-factor calculation based on atomic coordinates from a partial structure. Common reflections in both files are identified, and the `calculated' amplitudes are scaled to those in the anchor set by least squares. For phase combination, a variety of options are available, with the most important described below.

25.2.1.9.1. Modified Sim weights

The scaled data are sorted into bins according to d spacing, and a three-term polynomial is fitted to the mean values of [|F^{2}_{\rm obs} - F^{2}_{\rm calc}|] as a function of resolution. For each reflection, a unimodal phase probability distribution is constructed using a modification (Bricogne, 1976[link]) of the Sim (1959)[link] weighting scheme via [P(\varphi_{P}) = k \exp \left[{2F_{\rm obs} F_{\rm calc} \cos (\varphi_{P} - \varphi_{\rm calc}) \over \langle | F_{\rm obs}^{2} - F_{\rm calc}^{2}|\rangle}\right], \eqno(25.2.1.29)] where the average in the appropriate resolution range is determined from the polynomial. This distribution is cast in the A, B, C, D form with [\eqalignno{ A &= W \cos (\varphi_{\rm calc}), &\cr B &= W \sin (\varphi_{\rm calc}), &\cr C &= 0 &\cr D &= 0\;\;{\rm and}&\cr W &= {2F_{\rm obs} F_{\rm calc} \over \langle |F_{\rm obs}^{2} - F_{\rm calc}^{2}|\rangle}. &(25.2.1.30)\cr}] Phase combination with the anchor set then proceeds according to equation (25.2.1.10)[link], and the combined distributions are integrated to give a new phase and figure of merit for each reflection.
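The conversion of a modified Sim distribution into A, B, C, D coefficients, equation (25.2.1.30), is a direct evaluation. The function name and the numerical values below are invented; the bin average would in practice come from the fitted polynomial:

```python
import numpy as np

def sim_abcd(f_obs, f_calc, phi_calc, mean_abs_diff):
    """A, B, C, D coefficients of the modified Sim distribution,
    eq. (25.2.1.30).  mean_abs_diff is <|F_obs^2 - F_calc^2|> for the
    reflection's resolution bin."""
    w = 2.0 * f_obs * f_calc / mean_abs_diff
    return w * np.cos(phi_calc), w * np.sin(phi_calc), 0.0, 0.0

a, b, c, d = sim_abcd(f_obs=120.0, f_calc=100.0,
                      phi_calc=np.deg2rad(60.0), mean_abs_diff=4000.0)
print(round(a, 4), round(b, 4))
```

Because C and D are zero, the distribution is unimodal; combination with the anchor set is then simply a sum of the corresponding coefficients, as in equation (25.2.1.10).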

25.2.1.9.2. σA weights

As an alternative to the procedure above, in the BNDRY program the weights, W, used when constructing the unimodal probability distributions in equations (25.2.1.30)[link] can be computed according to [W = {2\sigma_{A} E_{\rm tot} E_{\rm par} \over 1 - \sigma_{A}^{2}}, \eqno(25.2.1.31)] where [E_{\rm tot}] and [E_{\rm par}] are normalized structure-factor amplitudes for the observed and calculated structure factors, respectively, and [\sigma_{A}] is determined by the procedure described by Read (1986)[link]. For acentric reflections, equation (25.2.1.31)[link] is used, whereas for centric reflections, W is one half the value given by equation (25.2.1.31)[link].
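The σA weight, including the centric halving, is a one-line evaluation. A small sketch with an invented function name and invented values, following Read (1986), where the acentric weight is 2σA·E_obs·E_calc/(1 − σA²):

```python
def sigma_a_weight(sigma_a, e_obs, e_calc, centric=False):
    """Unimodal-distribution weight from sigma_A (Read, 1986):
    acentric W = 2*sigma_A*E_obs*E_calc/(1 - sigma_A**2); centric W is half."""
    w = 2.0 * sigma_a * e_obs * e_calc / (1.0 - sigma_a ** 2)
    return 0.5 * w if centric else w

w_acen = sigma_a_weight(0.8, 1.2, 1.1)
w_cen = sigma_a_weight(0.8, 1.2, 1.1, centric=True)
print(round(w_acen, 4), round(w_cen, 4))
```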

25.2.1.9.3. Damping contributions

Normally, the distributions constructed for the calculated phases are combined with those for the anchor set with full weight in equation (25.2.1.10)[link]. However, in BNDRY, one can supply a damping factor in the range 0–1 to down-weight the contributions of the anchor set. The damping factor simply multiplies the distribution coefficients such that a factor of 1 (default) indicates no damping, and values less than one place more emphasis on the map-inverted or partial structure phases. If set to zero, the calculated phases are accepted as they are, since there is effectively no phase combination with the anchor set.

25.2.1.9.4. Phase extension

If phase extension is requested during the phase combination step, an additional file (prepared by the interactive program MISSNG) is also supplied to the BNDRY program. This file contains unique reflections absent from the anchor set but for which observed amplitudes (and possibly phase probability distribution coefficients) are available. Phase combination then proceeds exactly as above, except that for any extended reflections lacking phase probability information, the calculated phases are accepted as they are. Phase extension is required when phasing purely by SAS methods as it is the only way to phase centric reflections. As a final option, phase and amplitude extension is possible, in which case both the calculated amplitude and phase are accepted as they are for reflections having only indices provided on the extension file. This is sometimes desirable to include low-resolution reflections that may have been obscured by the beam stop.

25.2.1.10. Noncrystallographic symmetry calculations

Several programs are provided to carry out noncrystallographic symmetry averaging within submaps and are briefly described below.

25.2.1.10.1. Operator representation and definitions

NC symmetry operators are specified in terms of the parameters ϕ, ψ, χ, [O_{x}], [O_{y}], [O_{z}] and t, which refer to a Cartesian coordinate system in Å, obtained by orthogonalization of the unit cell as in the Protein Data Bank (Bernstein et al., 1977[link]). The angles ϕ and ψ determine the direction of the NC rotation axis, while χ determines the amount of rotation about it. [O_{x}], [O_{y}] and [O_{z}] are coordinates of a point through which the rotation axis passes, and t is a post-rotation translation parallel to the rotation axis. The relationships between the angles, orthogonal reference axes X, Y, Z and the unit cell are given in Fig. 25.2.1.2[link]. Coordinates for a pair of points related by NC symmetry are then expressed in the orthogonal system by [P_{2} = R_{\varphi,\, \psi,\, \chi} (P_{1} - O) + O + tD_{\varphi,\, \psi}, \eqno(25.2.1.32)] where [P_{1}] and [P_{2}] are three-element column vectors containing coordinates for the related points, [R_{\varphi,\, \psi,\, \chi}] is a 3 × 3 rotation matrix derived from the angles, O is a three-element column vector containing coordinates for a point through which the rotation axis passes, t is the post-rotation translation scalar in Å and [D_{\varphi,\, \psi}] is a three-element column vector containing direction cosines of the rotation axis. This type of parameterization simplifies the transfer of information from self-rotation functions, which are usually calculated in spherical polar angles anyway, and also makes pseudo-space-group-type operations, such as pseudo-screw axes, readily apparent. For convenience, a program O_TO_SP is provided to convert from a 3 × 3 rotation matrix and 1 × 3 column vector representation of the NC symmetry operation, as used in some programs, to the parameters described here.
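Equation (25.2.1.32) can be applied with a Rodrigues rotation about the axis direction derived from ϕ and ψ. The sketch below is an illustration only: the function names are invented, and the sign conventions in `axis_direction` are one plausible reading of Fig. 25.2.1.2 (the exact handedness must be taken from the figure).

```python
import numpy as np

def axis_direction(phi, psi):
    """Unit vector of the NC rotation axis: psi measured from +Y, phi for
    the projection in the XZ plane measured from +X (sign conventions here
    are an assumption; see Fig. 25.2.1.2 for the definitive ones)."""
    return np.array([np.sin(psi) * np.cos(phi),
                     np.cos(psi),
                     np.sin(psi) * np.sin(phi)])

def rotation_about(d, chi):
    """Rodrigues rotation matrix for angle chi about unit axis d."""
    k = np.array([[0.0, -d[2], d[1]],
                  [d[2], 0.0, -d[0]],
                  [-d[1], d[0], 0.0]])
    return np.eye(3) + np.sin(chi) * k + (1.0 - np.cos(chi)) * (k @ k)

def apply_nc(p1, phi, psi, chi, origin, t):
    """Eq. (25.2.1.32): P2 = R (P1 - O) + O + t D."""
    d = axis_direction(phi, psi)
    r = rotation_about(d, chi)
    return r @ (np.asarray(p1, dtype=float) - origin) + origin + t * d

# Twofold (chi = 180 deg) about an axis parallel to +Y through (5, 0, 0) A:
p2 = apply_nc([7.0, 3.0, 0.0], phi=0.0, psi=0.0, chi=np.pi,
              origin=np.array([5.0, 0.0, 0.0]), t=0.0)
print(np.round(p2, 6))
```

A nonzero t with χ = 180° would describe a pseudo-twofold screw, which is why the parameterization makes such operations easy to recognize.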

Figure 25.2.1.2

Relationships between noncrystallographic symmetry rotation axis direction, orthogonal reference system axes X, Y, Z and crystallographic axes. The X axis is aligned with the crystal a. The Y axis is parallel to [{\bf c}^{*} \times {\bf a}]. The Z axis is parallel to [{\bf X} \times {\bf Y}], i.e. [{\bf c}^{*}]. ψ is the angle between the NC rotation and +Y axes. ϕ is the angle between the projection of the NC rotation axis in the XZ plane and the +X axis, with +ϕ counterclockwise when viewed from +Y toward the origin. χ is the amount of rotation about the directed axis, with +χ clockwise when viewed from the axis toward the origin.

25.2.1.10.2. Operator refinement

Refinement of the NC symmetry operator parameters is achieved by least-squares minimization of the squared difference in electron density for all NC-symmetry-related points. Thus, one minimizes [\textstyle\sum\displaystyle \{\rho (r) - \rho [R_{\varphi,\, \psi,\, \chi} (r - O) + O + tD_{\varphi,\, \psi}]\}^{2} \eqno(25.2.1.33)] with respect to the operator parameters, where the sum is taken over all points within the appropriate averaging envelope(s). One starts refinement with low-resolution data (∼6 Å) on a coarse (∼2 Å) grid and monitors progress by following the correlation coefficient between the related electron-density values. Once convergence is obtained, the calculation is resumed with higher-resolution data on a finer grid. Typically, a correlation coefficient of around 0.4 or higher (for a 3 Å MIR map, 1 Å grid) indicates that the operator has been correctly located. The operator refinement is confined to submaps and is facilitated by use of an orthogonal grid. A submap containing the molecules to be averaged is obtained from the programs MAPVIEW or EXTRMAP and can be converted to an orthogonal grid, if needed, by the program MAPORTH, as described earlier.
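The correlation coefficient used to monitor the refinement is the ordinary linear correlation between density values at NC-related points. A short sketch with an invented function name and synthetic density samples:

```python
import numpy as np

def density_correlation(rho_a, rho_b):
    """Correlation coefficient between paired density values, e.g. the map
    sampled at points in the envelope (rho_a) and at their NC-related
    positions (rho_b); ~0.4 or higher for a 3 A MIR map on a 1 A grid
    suggests a correctly located operator, per the text."""
    a = rho_a - rho_a.mean()
    b = rho_b - rho_b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

# Synthetic example: identical densities correlate perfectly; adding noise
# (standing in for operator error and map error) lowers the correlation.
rng = np.random.default_rng(2)
rho = rng.normal(size=1000)
related = rho + 0.3 * rng.normal(size=1000)
print(density_correlation(rho, rho), density_correlation(rho, related))
```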

25.2.1.10.2.1. Simple rotational symmetry

For `proper' NC symmetry, only pure n-fold rotations are involved with n a small integer, i.e. twofold, threefold etc. In this case, only a single envelope mask encompassing all of the molecules to be averaged is needed for averaging and operator refinement, since one does not have to differentiate between molecules within the aggregate. Initial operator refinement can use a simple spherical mask of appropriate radius, with the sphere centred near the aggregate centre of mass and on the rotation axis. One can also use a mask created either by hand (described below) or created from atomic coordinates, as described earlier. For averaging purposes, however, a mask created by hand is usually desired. The NC symmetry operator refinement is carried out within the program LSQROT (batch).

25.2.1.10.2.2. Complex rotational and/or translational symmetry

For `improper' NC symmetry, where there are translational components and/or arbitrary rotation angles involved, separate envelope masks must be assigned to each molecule in the aggregate for both NC symmetry operator refinement and averaging. Initial operator refinement can proceed with spherical masks of an appropriate radius centred on the centre of mass of each molecule in the aggregate. As in the `proper' NC symmetry case, one can also use masks created by hand or generated from atomic coordinates for operator refinement, but hand-traced masks will be desired for the actual averaging. The NC symmetry operator refinement is carried out within the program LSQROTGEN (batch).

25.2.1.10.3. Averaging mask construction

Masks encompassing the region(s) to be averaged are usually created by hand in the interactive program MAPVIEW. Here, one reads in a submap comprising the desired region of whatever type of map is available, usually an MIR map. An appropriate contour level and initial section are selected and the contoured electron density for that section appears on the screen. One then selects the `add next section' menu item two or three times to create a projection over several sections of the map, since in the projection the molecular boundary is usually more obvious. Selecting the `trace mask' menu item then allows the user to hand-contour the molecular envelope by directing the cursor tied to a mouse or other pointing device. One then moves to an adjacent section and repeats the process until the complete 3D mask is obtained. To simplify matters and speed up the process, there are `copy next mask' and `copy previous mask' menu items allowing one to take advantage of the fact that the mask is a slowly changing function, particularly when near the centre of the molecule. One can use this feature to copy a mask from the previous or following section and apply it to the current section. Up to twelve distinct 3D masks can be selected. Each mask is colour-coded and can be simultaneously displayed superimposed on the contoured electron-density section. Once the mask is completed, the `make asu' menu item is selected to apply crystallographic symmetry operations to all points within the generated envelope masks. If these operations generate a point also within the envelope masks, the point is flagged in red to indicate that it is redundant, indicating that when tracing the mask, one inadvertently strayed into a symmetry-related molecule. After this check for redundancy, all points within the submap distinct from but related to points within the molecular envelopes by crystal symmetry are flagged in green. 
This enables one to detect packing contacts and also to ensure that all significant electron density has been assigned to some envelope. Upon completion, the mask is written to a file suitable for use either in averaging, solvent flattening (after expansion by BLDCEL), or operator refinement. In cases of `proper' NC symmetry, it is often desirable to trace the averaging envelope mask in a `skewed' map, such that one is looking directly down the NC rotation axis. In this case, it is usually very obvious where the NC symmetry breaks down, simplifying identification of the averaging envelope. If the averaging mask is created in a skewed submap, then the batch program TRNMSK can be used to transform it so as to correspond to the original, unskewed submap for use in averaging calculations (which do not require skewing).

25.2.1.10.4. Map averaging

All averaging calculations are carried out by the program MAPAVG (batch), which requires the submap to be averaged along with the envelope masks and NC symmetry operators. A copy of the input submap is made and each grid point in the mask is examined in turn. If the grid point lies within any averaging envelope, then all points related to it by NC symmetry are generated from the operators and examined. If the generated points also lie within the appropriate envelope mask, the electron density there is interpolated, as described earlier, and the density values for all related points are summed. The average value of the electron density is then inserted at the original point in the submap copy. Upon completion, the averaged version of the submap is written to a file and correlation coefficients for regions related by the various NC symmetry operations are output. The averaged submap is then passed to the program BLDCEL along with the averaging mask and the original unaveraged FSFOUR map from which the submap was created. For all points within the averaging envelope(s), their electron-density values and those at points related by crystallographic symmetry are inserted into the full-cell map, and it is written to a file. This file then contains the NC symmetry averaged electron density expanded to a full-cell map that obeys space-group symmetry. As an option, the averaging mask can also be expanded in BLDCEL to a full-cell mask, which could then be used for solvent flattening.
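The averaging loop can be sketched as follows. This is a hypothetical simplification of what MAPAVG does: the function name is invented, nearest-grid lookup replaces interpolation, and the NC operators are supplied as plain index-mapping callables rather than refined ϕ, ψ, χ, O, t parameters.

```python
import numpy as np

def average_map(rho, mask, operators, grid):
    """NC-average a submap copy: for every grid point inside the envelope,
    average the density over all NC-related points that also fall inside
    the envelope (nearest-grid lookup here, where MAPAVG interpolates)."""
    out = rho.copy()
    for p in np.argwhere(mask):
        vals = [rho[tuple(p)]]
        for op in operators[1:]:                  # operators[0] = identity
            q = np.rint(op(p)).astype(int) % grid
            if mask[tuple(q)]:
                vals.append(rho[tuple(q)])
        out[tuple(p)] = np.mean(vals)
    return out

# Hypothetical twofold relating grid point p to (grid - p) mod grid,
# with the whole 8x8x8 submap inside the envelope:
rng = np.random.default_rng(3)
rho = rng.normal(size=(8, 8, 8))
mask = np.ones((8, 8, 8), dtype=bool)
ops = [lambda p: p, lambda p: (8 - p) % 8]
avg = average_map(rho, mask, ops, 8)
print(avg.shape)
```

By construction the averaged submap is exactly invariant under the NC operation, which is what the output correlation coefficients of MAPAVG measure on real data.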

25.2.1.10.4.1. Single-crystal averaging

For NC symmetry averaging within a single crystal, the calculations are exactly as described above. One refines the NC symmetry operators with LSQROT or LSQROTGEN, creates the appropriate envelope mask(s) with MAPVIEW, averages with MAPAVG and expands the averaged submap to a full cell with BLDCEL.

25.2.1.10.4.2. Multiple-crystal averaging

If multiple crystal forms are available and one has a source of phase information for each crystal form, then averaging over the independent molecular copies within all crystal forms is possible. In fact, one may also have NC symmetry within some of the crystal forms. One can utilize all of this information during averaging by exactly the same process as previously described. For each form, the appropriate envelope mask(s) must be obtained and any internal NC symmetry operators refined, as described earlier. Then operators relating molecules from one crystal form to another must be obtained and refined. The program LSQROTGEN can read in multiple submaps, allowing refinement of the additional operators. The program MAPAVG accepts submaps from up to six different crystal forms. Averaging over all copies then proceeds exactly as described above, except that prior to averaging, density in all submaps is placed on a common scale, and upon completion averaged submap files are written for each crystal form.

25.2.1.10.5. Phase combination and extension

During NC symmetry averaging, phase combination and extension is carried out precisely as described during solvent flattening and negative-density truncation. The only difference is that after generation of each electron-density map, the NC symmetry averaging is carried out on the appropriate submap region, which is then expanded back to a full-cell map prior to each solvent-flattening calculation.

25.2.1.11. Automated iterative processing

The most common iterative processes are carried out by shell scripts or command procedures. These procedures merely direct the flow of map, mask, structure-factor and control-data files between the various programs, while controlling the number of iterations in the process. Generally, one does not have to alter these scripts, although expert users may want to in special circumstances.

25.2.1.11.1. The DOALL procedure

A script to carry out a standard solvent-flattening run is provided along with a description of the expected input files, output files and examples. Not surprisingly, this DOALL procedure does it all. Execution of the script will create a map from an input `anchor' set of phases, typically obtained by MIR, SIR, or MAD methods, and will then create a solvent mask from the map after zeroing out density near heavy-atom sites. This solvent mask is used in four cycles of solvent flattening, combining the map-inverted phase information with the anchor phases. A new solvent mask is then generated, starting from a map produced with the phases after the first four cycles. Four cycles of solvent flattening using the second solvent mask are then carried out, restarting from the original map and combining with the anchor phases. These phases are then used to compute a new map from which a third solvent mask is built. The third mask is then used for eight cycles of solvent flattening, again restarting with the original map and combining with the anchor phases. Supplied in the script, but commented out, are instructions to carry out an arbitrary number of additional phase extension cycles, and then an arbitrary number of phase and amplitude extension cycles, all using the third solvent mask. The combined phases and distribution coefficients are written to a file after all cycles with a given mask are completed.

25.2.1.11.2. The EXTNDAVG and EXTNDAVG_MC procedures

Additional scripts are provided to carry out phase extension and/or NC symmetry averaging iterations. These scripts are executed after completion of a normal solvent-flattening run with the DOALL procedure. With the EXTNDAVG script, an input number of additional solvent flattening and/or phase combination cycles are carried out, and phase (and possibly amplitude) extension may be requested. Initial and final d spacings are input to the program SLOEXT (batch) along with the number of map modification or phase combination iterations per step, where each step represents the extension by one reciprocal-lattice point in each direction if phase extension is to be carried out. The calculations proceed where the DOALL script leaves off, starting with a map made from the final phases and using the third solvent mask. If NC symmetry averaging is to be carried out, after each map calculation the appropriate submap is extracted from it and is passed to MAPAVG along with the averaging mask. The averaged submap is passed to BLDCEL, where it is expanded to a full-cell map, which then is passed to BNDRY for solvent flattening. Map inversion and phase combination then proceed normally (although possibly with phase extension). Note that, in general, separate masks are used for solvent flattening and averaging.

The EXTNDAVG_MC script carries out the same procedures and options as the EXTNDAVG script, except that it is used when carrying out NC symmetry averaging with multiple crystal forms. Starting phase files, anchor phases, solvent masks, averaging masks and control files are provided for each crystal form. For each form, the solvent flattening and phase combination steps are carried out independently with the appropriate data; however, during the averaging step, maps from all crystal forms are involved.

25.2.1.12. Graphical capabilities

To facilitate visual evaluation of phasing results and input data, several (mainly interactive) programs are provided within the package. The programs are used to display contoured electron-density or Patterson maps, for interactive editing of solvent or averaging masks, and for visualization of input or difference diffraction data on workstation monitors or terminals. In most instances, hard copies for inclusion in manuscripts are also obtainable. The interactive graphics programs MAPVIEW, PRECESS and VIEWPLT are provided with two versions of each: one for use on Silicon Graphics workstations and the other (indicated by the same program name but ending in _X) for use on any display device supporting the X-Window protocol. The functionality, input and documentation are identical in both versions of each program.

25.2.1.12.1. Pseudo-precession photographs

The interactive program PRECESS is provided to display diffraction data in the form of pseudo-precession photographs. One can display any zone or step through all zones, with the corresponding intensities mapped to a colour scheme. If a grey scale is selected, the image looks very much like a properly exposed precession photograph taken with Polaroid film. When the cursor is placed near a reciprocal-lattice point, the Miller indices, intensity, standard deviation and d spacing are displayed, allowing one to quickly confirm or identify space groups and Laue symmetry. If a scaled file is input containing isomorphous-replacement or anomalous-scattering data, one can display the corresponding intensity differences instead of the native intensities and quickly visualize the distribution of differences to help assess isomorphism.

25.2.1.12.2. Interactive contouring or mask editing

The interactive program MAPVIEW can be used to examine contoured electron-density or Patterson maps, as well as to examine, create or edit solvent or averaging masks. Either full-cell FSFOUR maps or submaps (including skewed submaps) can be used, although only from the former can any arbitrary region be obtained and reordered interactively. The mask creation and editing functions have been described earlier. The program is very useful for Patterson analysis, evaluation of phasing results and to help decide which region is appropriate for isolating a molecule for use in model building. It is usually crucial for construction of averaging masks, but is also useful for examining or editing other masks.

25.2.1.12.3. Off-line contouring

While MAPVIEW is extremely useful, there are times when it is desirable to have individual plots available either for comparison, stereo viewing of electron density, or incorporation into documents. The program CTOUR (batch) handles these functions and accepts an input FSFOUR map or submap. The CTOUR program can create any number of plot files in a single run, with each consisting of either an individual section, a mono projection, or a stereo projection, with each projection over different multiple sections. If full-cell FSFOUR maps are input, any desired region may be selected, whereas if submaps (including skewed maps) are input, the accessible regions are limited by those present in the input map.

25.2.1.12.4. Generic plot files and drivers

The plot files created by CTOUR are generic in nature and are not directly displayable. One needs a driver program to convert the generic files to the format appropriate for the desired display device. The appropriate drivers for several popular display devices are provided within the package and are described below.

25.2.1.12.4.1. GL displays

For display on Silicon Graphics workstations, the interactive program VIEWPLT can be used to examine the generic plots created by CTOUR. Up to ten plots can be displayed simultaneously. It is particularly useful to display the various contoured Harker sections simultaneously during difference-Patterson interpretation.

25.2.1.12.4.2. X-Window displays

For display of CTOUR plots on monitors supporting the X-Window protocol, including most workstation monitors and X-terminals, the program VIEWPLT_X can be used instead of VIEWPLT. The functionality is identical to the GL version.

25.2.1.12.4.3. PostScript files

The interactive program MKPOST is provided to generate standard PostScript equivalents from the generic plot files produced by CTOUR. Multiple plot files can be generated in the same process. The PostScript files can be printed, viewed with a PostScript previewer, or incorporated into other documents.

25.2.1.12.4.4. Tektronix output

The interactive program PLTTEK can be used to display the generic plots created by CTOUR on any device supporting Tektronix 4010 emulation. While slow, this enables visualization of the plots on many `dumb' terminals.

25.2.1.13. Auxiliary programs

In addition to the major programs already described, a number of auxiliary programs (all interactive) are provided in the package to aid the user in porting information to or from external software and to assess phasing methods. These programs are briefly described below.

25.2.1.13.1. Coordinate conversions

Within the package, fractional atomic coordinates are used extensively, and the program PDB_CDS is provided to convert from PDB (Protein Data Bank) to PHASES coordinate files and vice versa. The program prompts for input and output file names, the direction of the conversion, chain or residue ranges, and whether to reset occupancies and/or thermal factors to specified values. The coordinate ranges (both fractional and in PDB coordinates) spanned by the model are also listed.

25.2.1.13.2. NC symmetry operator conversions

The program O_TO_SP is provided to convert NC symmetry operators expressed in terms of a 3 × 3 rotation matrix and 1 × 3 translation vector to the PHASES-style spherical polar system described earlier. Although originally written to convert the transformation operator as defined in the O program (Jones et al., 1991), the procedure works for any rotation or translation operator expressed in this form, provided that the operator is applicable to Cartesian coordinates in Å orthogonalized as in the Protein Data Bank (Bernstein et al., 1977).
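The heart of such a conversion is extracting the rotation angle and axis from the matrix, then expressing the axis direction in polar angles. The sketch below is illustrative only — the function name and angle conventions are assumptions, not the documented PHASES spherical polar definition:

```python
import math

def rot_to_spherical_polar(R):
    """Rotation angle kappa and axis direction (psi, phi), in degrees, from a
    3x3 Cartesian rotation matrix. Convention assumed here: psi is the
    inclination of the axis from +z, phi its azimuth in the xy plane."""
    trace = R[0][0] + R[1][1] + R[2][2]
    kappa = math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))
    # Rotation axis from the antisymmetric part of R (valid for 0 < kappa < pi)
    ax = (R[2][1] - R[1][2], R[0][2] - R[2][0], R[1][0] - R[0][1])
    norm = math.sqrt(sum(c * c for c in ax))
    ax = tuple(c / norm for c in ax)
    psi = math.degrees(math.acos(max(-1.0, min(1.0, ax[2]))))  # tilt from +z
    phi = math.degrees(math.atan2(ax[1], ax[0]))               # azimuth in xy
    return math.degrees(kappa), psi, phi

# Example: a 90 degree rotation about z
R = [[0.0, -1.0, 0.0],
     [1.0,  0.0, 0.0],
     [0.0,  0.0, 1.0]]
kappa, psi, phi = rot_to_spherical_polar(R)
```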

25.2.1.13.3. Binary or formatted file conversions

For efficiency, structure-factor files used within the package are binary; however, the program RD31 is provided to read these binary files and convert them to formatted files that can be examined and possibly edited by the user. The indices, amplitudes, phases, figures of merit, phase probability distribution coefficients, and markers indicating which reflections are centric along with the allowed phase values are thus made readily accessible. A corresponding program, MK31B, is also provided to reverse the process; it reads the formatted (and possibly edited) versions of the structure-factor files and generates the appropriate binary-file equivalents. Additionally, the program XPL_PHI is supplied to convert the binary structure-factor files to a form readable by the X-PLOR program (Brünger et al., 1987) in order to facilitate complete model refinement. Phase and figure-of-merit information are also passed to the output file, allowing refinement with phase restraints if desired.

25.2.1.13.4. Importing phase information

The program IMPORT allows users to `import' phase information obtained from programs external to the package so it can be used for subsequent calculations within the package. For example, one can use phase and probability distribution information obtained elsewhere to initiate solvent flattening, negative-density truncation and/or NC symmetry averaging within PHASES, or simply to generate and display maps with MAPVIEW or the other graphics programs. Reflection indices, the observed structure-factor amplitude, figure of merit, phase and phase probability distribution coefficients must be supplied, although free format can be used.

25.2.1.13.5. Phase set comparisons

The program PSTATS compares phases in two different structure-factor files. It lists mean phase differences as a function of d spacing for common reflections. The program is very useful for comparing results from different phasing strategies and for testing new procedures against error-free phases. It can also be used to check for convergence in iterative procedures or to assess the relative contributions of phase sets during phase combination.
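The core statistic PSTATS reports can be sketched as follows — a hypothetical simplification, not the program itself: phase differences for common reflections are wrapped into the range [0, 180°] and averaged within resolution shells.

```python
def mean_phase_diff(refl, shells):
    """refl: list of (d_spacing, phase1_deg, phase2_deg) for reflections
    common to two structure-factor files; shells: list of (d_min, d_max)
    resolution shells. Returns the mean absolute phase difference in
    degrees for each shell (None for an empty shell)."""
    out = []
    for d_min, d_max in shells:
        diffs = []
        for d, p1, p2 in refl:
            if d_min <= d < d_max:
                delta = abs(p1 - p2) % 360.0
                diffs.append(min(delta, 360.0 - delta))  # wrap to [0, 180]
        out.append(sum(diffs) / len(diffs) if diffs else None)
    return out

# Hypothetical data: (d spacing in Angstrom, phase set 1, phase set 2)
refl = [(8.0, 10.0, 30.0), (8.5, 350.0, 10.0), (2.5, 0.0, 180.0)]
shells = [(4.0, 100.0), (2.0, 4.0)]
stats = mean_phase_diff(refl, shells)
```

The wrap step matters: 350° and 10° differ by 20°, not 340°, which is why the raw difference is folded back through 360°.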

25.2.2. DM/DMMULTI software for phase improvement by density modification

K. D. Cowtan,b* K. Y. J. Zhangc and P. Maind

25.2.2.1. Introduction

DM is an automated procedure for phase improvement by iterated density modification. It is used to obtain a set of improved phases and figures of merit, using as a starting point the observed diffraction amplitudes and some initial poor estimates of the phases and figures of merit. DM improves the phases through the alternating application of two processes: real-space electron-density modification and reciprocal-space phase combination. DM can perform solvent flattening, histogram matching, multi-resolution modification, averaging, skeletonization and Sayre refinement, as well as conventional or reflection-omit phase combination. Solvent and averaging masks may be input by the user or calculated automatically. Averaging operators may be refined within the program, and multiple averaging domains may be averaged using different operators.

DMMULTI is a modified version of the DM software that can perform density modification simultaneously across multiple crystal forms. The procedure is general, handling an arbitrary number of domains appearing in an arbitrary number of crystal forms. Initial phases may be provided for one or more crystal forms; however, improved phases are calculated in every crystal form.

DM and DMMULTI are distributed as part of the CCP4 suite of software for protein crystallography (Collaborative Computational Project, Number 4, 1994). The theoretical and algorithmic bases for the DM and DMMULTI software are reviewed in Chapter 15.1. In this chapter, some specific issues concerning the programs are described, including program operation, data preparation, choice of modes and code description.

25.2.2.2. Program operation

DM and DMMULTI are largely automatic; in order to perform a phase-improvement calculation only two tasks are required of the user:

  • (1) Provide the input data. These must include the reflection data and solvent content, and may also include averaging operators, solvent mask and averaging domain masks.

  • (2) Select the appropriate density modifications and the phase-combination mode to be used in the calculation.

DM and DMMULTI can run with the minimum input above, since the optimum choices for a whole range of parameters are set in the program defaults. For some special problems it may be useful to control the program behaviour in more detail; this is possible through a wide range of keywords to override the defaults. These are all detailed in the documentation supplied with the software.

25.2.2.3. Preparation of input data

Input data are provided by two routes: numerical parameters, such as the solvent content and averaging operators, are included in the command file using appropriate keywords, whereas reflections and masks are referenced by giving their file names on the command line. In the simplest case, for example a solvent-flattening and histogram-matching calculation, all that is required is an initial reflection file and an estimate of the solvent content.

Use all available data: The reflection file must be in CCP4 `MTZ' format and contain at least the structure-factor amplitudes, phase estimates and figures of merit. If the phase estimates are obtained from a homologous structure by molecular replacement, the figures of merit can be generated by the SIGMAA program (Read, 1986). When the phases are estimated from a single isomorphous derivative (SIR), it is recommended that Hendrickson–Lattman coefficients (Hendrickson & Lattman, 1970) be used to represent the phase estimate instead of the figure of merit: Hendrickson–Lattman coefficients can represent the bimodal distribution of SIR phases, whereas the figure of merit can only represent the unimodal distribution of the average of two equally probable phase choices. It is recommended that a reflection file containing every possible reflection be used. The low-resolution data should be included, since they provide a significant amount of information on the protein–solvent boundary. The high-resolution data without phase estimates should also be included, since their phases can be estimated by DM; phase extension usually improves the original phases beyond what phase refinement alone achieves. Unobserved reflections are marked by a missing-number flag. This is important for the preservation of the free-R reflections, and it also enables DM to extrapolate missing reflections from density constraints, which increases the phase-improvement power.
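The point about bimodal SIR distributions can be made concrete. A Hendrickson–Lattman phase-probability distribution has the form P(φ) ∝ exp(A cos φ + B sin φ + C cos 2φ + D sin 2φ), and the figure of merit is the magnitude of its normalized centroid. A minimal numerical sketch (the helper name is hypothetical, not part of DM):

```python
import math

def fom_from_hl(A, B, C, D, steps=720):
    """Figure of merit and centroid phase (degrees) from Hendrickson-Lattman
    coefficients, P(phi) ~ exp(A cos phi + B sin phi + C cos 2phi + D sin 2phi),
    integrated numerically on a phase grid."""
    s = c = total = 0.0
    for i in range(steps):
        phi = 2.0 * math.pi * i / steps
        p = math.exp(A * math.cos(phi) + B * math.sin(phi)
                     + C * math.cos(2 * phi) + D * math.sin(2 * phi))
        total += p
        c += p * math.cos(phi)   # real part of the centroid
        s += p * math.sin(phi)   # imaginary part of the centroid
    m = math.hypot(c, s) / total
    phi_best = math.degrees(math.atan2(s, c)) % 360.0
    return m, phi_best

m_uni, phi_uni = fom_from_hl(5.0, 0.0, 0.0, 0.0)   # single sharp peak at 0
m_bi, _ = fom_from_hl(0.0, 0.0, 2.0, 0.0)          # two sharp peaks, 0 and 180
```

With A only, the distribution is unimodal and the figure of merit is high (about 0.89 here); with C only, the two phase peaks are individually sharp yet the figure of merit collapses towards zero, which is exactly the information a single figure of merit cannot carry but the four HL coefficients can.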

The estimation of solvent content: The solvent content, C_solv, can be obtained by various experimental methods, such as the solvent-dehydration method and the deuterium-exchange method (Matthews, 1974). It can also be estimated through

C_solv = 1 − (N V_a M L)/V.   (25.2.2.1)

Here, N is the total number of atoms, including hydrogen atoms, in one protein molecule; V_a is the average volume occupied by each atom, estimated to be approximately 10 Å³ (Matthews, 1968); M is the number of molecules per asymmetric unit; L is the number of asymmetric units in the cell; and V is the unit-cell volume. The correctly estimated solvent content should be entered in the program with the SOLC keyword, since it will be used not only to find the solvent–protein boundary but also to scale the input structure-factor amplitudes. If it is desirable to use a more conservative solvent mask in order to prevent clipping of protein density, especially in flexible loop regions, different solvent and protein fractions should be specified using the SOLMASK keyword.
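Equation (25.2.2.1) translates directly into code; the numbers in the example are hypothetical:

```python
def solvent_fraction(n_atoms, v_atom, n_mol_per_asu, n_asu, v_cell):
    """C_solv = 1 - N * V_a * M * L / V, following equation (25.2.2.1).
    n_atoms: atoms per molecule (including H); v_atom: average atomic
    volume in cubic Angstrom (~10); n_mol_per_asu: molecules per
    asymmetric unit; n_asu: asymmetric units per cell; v_cell: unit-cell
    volume in cubic Angstrom."""
    return 1.0 - (n_atoms * v_atom * n_mol_per_asu * n_asu) / v_cell

# Hypothetical example: 4000-atom molecule, 1 per ASU, 4 ASUs,
# 320000 cubic Angstrom cell -> 50% solvent
c_solv = solvent_fraction(4000, 10.0, 1, 4, 320000.0)
```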

Solvent mask: A solvent mask may be supplied; it may be used for the entire calculation or updated after several cycles. The solvent mask usually divides the cell into protein and solvent regions; however, it is also possible to specify excluded regions which are unknown. If no solvent mask is supplied, it will be calculated by a modified Wang–Leslie procedure (Wang, 1985; Leslie, 1987) and updated as the phase-improvement calculation progresses.
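The automatic mask calculation can be caricatured in one dimension: smooth the truncated density with a local averaging kernel, then flag the lowest-density fraction of grid points, given by the solvent content, as solvent. This is only a sketch of the idea — the actual Wang–Leslie procedure works in three dimensions with a weighted spherical kernel:

```python
def solvent_mask(rho, c_solv, radius=2):
    """1-D caricature of automatic solvent-mask calculation: box-smooth the
    positive part of the density (periodic boundaries), then label the
    fraction c_solv of grid points with the lowest smoothed density as
    solvent. Returns a list of booleans (True = solvent)."""
    n = len(rho)
    smooth = []
    for i in range(n):
        window = [max(rho[(i + k) % n], 0.0) for k in range(-radius, radius + 1)]
        smooth.append(sum(window) / len(window))
    cutoff = sorted(smooth)[int(c_solv * n)]   # threshold at the c_solv quantile
    return [s < cutoff for s in smooth]

# Hypothetical density: flat solvent region next to a protein "bump"
rho = [0.0] * 10 + [5.0] * 10
mask = solvent_mask(rho, 0.5)
```

Smoothing is what makes the procedure robust: individual noisy grid points no longer flip between protein and solvent, at the cost of a slightly blurred boundary.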

Averaging operators: In an averaging calculation, the averaging operators must be supplied; these are typically obtained by rotation and translation searches using a program such as AMoRe (Navaza, 1994) or X-PLOR (Brünger, 1992a). If the coordinates of several heavy atoms are known, they can be used to calculate the noncrystallographic symmetry (NCS) operators. If a partial model can be built into the density, structure-superposition programs, such as LSQKAB (Kabsch, 1976), can be used to obtain the rotation and translation matrices that relate different molecules in the asymmetric unit. This can also be achieved through the program O using the `lsq_explicit' command (Jones et al., 1991). The averaging operators can be further refined in DM by minimizing the residual between NCS-related densities.

Averaging mask: An averaging mask may be supplied; this is distinct from the solvent mask, allowing for parts of the protein to remain unaveraged if required. If no averaging mask is supplied, the mask will be calculated by a local-correlation approach (Cowtan & Main, 1998; Vellieux et al., 1995). If multiple domains are to be averaged with different averaging operators (Schuller, 1996), then one mask must be specified for each averaging domain. When averaging molecules related by improper NCS operations, the averaging mask must be in accord with the NCS operators provided. For example, if the supplied NCS matrix maps molecule A to molecule B, then the averaging mask must cover the volume occupied by molecule A rather than molecule B.

Multi-crystal averaging: In the case of a multi-crystal averaging calculation, one reflection file is provided for each crystal form (however, initial phases are not required in every crystal form), and one reflection file will be output for each crystal form containing the improved phases. One mask is required per averaging domain; thus, in general, only a single mask is required. This may be defined for any crystal form or in an arbitrary crystal space of its own. Averaging operators are then provided to map the mask into each of the crystal forms.

Solvent and averaging masks that are calculated within the program may be output for subsequent analysis. Refined averaging operators are also output. The input and output data for a simple DM calculation, a DM averaging calculation and a DMMULTI multi-crystal averaging calculation are shown in Figs. 25.2.2.1(a), (b) and (c), respectively.

Figure 25.2.2.1

(a) Input and output data for a DM calculation with no averaging. Light outlines indicate optional information. (b) Input and output data for a DM averaging calculation: for a single averaging domain, the averaging mask may be calculated automatically. For multi-domain averaging, all domain masks must be given. (c) Input and output data for DMMULTI. An averaging mask (or masks, for multiple domains) must be provided.

25.2.2.4. Choice of modes

Two major choices have to be made in a DM run: the real-space density-modification modes and the reciprocal-space phase-combination mode. A phase-extension scheme can also be selected if needed, or left to the program, which uses its default automatic mode for phase extension. The various modes are described in the following sections.

25.2.2.4.1. Density-modification modes

The following density-modification modes (specified by the MODE keyword) are provided by DM:

  • (1) Solvent flattening: This is the most common density-modification technique and is powerful for improving phases at fixed resolution, but weaker at extending phases to higher resolution. Its phasing power is highly dependent on the solvent content. Solvent flattening can be applied at comparatively low resolutions, down to around 5.0 Å.

  • (2) Histogram matching: This method is applied only to the density in the protein region. It is weaker than solvent flattening at improving phases, but much more powerful at extending phases to higher resolution, owing to a unique feature of histogram matching: its target for phase improvement is resolution dependent. The phasing power of histogram matching is inversely related to the solvent content, so it plays a more important role in phase improvement when the solvent content is low. Histogram matching works to as low as 4.0 Å, but does no harm below that. It should probably be applied as a matter of course in any case where the structure is not dominated by a large proportion of heavy-metal atoms; even in that case, histogram matching may be applied by defining a solvent mask with solvent, protein and excluded regions.

  • (3) Multi-resolution modification: This method controls the level of detail in the map as a function of resolution by applying histogram matching and solvent flattening at multiple resolutions. This strengthens phase improvement at fixed resolution, and generally improves phase-extension calculations too.

  • (4) Noncrystallographic symmetry averaging: Averaging is one of the most powerful techniques available for improving phases and is applicable even at very low resolution. In extreme cases, averaging may be used to achieve an ab initio structure solution (Chapman et al., 1992; Tsao et al., 1992). It should therefore be applied whenever NCS is present and the operators can be determined.

  • (5) Skeletonization: Iterative skeletonization is the process of tracing a `skeleton' of connected density through the map and then building a new map by filling density around this skeleton. The implementation in DM is adapted for use on poor maps, where it is sometimes, but not always, of use. To bring out side chains and missing loops, the ARP program (Lamzin & Wilson, 1997) is more suitable.

  • (6) Sayre's equation: This method is more widely used in small-molecule calculations. It is very powerful at resolutions better than 2.0 Å when there are no heavy atoms in the structure, but its phasing power falls off quickly at resolutions worse than 2.0 Å. The calculation takes significantly longer than the other density-modification modes.

The most commonly used modes are solvent flattening and histogram matching – these give a good first map in most cases. Recently, multi-resolution modification has been added to this list. Averaging is applied whenever possible. Skeletonization and Sayre's equation are generally only applied in special situations.
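Of these modes, histogram matching has the simplest core operation: a rank-order mapping of protein-region density values onto a target distribution. A minimal sketch, with a synthetic sorted target in place of the resolution-dependent histograms DM actually uses:

```python
def histogram_match(rho, target_sorted):
    """Replace each density value by the target value of the same rank, so
    the output values reproduce the target histogram exactly while keeping
    the spatial ordering of the input map.
    rho: list of protein-region density values;
    target_sorted: target values sorted in ascending order, same length."""
    order = sorted(range(len(rho)), key=lambda i: rho[i])  # ranks of input
    out = [0.0] * len(rho)
    for rank, i in enumerate(order):
        out[i] = target_sorted[rank]
    return out

matched = histogram_match([3.0, 1.0, 2.0], [10.0, 20.0, 30.0])
```

The largest input value receives the largest target value and so on down the ranking, which is why the method sharpens the density distribution without moving features around in the map.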

25.2.2.4.2. Phase-combination modes

Density-modification calculations are somewhat prone to producing grossly overestimated figures of merit (Cowtan & Main, 1996), and users should be aware of this. In general, the phases and figures of merit produced by density-modification calculations should only be used for the calculation of weighted Fo maps. They should not be used for the calculation of difference maps, or in refinement or other calculations (the REFMAC program is an exception, containing a mechanism to deal with this form of bias). The use of 2Fo − Fc-type maps should be avoided when the calculated phases come from density modification, since such maps depend on two assumptions, neither of which holds for density modification: that the current phases are very close to being correct, and that the calculated amplitudes may only approach the observed values as the phase error approaches zero.

To limit the problems of overestimation, three phase-combination modes are provided (controlled by the COMBINE keyword):

  • (1) Free-Sim weighting: This is the simplest mode to use. Although convergence is weaker than in the reflection-omit mode, the calculation never overshoots the best map. If averaging information is available, convergence is much stronger and the phase-combination scheme is much less important; in addition, phase relationships in reciprocal space limit the effectiveness of the reflection-omit scheme. The free-Sim weighting scheme should therefore usually be used when there is averaging.

  • (2) Reflection-omit: The combination of a reciprocal-space omit procedure with SIGMAA phase combination (Read, 1986) leads to much better maps when applying solvent flattening and histogram matching. However, the omit calculation is computationally costly and introduces a small amount of noise into the maps, so the phases can get worse if the calculation is run for too many cycles. A real-space free-R indicator (Abrahams & Leslie, 1996) is therefore used to stop the calculation at an appropriate point.

  • (3) Perturbation-γ correction: This new approach is an extension of the γ correction of Abrahams (1997) to arbitrary density-modification methods. The results are a good approximation to a perfect reflection-omit scheme, obtained at considerably less computational cost. This is therefore the preferred mode for all calculations.

In the case of a molecular-replacement calculation or high noncrystallographic symmetry, it may be desirable only to weight the modified phases and not to recombine them back with the initial phases so that any initial bias may be overcome. In the case of high noncrystallographic symmetry, it may also be possible to restore missing reflections in both amplitude and phase. Options are available for both these situations.

25.2.2.4.3. Phase-extension schemes

When performing phase extension, the order in which the structure factors are included affects the final accuracy of the extended phases. The phases obtained from previous cycles of phase extension are included in the calculation of new phases for the unphased structure factors in the next cycle. A reflection with a more accurately determined phase can enhance the phase-extension power of the original set of reflections, whereas one with a less accurately determined phase can corrupt that power and make the phase extension deteriorate quickly. The factors that affect phase extension are the structure-factor amplitude, the resolution shell and the figure of merit. Based on these considerations, the following phase-extension schemes are provided in DM:

  • (1) Extension by resolution shell: This performs phase extension in resolution steps, starting from the low-resolution data, and extends the phases to the high-resolution limit of the data or to a limit specified by the user. Structure factors are related by the reciprocal-space density-modification function, which is dominated by low-resolution terms, as shown by equation (15.1.3.2) and Fig. 15.1.3.1 in Chapter 15.1. This means that only structure factors in a small region of reciprocal space are related. Thus, when initial phases are only available at low resolution, phase extension is performed by inclusion of successive resolution shells. In the case of fourfold or higher NCS, this can allow extension to 2 Å starting from initial phasing at 6 Å or worse.

  • (2) Extension in structure-factor-amplitude steps: In this mode, the reflections with larger amplitudes are added first, gradually extending to those with smaller amplitudes in many steps. The contribution of a reflection to the electron density is proportional to the square of its structure-factor amplitude, according to Parseval's theorem, as shown in equation (25.2.2.2). This favours extending the stronger reflections first, so that they can be estimated more reliably; these stronger reflections are then used to phase the relatively weaker reflections that follow.

  • (3) Extension in figure-of-merit steps: To extend phases for those structure factors with experimentally measured, albeit less accurate, phases and figures of merit, the reflections can be added in order of their figure of merit, from the highest to the lowest. It is advantageous to use the more reliably estimated phases, with higher figures of merit, to phase those reflections with lower figures of merit. This can be useful when working with initial phasing from MAD or MR sources.

  • (4) Automatic mode: This combines the previous three extension schemes. The program automatically works out the optimum combination of the above three schemes according to the density-modification mode, the phase-combination mode and the nature of the input reflection data. The automatic mode is the default and is the recommended mode of choice unless specific circumstances warrant a different choice.

  • (5) All reflection mode: One advantage of the reflection-omit and perturbation-γ methods is that the strength of extrapolation of a structure-factor amplitude is a good indicator of the reliability of its corresponding phase. As a result, a phase-extension scheme is unnecessary in reflection-omit calculations; all reflections may be included from the first cycle.
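The three ordered schemes above amount to sorting the not-yet-phased reflections by different keys before admitting them in batches. A minimal sketch, purely illustrative (the function name and data layout are ours, not DM's):

```python
# Sketch (not DM's actual code): schemes (1)-(3) order the unphased
# reflections by different keys before stepwise admission.
def extension_order(reflections, scheme):
    """reflections: list of dicts with illustrative keys 'd' (resolution
    in Angstrom), 'amp' (|F|) and 'fom' (figure of merit)."""
    if scheme == "resolution":      # scheme (1): low resolution first
        key = lambda r: -r["d"]
    elif scheme == "amplitude":     # scheme (2): strong reflections first
        key = lambda r: -r["amp"]
    elif scheme == "fom":           # scheme (3): best-phased first
        key = lambda r: -r["fom"]
    else:
        raise ValueError("unknown scheme")
    return sorted(reflections, key=key)
```

The automatic mode of scheme (4) would correspond to combining these keys; scheme (5) simply skips the ordering altogether.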

25.2.2.5. Code description


The program was designed to be run largely automatically with minimal user intervention. This is achieved by using extensive default settings and by automatic selection of options based on the data used. The program is also modular by design so that additional density-modification methods can be incorporated easily.

A simplified flow diagram for DM is shown in Fig. 25.2.2.2(a)[link]. When a reflection-omit calculation is performed, an additional loop is introduced, shown in Fig. 25.2.2.2(b)[link]. The Sayre's equation calculation adds another level of complexity, described in Zhang & Main (1990b)[link]. Skeletonization imposes the protein histogram and solvent flatness implicitly and so is performed, if necessary, every second or third cycle in place of solvent flattening and histogram matching. Simplified conceptual and actual flow diagrams for DMMULTI are shown in Figs. 25.2.2.3(a)[link] and (b)[link].

Figure 25.2.2.2. (a) Flow chart for a simple DM calculation with free-Sim phase combination. (b) Flow chart for a simple DM calculation with reflection-omit phase combination.

Figure 25.2.2.3. (a) Conceptual flow chart for a DMMULTI multi-crystal calculation. (b) Actual flow chart for a DMMULTI multi-crystal calculation.

Many of the basic approaches used in DM and DMMULTI are described in Chapter 15.1[link] . Some practical aspects of the application and combination of these approaches are described here.

25.2.2.5.1. Scaling


All forms of map modification are affected by the overall temperature factor of the data, and histogram matching in particular is critically dependent on the accurate determination of the scale factor. Wilson statistics have been found inadequate for scaling in this case, especially when the data resolution is worse than 3 Å, because of the dip in scattering below 5 Å.

More accurate estimates of the scale and temperature factors may be achieved by fitting the data to a semi-empirical scattering curve (Cowtan & Main, 1998[link]). This curve is prepared using Parseval's theorem, which relates the sum of the intensities to the variance of the map: [\sigma_{\rho}^{2} = {1\over V^2}\sum\limits_{{\bf h} \neq 000}\displaystyle |F({\bf h})|^{2}. \eqno(25.2.2.2)] Thus, the sum of the intensities in a particular resolution shell is proportional to the difference in variance of maps calculated with and without that shell of data. The empirical curve is therefore calculated from the variance in the protein regions of a group of known structures, calculated as a function of resolution. The curve is scaled to the protein volume of the current structure, and a correction is made for the solvent, which is assumed to be flat.
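The discrete analogue of equation (25.2.2.2) is easy to verify numerically: for a map sampled on N grid points, the variance equals the sum of the non-origin intensities divided by N², the grid playing the role of the cell. A small numpy check (an illustration, not program code):

```python
import numpy as np

# Grid analogue of equation (25.2.2.2): the map variance equals the sum
# of the non-origin intensities divided by the squared number of grid
# points (which stands in for V^2 on this toy scale).
rho = np.random.default_rng(0).normal(size=(8, 8, 8))   # toy "density"
F = np.fft.fftn(rho)                                    # grid structure factors
parseval_variance = ((np.abs(F) ** 2).sum()
                     - abs(F.flat[0]) ** 2) / F.size ** 2
```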

The overall temperature factor is removed, and an absolute scale is imposed by fitting the data to this curve. The use of sharpened F's (with no overall temperature factor) is necessary for histogram matching and often increases the power of averaging for phase extension.

Since the solvent content is used in scaling the data, it is important that this value be entered correctly. However, the volume of the solvent mask may be varied independently of the true solvent content, as discussed in Section 25.2.2.3[link].

25.2.2.5.2. Solvent-mask determination


If the user does not supply a solvent mask, the solvent mask is calculated by Wang's (1985) method, using the reciprocal-space approach of Leslie (1987)[link]. A number of variants on this algorithm are implemented; however, the parameter that affects the quality of the solvent mask most dramatically is the radius of the smoothing function (Chapter 15.1[link] ). This parameter may be estimated empirically by [r_{\rm Wang} = 2r_{\max} \overline{w}^{1/3}, \eqno(25.2.2.3)] where [r_{\max}] is the resolution limit of the observed amplitudes, and [\overline{w}] is the mean figure of merit over the same reflections (with w = 0 for unphased reflections).
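Equation (25.2.2.3) itself is a one-liner; a small helper, sketched here for convenience (the function name is ours, not DM's):

```python
# Empirical Wang smoothing radius of equation (25.2.2.3); the function
# name is illustrative and not part of DM.
def wang_radius(r_max, mean_fom):
    """r_max: resolution limit (Angstrom) of the observed amplitudes;
    mean_fom: mean figure of merit over the same reflections
    (w = 0 for unphased reflections)."""
    return 2.0 * r_max * mean_fom ** (1.0 / 3.0)
```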

Once the smoothed map has been determined, cutoff values are chosen to divide the map into protein and solvent regions. If the protein boundary is poorly defined, the user may specify protein, solvent and excluded volumes, in which case two cutoffs are specified and the intermediate region is marked as neither protein nor solvent.
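The cutoff choice can be sketched as a quantile of the smoothed map: the solvent cutoff is the density value below which the requested fraction of grid points falls. A minimal sketch, not DM's code (the function name and labels are ours):

```python
# Sketch: divide the smoothed map into solvent and protein by choosing
# the cutoff so that the requested fraction of grid points lies below it.
def classify(smoothed, solvent_frac):
    cut = sorted(smoothed)[int(solvent_frac * len(smoothed))]
    return ["solvent" if v < cut else "protein" for v in smoothed]
```

The two-cutoff variant described in the text would use a second quantile and label the band between the cutoffs as neither protein nor solvent.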

25.2.2.5.3. Averaging-mask determination


If the user does not supply an averaging mask, it is determined by a local correlation method (Vellieux et al., 1995[link]). A large region covering 27 unit cells is selected, and the local correlation between the maps before and after rotation by one of the noncrystallographic symmetry operators is calculated. The largest contiguous region that is in agreement among different NCS operators is isolated from the local correlation map, and a finer local correlation map is calculated over this volume. This process is iterated until a good mask with a detailed boundary is found.

This approach is fully automatic, except in the case where a noncrystallographic symmetry operator intersects a crystallographic symmetry operator, in which case the mask is not uniquely defined, and some user intervention may be required. The method is robust, and by increasing the radius of the sphere within which the local correlation is calculated, it may be used with very poor maps (Cowtan & Main, 1998[link]). The method is easily extended to include information from multiple averaging operators.

25.2.2.5.4. Fourier transforms


For simplicity of coding, all Fourier transforms are performed in core using real-to-Hermitian and Hermitian-to-real fast Fourier transforms (FFTs). The data are expanded to space group P1 before calculating a map and averaged back to a reciprocal asymmetric unit after inverse transformation. Most of the map modifications preserve crystallographic symmetry, so restricted phases are not constrained except during phase combination.
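The real-to-Hermitian/Hermitian-to-real economy is the same one numpy's rfftn/irfftn offer, as this small check illustrates (it is not the program's own in-core FFT):

```python
import numpy as np

# A real map transforms to roughly half of reciprocal space; the other
# half is implied by Friedel symmetry, F(-h) = F(h)*.
rho = np.random.default_rng(2).normal(size=(6, 6, 6))
F_half = np.fft.rfftn(rho)                    # Hermitian half only
rho_back = np.fft.irfftn(F_half, s=rho.shape) # recovers the real map
```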

25.2.2.5.5. Histogram matching


The target histograms are calculated from the protein regions of several stationary-atom structures at resolutions from 6 to 1.5 Å, according to the method described by Zhang & Main (1990a)[link]. The histogram variances should be consistent with the map variances used in scaling the data. The resolution of the target histogram can be accurately matched to the data resolution by averaging the target histograms on either side of the current resolution.
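The core operation of histogram matching can be sketched as a rank transform, a simplification of the Zhang & Main procedure (names and the one-target-sample-per-point assumption are ours): each density value is replaced by the value of the same rank in the target distribution.

```python
# Minimal rank-transform sketch of histogram matching: after the
# substitution, the map values reproduce the target distribution.
def match_histogram(density, target_samples):
    order = sorted(range(len(density)), key=lambda i: density[i])
    target = sorted(target_samples)
    out = [0.0] * len(density)
    for rank, i in enumerate(order):
        out[i] = target[rank]   # same rank, target's value
    return out
```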

25.2.2.5.6. Averaging


Averaging is performed using a single-step approach (Rossmann et al., 1992[link]), in which every copy of the molecule in a `virtual' asymmetric unit is averaged with every other copy. Density values are obtained at non-grid positions using a 27-point quadratic spectral spline interpolation. A sharpened map is first calculated by dividing by the Fourier transform of the quadratic spline function. The same spline function is then convoluted with the sharpened map to obtain the density value at an arbitrary coordinate (Cowtan & Main, 1998[link]). This approach gives very accurate interpolation from a coarse grid map with relatively little computation and additionally provides gradient information for the refinement of averaging operators.
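DM's 27-point quadratic spectral spline is beyond a short sketch, but the underlying task, reading density at an off-grid position of a periodic map, can be illustrated with plain trilinear (8-point) interpolation:

```python
import numpy as np

# Off-grid density lookup on a periodic map by trilinear interpolation;
# a much simpler stand-in for the 27-point quadratic spline used by DM.
def trilinear(grid, frac):
    """grid: 3-D periodic density array; frac: fractional coordinates."""
    n = grid.shape
    x = [f * nn for f, nn in zip(frac, n)]     # grid units
    i0 = [int(np.floor(v)) for v in x]
    t = [v - i for v, i in zip(x, i0)]
    val = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                wx = 1 - t[0] if dx == 0 else t[0]
                wy = 1 - t[1] if dy == 0 else t[1]
                wz = 1 - t[2] if dz == 0 else t[2]
                val += wx * wy * wz * grid[(i0[0] + dx) % n[0],
                                           (i0[1] + dy) % n[1],
                                           (i0[2] + dz) % n[2]]
    return val

# demo map: a single peak at grid point (1, 1, 1)
peak = np.zeros((4, 4, 4))
peak[1, 1, 1] = 8.0
```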

25.2.2.5.7. Multi-crystal averaging


The multi-crystal averaging calculation in DMMULTI is equivalent to several single-crystal averaging calculations running simultaneously, with the exception that during the averaging step, the molecule density is averaged across every copy in every crystal form. This average is weighted by the mean figure of merit of each crystal form; this allows the inclusion of unphased crystal forms, since in the first cycle they will have zero weight and therefore not disrupt the phasing that is already present. In subsequent cycles, the unphased form contains phase information from the back-transformed density.
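The figure-of-merit weighting can be sketched in a few lines (a pure illustration: density copies are flattened to lists, and the names are ours):

```python
# FOM-weighted mean of density copies from several crystal forms; an
# unphased form (weight 0) contributes nothing on the first cycle,
# exactly as described in the text.
def weighted_average(copies, mean_foms):
    wsum = float(sum(mean_foms))
    return [sum(w * c[i] for w, c in zip(mean_foms, copies)) / wsum
            for i in range(len(copies[0]))]
```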

This technique can be extremely useful, since adding a new crystal form usually provides considerably more phase information than adding a new derivative if the cross-rotation and translation functions can be solved.

In the multi-crystal case, averaging is performed using a two-step approach, first building an averaged molecule from all the copies in all crystal forms, then replacing the density in each crystal form with the averaged values. This approach is computationally more efficient when there are many copies of the molecule.

The conceptual flow chart of simultaneous density-modification calculations across multiple crystal forms is shown in Fig. 25.2.2.3(a)[link]; in practice, this scheme is implemented using a single process and looping over every crystal form at each stage (Fig. 25.2.2.3b)[link]. Maps are reconstructed from a large data object containing all the reflection data in every crystal form. Averaging is performed using a second data object containing maps of each averaging domain. By this means, an arbitrary number of domains may be averaged across an arbitrary number of crystal forms.

Multi-crystal averaging has been particularly successful in solving structures from very weak initial phasing, since the data redundancy is usually higher than for single-crystal problems.

25.2.3. The structure-determination language of the Crystallography & NMR System

A. T. Brunger,v* P. D. Adams,e W. L. DeLano,f P. Gros,g R. W. Grosse-Kunstleve,e J.-S. Jiang,h N. S. Pannu,i R. J. Read,j L. M. Ricek and T. Simonsonl

25.2.3.1. Introduction


We have developed a new and advanced software system, the Crystallography & NMR System (CNS), for crystallographic and NMR structure determination (Brünger et al., 1998[link]). The goals of CNS are: (1) to create a flexible computational framework for exploration of new approaches to structure determination; (2) to provide tools for structure solution of difficult or large structures; (3) to develop models for analysing structural and dynamical properties of macromolecules; and (4) to integrate all sources of information into all stages of the structure-determination process.

To meet these goals, algorithms were moved from the source code into a symbolic structure-determination language which represents a new concept in computational crystallography. The high-level CNS computing language allows definition of symbolic target functions, data structures, procedures and modules. The CNS program acts as an interpreter for the high-level CNS language and includes hard-wired functions for efficient processing of computing-intensive tasks. Methods and algorithms are therefore more clearly defined and easier to adapt to new and challenging problems. The result is a multi-level system which provides maximum flexibility to the user (Fig. 25.2.3.1)[link]. The CNS language provides a common framework for nearly all computational procedures of structure determination. A comprehensive set of crystallographic procedures for phasing, density modification and refinement has been implemented in this language. Task-oriented input files written in the CNS language, which can also be accessed through an HTML graphical interface (Graham, 1995[link]), are available to carry out these procedures.

Figure 25.2.3.1. CNS consists of five layers which are under user control. The high-level HTML graphical interface interacts with the task-oriented input files. The task files use the CNS language and the modules. The modules contain CNS language statements. The CNS language is interpreted by the CNS Fortran77 program. The program performs the data manipulations, data operations and `hard-wired' algorithms.

25.2.3.2. The CNS language


One of the key features of the CNS language is symbolic data structure manipulation, for example,

  xray
    do (pa=-2*(amplitude(fp)^2 + amplitude(fh)^2
               - amplitude(fph)^2)*amplitude(fp)
               *real(fh)/(3*v^2 + 4*(amplitude(fph)^2
               + sph^2)*v)) (acentric)
  end                                                  (25.2.3.1)

which is equivalent to the following mathematical expression for all acentric indices h, [p_{a}({\bf h}) = {-2\, [|{\bf f}_{p}({\bf h})|^{2} + |{\bf f}_{h}({\bf h})|^{2} - |{\bf f}_{ph}({\bf h})|^{2}]\, |{\bf f}_{p}({\bf h})|\, \{[{\bf f}_{h}({\bf h}) + {\bf f}_{h}({\bf h})^{*}]/2\} \over 3 v({\bf h})^{2} + 4 [|{\bf f}_{ph}({\bf h})|^{2} + s_{ph}({\bf h})^{2}]\, v({\bf h})}, \eqno(25.2.3.2)] where [{\bf f}_{p}] [`fp' in equation (25.2.3.1)[link]] is the `native' structure-factor array, [{\bf f}_{ph}] [`fph' in equation (25.2.3.1)[link]] is the derivative structure-factor array, [s_{ph}] [`sph' in equation (25.2.3.1)[link]] is the corresponding experimental σ, v is the expectation value for the lack of closure (including lack of isomorphism and errors in the heavy-atom model), and [{\bf f}_{h}] [`fh' in equation (25.2.3.1)[link]] is the calculated heavy-atom structure-factor array. This expression computes the [A_{\rm iso}] coefficient of the phase probability distribution for single isomorphous replacement described by Hendrickson & Lattman (1970)[link] and Blundell & Johnson (1976)[link].

The expression in equation (25.2.3.1)[link] is computed for the specified subset of reflections `(acentric)'. This means that only the selected (in this case all acentric) reflections are used. More sophisticated selections are possible, e.g.

  (amplitude(fp) > 2*sh and amplitude(fph) > 2*sph
   and d >= 3)                                         (25.2.3.3)

selects all reflections with Bragg spacing, d, of at least 3 Å for which both native (fp) and derivative (fph) amplitudes are greater than two times their corresponding σ values (`sh' and `sph', respectively). Extensive use of this structure-factor selection facility is made for cross-validating statistical properties, such as R values (Brünger, 1992b)[link], [\sigma_{A}] values (Kleywegt & Brünger, 1996[link]; Read, 1997[link]) and maximum-likelihood functions (Pannu & Read, 1996a[link]; Adams et al., 1997[link]).
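A selection such as the one in equation (25.2.3.3) behaves like a boolean mask over parallel reflection arrays. In numpy terms (the four reflections and their σ values are invented for illustration):

```python
import numpy as np

# The CNS reflection selection acts like this boolean mask; the data
# values below are made up purely for illustration.
fp  = np.array([10.0, 1.0, 12.0, 9.0])    # native amplitudes
sh  = np.array([1.0, 1.0, 1.0, 1.0])      # native sigmas
fph = np.array([11.0, 11.0, 1.0, 10.0])   # derivative amplitudes
sph = np.array([1.0, 1.0, 1.0, 1.0])      # derivative sigmas
d   = np.array([4.0, 4.0, 4.0, 2.0])      # Bragg spacing (Angstrom)

selected = (fp > 2 * sh) & (fph > 2 * sph) & (d >= 3)
```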

Similar operations exist for electron-density maps, e.g.

  xray
    do (map=0) (map < 0.1)
  end                                                  (25.2.3.4)

is an example of a truncation operation: all map values less than 0.1 are set to 0. Atoms can be selected based on a number of atomic properties and descriptors, e.g.

  do (b=10) (residue 1:40 and
     (name ca or name n or name c or name o))          (25.2.3.5)

sets the B factors of all polypeptide backbone atoms of residues 1 through 40 to 10 Å².

Operations exist between data structures, e.g. real- and reciprocal-space arrays, and atom properties. For example, Fourier transformations between real and reciprocal space can be accomplished by the following CNS commands:

  xray
    mapresolution infinity 3.
    fft grid 0.3333 end
    do (map=ft(f_cal)) (acentric)
  end                                                  (25.2.3.6)

which computes a map on a 1 Å grid by Fourier transformation of the `f_cal' array for all acentric reflections.

Atoms can be associated with calculated structure factors, e.g.

  associate f_cal (residue 1:50)                       (25.2.3.7)

This statement will associate the reciprocal-space array `f_cal' with the atoms belonging to residues 1 through 50. These structure-factor associations are used in the symbolic target functions described below.

There are no predefined reciprocal- or real-space arrays in CNS. Dynamic memory allocation allows one to carry out operations on arbitrarily large data sets with many individual entries (e.g. derivative diffraction data) without the need to recompile the source code. The various reciprocal-space structure-factor arrays must therefore be declared and their type specified prior to invoking them. For example, a reciprocal-space array with real values, such as observed amplitudes, is declared by

  declare name=fobs type=real domain=reciprocal end    (25.2.3.8)

Reciprocal-space arrays can be grouped. For example, Hendrickson & Lattman (1970)[link] coefficients are represented as a group of four reciprocal-space structure-factor arrays,

  group type=hl object=pa object=pb
        object=pc object=pd end                        (25.2.3.9)

where `pa', `pb', `pc' and `pd' refer to the individual arrays. This group statement indicates to CNS that the specified arrays need to be transformed together when reflection indices are changed, e.g. during expansion of the diffraction data to space group P1.

25.2.3.3. Symbols and parameters


The CNS language supports two types of data elements which may be used to store and retrieve information. Symbols are typed variables, such as numbers, character strings of restricted length and logicals. Parameters are untyped data elements of arbitrary length that may contain collections of CNS commands, numbers, strings, or symbols.

Symbols are denoted by a dollar sign ($), and parameters by an ampersand (&). Symbols and parameters may contain a single data element, or they may be a compound data structure of arbitrary complexity. The hierarchy of these data structures is denoted using a period (.). Figs. 25.2.3.2(a)[link] and (b)[link] demonstrate how crystal-lattice information can be stored in compound symbols and parameters, respectively. The information stored in symbols or parameters can be retrieved by simply referring to them within a CNS command: the symbol or parameter name is substituted by its content. Symbol substitution of portions of the compound names (e.g. `&crystal_lattice.unit_cell.$para') allows one to carry out conditional and iterative operations on data structures, such as matrix multiplication.

Figure 25.2.3.2. Examples of compound symbols and compound parameters. (a) The `evaluate' statement is used to define typed symbols (strings, numbers and logicals). Symbol names are in bold. (b) The `define' statement is used to define untyped parameters. Each parameter entry is terminated by a semicolon. The compound base name `crystal_lattice' has a number of sub-levels, such as `space_group' and the `unit_cell' parameters. `unit_cell' is itself base to a number of sub-levels, such as `a' and `alpha'. Parameter names are in bold.

25.2.3.4. Statistical functions


The CNS language contains a number of statistical operations, such as binwise averages and summations. The resolution bins are defined by a central facility in CNS.

Fig. 25.2.3.3[link] shows how [\sigma_{A}], [\sigma_{\Delta}] and D (Read, 1986[link], 1990[link]) are computed from the observed structure factors (`fobs') and the calculated model structure factors (`fcalc') using the CNS statistical operations. The first five operations are performed for the reflections in the test set, while the last three operations expand the results to all reflections. The `norm' function computes normalized structure-factor amplitudes for the specified arguments. The `sigacv' function evaluates [\sigma_{A}] from the normalized structure factors. The `save' function computes the statistical average [\hbox{save}(f) = {\textstyle\sum_{hkl}\, f_{hkl} (w/\varepsilon) \over \textstyle\sum_{hkl} w}, \eqno(25.2.3.10)] where w is 1 and 2 for centric and acentric reflections, respectively, and ε is the statistical weight. The averages are computed binwise, and the result for a particular bin is stored in all selected reflections belonging to the bin.
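The binwise `save' average of equation (25.2.3.10) can be sketched in plain Python (the function name and argument layout are ours):

```python
# Binwise `save' average of equation (25.2.3.10): within each bin, sum
# f*(w/eps) and divide by the sum of w, then write the bin result back
# to every member reflection.
def save_average(f, w, eps, bins, nbins):
    out = [0.0] * len(f)
    for b in range(nbins):
        idx = [i for i in range(len(f)) if bins[i] == b]
        num = sum(f[i] * w[i] / eps[i] for i in idx)
        den = sum(w[i] for i in idx)
        for i in idx:
            out[i] = num / den
    return out
```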

Figure 25.2.3.3. Example of statistical operations provided by the CNS language. `norm', `sigacv', `save' and `sum' are functions that are computed internally by the CNS program. Binwise operations are in italics (`sigacv', `save' and `sum'). The result for a particular bin is stored in all elements belonging to the bin. The [\sigma_{A}] (`sigmaA') parameters are computed in binwise resolution shells. The [\sigma_{\Delta}] (`sigmaD') and D parameters are then computed from [\sigma_{A}] and binwise averages involving [|{\bf F}_{o}|^{2}] and [|{\bf F}_{c}|^{2}]. The binwise results are expanded to all reflections by the last three statements. `test' is an array that is 1 for all reflections in the test set and 0 otherwise. `sum' is a binwise operation on all reflections with the same partitioning used for the test set.

25.2.3.5. Symbolic target function


One of the key innovative features of CNS is the ability to define target functions and their first derivatives symbolically for crystallographic searches and refinement. This makes it convenient to implement new crystallographic methodologies as they are developed.

The power of symbolic target functions is illustrated by two examples. In the first example, a target function is defined for simultaneous heavy-atom parameter refinement of three derivatives. The sites for each of the three derivatives can be disjoint or identical, depending on the particular situation. For simplicity, the Blow & Crick (1959)[link] approach is used, although maximum-likelihood targets are also possible (see below). The heavy-atom sites are refined against the target [\textstyle\sum_{hkl} {(|{\bf F}_{h_{1}} + {\bf F}_{p}| - |{\bf F}_{ph_{1}}|)^{2} \over 2 v_{1}} + {(|{\bf F}_{h_{2}} + {\bf F}_{p}| - |{\bf F}_{ph_{2}}|)^{2} \over 2 v_{2}} + {(|{\bf F}_{h_{3}} + {\bf F}_{p}| - |{\bf F}_{ph_{3}}|)^{2} \over 2 v_{3}}. \eqno(25.2.3.11)]

[{\bf F}_{h_{1}}], [{\bf F}_{h_{2}}] and [{\bf F}_{h_{3}}] are complex structure factors corresponding to the three sets of heavy-atom sites, [{\bf F}_{p}] represents the structure factors of the native crystal, [|{\bf F}_{ph_{1}}|], [|{\bf F}_{ph_{2}}|] and [|{\bf F}_{ph_{3}}|] are the structure-factor amplitudes of the derivatives, and [v_{1}], [v_{2}] and [v_{3}] are the variances of the three lack-of-closure expressions. The corresponding target expression and its first derivatives with respect to the calculated structure factors are shown in Fig. 25.2.3.4(a)[link]. The derivatives of the target function with respect to each of the three associated structure-factor arrays are specified with the `dtarget' expressions. The `tselection' statement specifies the selected subset of reflections to be used in the target function (e.g. excluding outliers), and the `cvselection' statement specifies a subset of reflections to be used for cross-validation (Brünger, 1992b[link]) (i.e. the subset is not used during refinement but only as a monitor for the progress of refinement).
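Numerically, each term of equation (25.2.3.11) is a lack-of-closure residual. A pure-Python sketch of the same arithmetic (not the CNS target code; names are ours):

```python
# Blow & Crick style target of equation (25.2.3.11): sum over
# derivatives and reflections of (|F_h + F_p| - |F_ph|)^2 / 2v.
def blow_crick_target(fp, derivs):
    """fp: complex native structure factors; derivs: list of
    (fh, fph, v) tuples, with fh the complex calculated heavy-atom
    structure factors, fph the observed derivative amplitudes and
    v the lack-of-closure variances."""
    total = 0.0
    for fh, fph, v in derivs:
        for fp_h, fh_h, fph_h, v_h in zip(fp, fh, fph, v):
            lack = abs(fp_h + fh_h) - fph_h   # lack of closure
            total += lack * lack / (2.0 * v_h)
    return total
```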

Figure 25.2.3.4. Examples of symbolic definition of a refinement target function and its derivatives with respect to the calculated structure-factor arrays. (a) Simultaneous refinement of heavy-atom sites of three derivatives. The target function is defined by the `target' expression. `f_h_1', `f_h_2' and `f_h_3' (in bold) are complex structure factors corresponding to three sets of heavy atoms that are specified using atom selections [equation (25.2.3.7)[link]]. The target function and its derivatives with respect to the three structure-factor arrays are defined symbolically using the structure-factor amplitudes of the native crystal, `f_p', those of the derivatives, `f_ph_1', `f_ph_2', `f_ph_3', the complex structure factors of the heavy-atom models, `f_h_1', `f_h_2', `f_h_3', and the corresponding lack-of-closure variances, `v_1', `v_2' and `v_3'. The summation over the selected structure factors (`tselection') is performed implicitly. (b) Refinement of two independent models against perfectly twinned data. `fcalc1' and `fcalc2' are complex structure factors for the models that are related by a twinning operation (in bold). The target function and its derivatives with respect to the two structure-factor arrays are explicitly defined.

The second example is the refinement of a perfectly twinned crystal with overlapping reflections from two independent crystal lattices. Refinement of the model is carried out against the residual [\textstyle\sum\limits_{hkl}\displaystyle |{\bf F}_{\rm obs} |- (|{\bf F}_{\rm calc1}|^{2} + |{\bf F}_{\rm calc2}|^{2})^{1/2}. \eqno(25.2.3.12)] The symbolic definition of this target is shown in Fig. 25.2.3.4(b)[link]. The twinning operation itself is imposed as a relationship between the two sets of selected atoms (not shown). This example assumes that the two calculated structure-factor arrays (`fcalc1' and `fcalc2') that correspond to the two lattices have been appropriately scaled with respect to the observed structure factors, and the twinning fractions have been incorporated into the scale factors. However, a more sophisticated target function could be defined which incorporates scaling.
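The residual of equation (25.2.3.12) can be sketched directly (illustrative only; as the text states, scales and twin fractions are assumed already applied):

```python
# Perfect-twin residual of equation (25.2.3.12): the observed amplitude
# compared against the quadrature sum of the two lattice contributions.
def twin_residual(fobs, fcalc1, fcalc2):
    return sum(fo - (abs(f1) ** 2 + abs(f2) ** 2) ** 0.5
               for fo, f1, f2 in zip(fobs, fcalc1, fcalc2))
```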

A major advantage of the symbolic definition of the target function and its derivatives is that any arbitrary function of structure-factor arrays can be used. This means that the scope of possible targets is not limited to least-squares targets. Symbolic definition of numerical integration over unknown variables (such as phase angles) is also possible. Thus, even complicated maximum-likelihood target functions (Bricogne, 1984[link]; Otwinowski, 1991[link]; Pannu & Read, 1996a[link]; Pannu et al., 1998[link]) can be defined using the CNS language. This is particularly valuable at the prototype stage. For greater efficiency, the standard maximum-likelihood targets are provided through CNS source code which can be accessed as functions in the CNS language. For example, the maximum-likelihood target function MLF (Pannu & Read, 1996a[link]) and its derivative with respect to the calculated structure factors are defined as

  target=(mlf(fobs,sigma,(fcalc+fbulk),d,sigma_delta))
  dtarget=(dmlf(fobs,sigma,(fcalc+fbulk),d,sigma_delta))    (25.2.3.13)

where `mlf( )' and `dmlf( )' refer to internal maximum-likelihood functions, `fobs' and `sigma' are the observed structure-factor amplitudes and corresponding σ values, `fcalc' is the (complex) calculated structure-factor array, `fbulk' is the structure-factor array for a bulk solvent model, and `d' and `sigma_delta' are the cross-validated D and [\sigma_{\Delta}] functions (Read, 1990[link]; Kleywegt & Brünger, 1996[link]; Read, 1997[link]), which are precomputed, using the test set of reflections, prior to invoking the MLF target function. The availability of internal Fortran subroutines for the most computing-intensive target functions, together with the symbolic definitions involving structure-factor arrays, allows for maximal flexibility and efficiency.
Other examples of available maximum-likelihood target functions include MLI (intensity-based maximum-likelihood refinement), MLHL [crystallographic model refinement with prior phase information (Pannu et al., 1998[link])], and maximum-likelihood heavy-atom parameter refinement for multiple isomorphous replacement (Otwinowski, 1991[link]) and MAD phasing (Hendrickson, 1991[link]; Burling et al., 1996[link]). Work is in progress to define target functions that include correlations between different heavy-atom derivatives (Read, 1994[link]).

25.2.3.6. Modules and procedures


Modules exist as separate files and contain collections of CNS commands related to a particular task. In contrast, procedures can be defined and invoked from within any file. Modules and procedures share a similar parameter-passing mechanism for both input and output. Modules and procedures make it possible to write programs in the CNS language in a manner similar to that of a computing language, such as Fortran or C. CNS modules and procedures have defined sets of input (and output) parameters that are passed into them (or returned) when they are invoked. This enables long collections of CNS language statements to be broken down into modules for greater clarity of the underlying algorithm.

Parameters passed into a module or procedure inherit the scope of the calling task file or module, and thus they exhibit a behaviour analogous to most computing languages. Symbols defined within a module or procedure are purely local variables.

The following example shows how the unit-cell parameters defined above (Fig. 25.2.3.2b)[link] are passed into a module named `compute_unit_cell_volume' (Fig. 25.2.3.5)[link], which computes the volume of the unit cell from the crystal-lattice parameters using well established formulae (Stout & Jensen, 1989[link]):

  @compute_unit_cell_volume
     (cell=&crystal_lattice.unit_cell;
      volume=$cell_volume;)                            (25.2.3.14)

The parameter `volume' is equated to the symbol `$cell_volume' upon invocation in order to return the result (the unit-cell volume) from this module. Note that the use of compound parameters to define the crystal-lattice parameters (Fig. 25.2.3.2b)[link] provides a convenient way to pass all required information into the module by referring to the base name of the compound parameter (`&crystal_lattice.unit_cell') instead of having to specify each individual data element.
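The formula the module encapsulates is the standard triclinic cell volume (Stout & Jensen, 1989); a Python rendering of the same arithmetic, for reference:

```python
import math

# Triclinic unit-cell volume:
# V = abc [1 - cos^2(alpha) - cos^2(beta) - cos^2(gamma)
#          + 2 cos(alpha) cos(beta) cos(gamma)]^(1/2)
def unit_cell_volume(a, b, c, alpha, beta, gamma):
    """Cell edges in Angstrom, cell angles in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1.0 - ca * ca - cb * cb - cg * cg
                                 + 2.0 * ca * cb * cg)
```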

Figure 25.2.3.5. Use of compound parameters within a module. This module computes the unit-cell volume (Stout & Jensen, 1989[link]) from the unit-cell geometry. Input and output parameter base names are in bold. Local symbols, such as cabg.1, are defined through `evaluate' statements. The result is stored in the parameter `&volume' which is passed to the invoking task file or module.

Fig. 25.2.3.6(a)[link] shows another example of a CNS module: the module named `phase_distribution' computes phase probability distributions using the Hendrickson & Lattman formalism (Hendrickson & Lattman, 1970[link]; Hendrickson, 1979[link]; Blundell & Johnson, 1976[link]). An example for invoking the module is shown in Fig. 25.2.3.6(b)[link]. This module could be called from task files that need access to isomorphous phase probability distributions. It would be straightforward to change the module in order to compute different expressions for the phase probability distributions.

Figure 25.2.3.6. Example of (a) a CNS module and (b) the corresponding module invocation. Input and output parameters are in bold. The module invocation is performed by specifying the `@' character, followed by the name of the module file and the module parameter substitutions. The ampersand (&) indicates that the particular symbol (e.g. `&fp') is substituted with the specified value in the invocation statement [e.g. `fobs' in the case of `&fp' in (b)]. The module parameter substitution is performed literally, and any string of characters between the equal sign and the semicolon will be substituted.

A large number of additional modules are available for crystallographic phasing and refinement. CNS library modules include space-group information, Gaussian atomic form factors, anomalous-scattering components, and molecular parameter and topology databases.

25.2.3.7. Task files


Task files consist of CNS language statements and module invocations. The CNS language permits the design and execution of nearly any numerical task in X-ray crystallographic structure determination using a minimal set of `hard-wired' functions and routines. A list of the currently available crystallographic procedures and features is shown in Fig. 25.2.3.7[link].

Figure 25.2.3.7. Procedures and features available in CNS for structure determination by X-ray crystallography.

Each task file is divided into two main sections: the initial parameter definitions and the main body of the task file. The definition section contains definitions of all CNS parameters that are used in the main body of the task file. Modification of the main body of the file is not required, but may be done by experienced users in order to experiment with new algorithms. The definition section also contains the directives that specify HTML features, e.g. text comments (indicated by {* ... *}), user-modifiable fields (indicated by {===>}) and choice boxes (indicated by {+ choice: ... +}). Fig. 25.2.3.8[link] shows a portion of the `define' section of a typical CNS refinement task file.

Figure 25.2.3.8. Example of a typical CNS task file: a section of the top portion of the simulated-annealing refinement protocol which contains the definition of various parameters that are needed in the main body of the task file. Each parameter is indicated by a name, an equal sign and an arbitrary sequence of characters terminated by a semicolon (e.g. `a=61.76;'). The top portion of each task file also contains commands for the HTML interface embedded in comment fields (indicated by braces, { ... }). The commands that can be modified by the user in the HTML form are in bold.
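The `name=value;' statement format lends itself to mechanical extraction. A hypothetical Python sketch (this is not the actual CNS parser; the regular expression is our simplification of the format described above):

```python
import re

def parse_define_section(text):
    """Collect name=value; parameter statements from a CNS-style define
    section.  User-modifiable fields are marked by a {===>} directive in
    the surrounding comments; this sketch simply gathers every pair."""
    return {name: value.strip()
            for name, value in re.findall(r'(\w+)\s*=\s*([^;]*);', text)}

sample = """
{===>} a=61.76;
{===>} b=40.73;
{===>} sg="P2(1)2(1)2(1)";
"""
print(parse_define_section(sample))
```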

The task files produce a number of output files (e.g. coordinate, reflection, graphing and analysis files). Comprehensive information about the input parameters and the results of the task is provided in these output files. In this way, the majority of the information required to reproduce the structure determination is kept with the results. Analysis data are often given in simple columns and rows of numbers. These data files can be used for graphing, for example with commonly available spreadsheet programs. An HTML graphical output feature for CNS which makes use of these analysis files is planned. In addition, list files are often produced that contain a synopsis of the calculation.

25.2.3.8. HTML interface


The HTML graphical interface uses HTML to create a high-level menu-driven environment for CNS (Fig. 25.2.3.9)[link]. Compact and relatively simple Common Gateway Interface (CGI) conversion scripts are available that transform a task file into a form page and the edited form page back into a task file (Fig. 25.2.3.10[link]). These conversion scripts are written in PERL.
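The round trip can be illustrated schematically. The Python sketch below is not the actual Perl implementation; it only mimics the idea: `{===>}' fields become HTML inputs, and edited form values are substituted back literally, as the CNS language requires:

```python
import re

FIELD = re.compile(r'\{===>\}\s*(\w+)\s*=\s*([^;]*);')

def task_to_form(task_text):
    """Render one text input per user-modifiable {===>} statement."""
    return "\n".join('<input type="text" name="%s" value="%s">' % (n, v)
                     for n, v in FIELD.findall(task_text))

def form_to_task(task_text, form_values):
    """Substitute edited form values back into the task file."""
    def repl(m):
        name, old = m.group(1), m.group(2)
        return '{===>} %s=%s;' % (name, form_values.get(name, old))
    return FIELD.sub(repl, task_text)

task = '{===>} a=61.76;\n{===>} b=40.73;'
print(task_to_form(task))
print(form_to_task(task, {'a': '62.00'}))
```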

Figure 25.2.3.9. Example of a CNS HTML form page. This particular example corresponds to the task file in Fig. 25.2.3.8[link].

Figure 25.2.3.10. Use of the CNS HTML form page interface, emphasizing the correspondence between input fields in the form page and parameters in the task file.

A comprehensive collection of task files is available for crystallographic phasing and refinement (Fig. 25.2.3.7)[link]. New task files can be created, or existing ones modified, in order to address problems that are not currently met by the distributed collection of task files. The HTML graphical interface thus provides a common interface for distributed and `personal' CNS task files (Fig. 25.2.3.10[link]).

25.2.3.9. Example: combined maximum-likelihood and simulated-annealing refinement


CNS has a comprehensive task file for simulated-annealing refinement of crystal structures using Cartesian (Brünger et al., 1987[link]; Brünger, 1988[link]) or torsion-angle molecular dynamics (Rice & Brünger, 1994[link]). This task file automatically computes cross-validated σA estimates, determines the weighting scheme between the X-ray refinement target function and the geometric energy function (Brünger et al., 1989[link]), refines a flat bulk solvent model (Jiang & Brünger, 1994[link]) and an overall anisotropic B value for the model by least-squares minimization, and subsequently refines the atomic positions by simulated annealing. Options are available for specification of alternate conformations, multiple conformers (Burling & Brünger, 1994[link]), noncrystallographic symmetry constraints and restraints (Weis et al., 1990[link]), and `flat' solvent models (Jiang & Brünger, 1994[link]). Available target functions include the maximum-likelihood functions MLF, MLI and MLHL (Pannu & Read, 1996a[link]; Adams et al., 1997[link]; Pannu et al., 1998[link]). The user can choose between slow cooling (Brünger et al., 1990[link]) and constant-temperature simulated annealing, as well as the rate of cooling and the length of the annealing scheme. For a review of simulated annealing in X-ray crystallography, see Brünger et al. (1997)[link].
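The distinction between slow cooling and constant-temperature annealing is easy to illustrate with a toy Metropolis walk. The schedule values and the energy scale below are purely illustrative, not CNS defaults, and CNS anneals a molecular-dynamics simulation rather than a random walk:

```python
import math
import random

def slow_cool_schedule(t_start=5000.0, t_end=300.0, step=25.0):
    """Linearly decreasing temperature schedule for slow cooling."""
    t = t_start
    while t >= t_end:
        yield t
        t -= step

def anneal(energy, x0, schedule, step_size=0.1, scale=1000.0, seed=0):
    """Toy one-dimensional Metropolis annealer: uphill moves are accepted
    with probability exp(-scale * dE / T), so early high temperatures let
    the walk escape local minima and late low temperatures lock it in."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    for t in schedule:
        xn = x + rng.uniform(-step_size, step_size)
        en = energy(xn)
        if en < e or rng.random() < math.exp(-scale * (en - e) / t):
            x, e = xn, en
    return x

x = anneal(lambda v: (v - 1.0) ** 2, 4.0, slow_cool_schedule())
print(round(x, 2))
```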

During simulated-annealing refinement, the model can be significantly improved. Therefore, it becomes important to recalculate the cross-validated σA error estimates (Kleywegt & Brünger, 1996[link]; Read, 1997[link]) and the weight between the X-ray diffraction target function and the geometric energy function in the course of the refinement (Adams et al., 1997[link]). This is particularly important for the maximum-likelihood target functions, which depend on the cross-validated σA error estimates. In the simulated-annealing task file, the recalculation of σA values and, subsequently, of the weight for the crystallographic energy term is carried out after initial energy minimization and again after molecular-dynamics simulated annealing.

25.2.3.10. Conclusions


CNS is a general system for structure determination by X-ray crystallography and solution NMR. It covers the whole spectrum of methods used to solve X-ray or solution NMR structures. The multi-layer architecture allows use of the system with different levels of expertise. The HTML interface allows the novice to perform standard tasks. The interface also provides a convenient means of editing complicated task files, even for the expert (Fig. 25.2.3.10[link]). This graphical interface makes it less likely that an important parameter will be overlooked when editing the file. In addition, the graphical interface can be used with any task file, not just the standard distributed ones. HTML-based documentation and graphical output are planned for the future.

Most operations within a crystallographic algorithm are defined through modules and task files. This allows for the development of new algorithms and for existing algorithms to be precisely defined and easily modified without the need for source-code modifications.

The hierarchical structure of CNS allows extensive testing at each level. For example, once the source code and CNS basic commands have been tested, testing of the modules and task files is performed. A test suite consisting of more than a hundred test cases is frequently evaluated during CNS development in order to detect and correct programming errors. Furthermore, this suite is run on several hardware platforms in order to detect any machine-specific errors. This testing scheme makes CNS highly reliable.

Algorithms can be readily understood by inspecting the modules or task files. This self-documenting feature of the modules provides a powerful teaching tool. Users can easily interpret an algorithm and compare it with published methods in the literature. To our knowledge, CNS is the only system that enables one to define symbolically any target function for a broad range of applications, from heavy-atom phasing or molecular-replacement searches to atomic resolution refinement.

25.2.4. The TNT refinement package

D. E. Tronrudm* and L. F. Ten Eycky

25.2.4.1. Scope and function of the package


TNT (Tronrud et al., 1987[link]) is a computer program package that optimizes the parameters of a molecular model given a set of observations and indicates the location of errors that it cannot correct. Its authors presume the principal set of observations to be the structure factors observed in a single-crystal diffraction experiment. To complement such a data set, which for most macromolecules has limitations, stereochemical restraints such as standard bond lengths and angles are also used as observations.

A molecule is parameterized as a set of atoms, each with a position in space, an isotropic B factor and an occupancy. The complete model also includes an overall scale factor, which converts the arbitrary units of the measured structure factors to e Å−3, and a two-parameter model of the electron density of the bulk solvent.
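These parameters enter the model through the standard structure-factor expression, in which each atom's contribution is weighted by its occupancy and attenuated by exp(−B s²/4). The toy direct-summation sketch below (P1, orthogonal axes, constant form factors) is for illustration only; TNT itself performs these calculations with space-group-optimized FFTs:

```python
import cmath
import math

def structure_factor(hkl, atoms, cell, scale=1.0):
    """Direct-summation structure factor for a P1 cell with orthogonal
    axes.  Each atom is (f0, occupancy, B, (x, y, z)) with fractional
    coordinates; the isotropic B factor attenuates by exp(-B s^2 / 4),
    where s = 1/d = 2 sin(theta)/lambda."""
    h, k, l = hkl
    a, b, c = cell
    s2 = (h / a) ** 2 + (k / b) ** 2 + (l / c) ** 2
    total = 0j
    for f0, occ, B, (x, y, z) in atoms:
        total += (occ * f0 * math.exp(-B * s2 / 4.0)
                  * cmath.exp(2j * math.pi * (h * x + k * y + l * z)))
    return scale * total

atoms = [(6.0, 1.0, 20.0, (0.1, 0.2, 0.3)),   # e.g. a carbon, full occupancy
         (8.0, 0.5, 15.0, (0.4, 0.4, 0.1))]   # e.g. an oxygen, half occupancy
print(abs(structure_factor((1, 0, 0), atoms, (60.0, 40.0, 30.0))))
```

A quick check of such code is F(000), which must equal the sum of occupancy-weighted form factors.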

Because a TNT model of a macromolecule does not allow anisotropic B factors, TNT cannot be used to finish the refinement of any structure that diffracts to high enough resolution to justify the use of these parameters. If one has a crystal that diffracts to 1.4 Å or better, the final model should probably include these parameters and TNT cannot be used. One may still use TNT in the early stages of such a refinement because one usually begins with only isotropic B's.

At the other extreme of resolution, TNT begins to break down with data sets limited to about 3.5 Å resolution. This breakdown occurs for two reasons. First, at 3.5 Å resolution, the maps can no longer resolve β-sheet strands or α-helices. The refinement of a model against data of such low resolution requires strong restraints on dihedral angles and hydrogen bonds – tasks for which TNT is not well suited. Second, the errors in an initial model constructed with only 3.5 Å data are usually of such a magnitude and quality that the function minimizer in TNT cannot correct them.

25.2.4.2. Historical context


The design of TNT began in the late 1970s, and the first publishable models were generated by TNT in 1981 (Holmes & Matthews, 1981[link]). Its design was greatly influenced by observations of the strengths and weaknesses of the programs then available.

The first refinement of a protein model was performed by Jensen and co-workers at the University of Washington (Watenpaugh et al., 1973[link]). This structure refinement was atypical because of the availability of high-resolution data. The techniques of small-molecule least-squares refinement were simply applied to this much larger model. Since many of the calculations were performed manually, no comprehensive software package was created for distribution.

It was quickly realized that for macromolecular refinement to become common, the calculations had to be fully automated and ideal stereochemistry had to be enforced. In the late 1970s, four programs became available, all of which automated the refinement calculations, but each of which enforced stereochemistry in a different way. They were PROLSQ (Hendrickson & Konnert, 1980[link]), EREF (Jack & Levitt, 1978[link]), CORELS (Sussman et al., 1977[link]) and FFTSF (Agarwal, 1978[link]). PROLSQ was, ultimately, the most popular.

At one end of the spectrum lay FFTSF. This program optimized its models to the diffraction data while completely ignoring ideal geometry. Following a number of iterations of optimizing the fit of the model to the structure factors, the geometry was idealized by running a separate program. At the other extreme was CORELS. It optimized its models to the diffraction data while allowing no deviations from ideal stereochemistry. The model was allowed to change only through the rotation of single bonds and the movement of rigid groups. Both approaches were frustrating to a certain extent. With FFTSF it was a struggle to find a model that agreed with all observations at once. With CORELS it was difficult to get the model to fit the density, because small and, apparently, insignificant deviations from ideality often added up after many residues to large and significant displacements, and these were forbidden. Neither approach to stereochemistry seemed very convenient, although CORELS was used for early-stage refinement for many years because of its exceptional radius of convergence.

Both PROLSQ and EREF enforced ideal stereochemistry and agreement with the diffraction data simultaneously. This strategy proved very convenient and generated models that satisfied their users. The two programs differed significantly in the form in which they required that the ideal values be entered. PROLSQ required that the ideal values for both bond lengths and bond angles be entered as distances, e.g. an angle was defined by the distance between the two extreme atoms. EREF required that the standard value for an angle simply be entered in degrees. Since EREF stored its library of standard values in the same terms as those with which people were familiar, it was much easier to enter the values.
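The two conventions are interconvertible through the law of cosines, which is presumably how an EREF-style angle in degrees maps onto a PROLSQ-style 1–3 distance. A small Python sketch (our illustration, not code from either program):

```python
import math

def angle_to_13_distance(r12, r23, theta_deg):
    """Convert an ideal bond angle (degrees, EREF style) into the
    distance between the two extreme atoms (PROLSQ style):
    d^2 = r12^2 + r23^2 - 2 r12 r23 cos(theta)."""
    theta = math.radians(theta_deg)
    return math.sqrt(r12 * r12 + r23 * r23
                     - 2.0 * r12 * r23 * math.cos(theta))

# A tetrahedral C-C-C angle with 1.54 A bonds:
print(round(angle_to_13_distance(1.54, 1.54, 109.47), 3))
```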

These two programs differed in another way as well. PROLSQ stored ideal values for the stereochemistry of each type of residue (e.g. alanine, glycine etc.), while EREF parameterized the library in terms of atom types. For example, the angle formed by three atoms, the first a keto oxygen, the second a carbonyl carbon and the third an amide nitrogen, would have a particular ideal value regardless of where these three atoms occurred. In this matter, PROLSQ was more similar to the thought patterns of crystallographers.

25.2.4.3. Design principles


TNT was designed with three fundamental principles in mind. Each principle has a number of consequences that shaped the ultimate form of the package.

25.2.4.3.1. Refinement should be simple to run


The user should not be burdened with the choice of input parameters that they may not be qualified to choose. They also should not be forced to construct an input file that is obscure and difficult to understand. It is hard now to remember what most computer programs were like in the 1970s. Usually, the input to the program was a block of numbers and flags where the meaning of each item was defined by its line and column numbers. This block not only contained information the programmer could never anticipate, like the cell constants, but defined how the computer's memory should be allocated and obscure parameters that could only be estimated after careful reading of research papers.

TNT was one of the first programs in crystallography to have its input introduced with keywords and to allow input statements to come in any order. As an example of the difference, consider the resolution limits. Usually, a crystallographic program would have a line in its input similar to

   99.0,1.9,

One had to recognize this line amongst many as the line containing the resolution limits. (In many programs, a value of 99 was used to indicate that no lower-resolution limit was to be applied.) In TNT the same data would be entered as

   RESOLUTION 1.9

The keyword identifies the data as the resolution limit(s). If the statement contained two numbers, they were considered the upper and lower limits of the diffraction data.

The preceding example also shows how default values can be implemented by a program much more safely with keyword-based input. In the previous scheme, if a value was ever to be changed by the user, its place had to be allocated in the input block. This often left numbers floating in the block which were almost never changed, and because they were so infrequently referred to, they were usually unrecognized by the user. It was quite possible for one of these numbers to be accidentally changed and the error unnoticed for quite some time. When the data are introduced with keywords, a data item is not mentioned if the default value is suitable.
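The contrast between the two styles can be sketched in a few lines. This hypothetical parser is not TNT's; it only illustrates keyword-introduced, order-independent statements in which unmentioned items keep their defaults:

```python
def parse_keyword_input(lines, defaults=None):
    """Keyword-style input in the spirit of TNT: each statement begins
    with a keyword, statements may appear in any order, and any item not
    mentioned keeps its default value.  (Illustrative only.)"""
    params = dict(defaults or {})
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        keyword, values = tokens[0].upper(), [float(v) for v in tokens[1:]]
        if keyword == 'RESOLUTION':
            if len(values) == 1:          # one number: upper limit only
                params['resolution'] = (values[0], None)
            else:                         # two numbers: upper and lower limits
                params['resolution'] = (min(values), max(values))
    return params

print(parse_keyword_input(['RESOLUTION 1.9 99.0']))
print(parse_keyword_input([], {'resolution': (3.0, 100.0)}))  # default kept
```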

25.2.4.3.2. Refinement should run quickly and use as little memory as possible


The most time-consuming calculations in refinement are the calculation of structure factors from atomic coordinates and the calculation of derivatives of the part of the residual dependent upon the diffraction data with respect to the atomic parameters. The quickest means of performing these calculations requires the use of space-group-optimized fast Fourier transforms (FFTs). The initial implementation of TNT used FFTs to calculate structure factors, but the much slower direct summation method to calculate the derivatives. Within a few years, Agarwal's method (Agarwal, 1978[link]; Agarwal et al., 1981[link]) was incorporated into TNT and from then on all crystallographic calculations were performed with FFTs.

The FFT programs of Ten Eyck (1973[link], 1977[link]) made very efficient use of computer memory. Another means of saving memory was to recognize that the code for calculating stereochemical restraints did not need to be in memory while the crystallographic calculations were being performed, and vice versa. There were two ways to save memory using this information: one could create a series of `overlays', or one could break the calculation into a series of separate programs. The means for defining an overlay structure were never standardized and could not be ported from one type of computer to another; overlays were therefore never attempted in TNT. For this reason, and a number of others mentioned here, TNT is not a single program but a collection of programs, each with a well defined and specialized purpose.

25.2.4.3.3. The source code should not require customization for each project


The need to state this goal seems remarkable in these modern times, but the truth is that most computer programs in the 1970s required specific customizations before they could be used. The simplest modifications were the definitions of the maximum number of atoms, residues, atom types etc. accepted by the program. These modifications are still required in Fortran77 programs because that language does not allow the dynamic allocation of memory. However, in most programs today the limits are set high enough that the standard configuration does not present a problem.

The most difficult modification required for programs like PROLSQ was to adapt the calculations to the space group in hand. Their authors usually included code for the space groups they were particularly interested in, leaving all others to be implemented by the user. Writing code for a new space group was often a daunting task for someone who was not an expert programmer and had no tools for testing the modifications.

It is too burdensome to require that the user understand the internal workings of a complex calculation well enough to code and debug central subroutines of a refinement program. In its initial implementation, TNT avoided this problem, to an extent, by performing the space-group-specific calculations in separate programs. At least the user did not need to modify an existing program. All that was required was the construction of a program that read the proper file format, performed the calculation and wrote its answer in the proper format. The user was required to supply both a program that could calculate structure factors from the model and another that could calculate the derivative of the diffraction component of the residual function with respect to the atomic parameters of the model.

While a structure-factor program could usually be located, either by finding an existing program or by expanding the model to a lower-symmetry space group for which a program did exist, the requirement of creating a derivative program proved too great a burden. The derivation of the space-group-specific calculation, its implementation and debugging proved too difficult for almost everyone, and this design was quickly abandoned. Instead, an implementation of Agarwal's (1978) algorithm was created. In this method, the derivatives are calculated with a series of convolutions with an Fo − Fc map. The calculation of the map is the only space-group-specific part of the calculation, and this was done with a separate program for calculating Fourier syntheses. Such programs were as easy to come by as structure-factor calculation programs and could be replaced by a lower-symmetry program if required.

While it is easier to find or write a program that only calculates a Fourier transform and much easier to debug one than to debug a modification to a larger and more complex program, it is still difficult. The lack of availability of programs for the space group of a crystal often prevented the use of TNT. Over time, programs for more space groups were written and distributed with TNT. Eventually, a method was developed by one of TNT's authors in which FFTs could be calculated using a single program as efficiently as the original space-group-specific programs. Once this program existed, there was no longer the need for isolated structure-factor and Fourier synthesis programs. These calculations have disappeared into the heart of TNT, and TNT consists of many fewer programs today than in the past.

25.2.4.4. Current structure of the package


TNT presents different faces to different users. Some users simply want to run refinement; they see the shell interface. Others want to use the TNT programs in untraditional ways; they see the program interface. A few users want to change the basic calculations of TNT; they see the library interface.

The shell interface is the view of TNT that most people see. It is the most recent structural addition, having been added in release 5E in 1995. At this level, the restraints, weights and parameters of the model are described in the `TNT control file', and the user performs particular calculations by giving commands at the shell prompt. For example, refinement is performed with the `tnt' command, and maps for examination in a graphics program are calculated with the `make_maps' command. TNT is supplied with about two dozen shell commands. These commands allow the running of refinement, the conversion of the model to and from TNT's internal format, and the examination of the model to locate potential problem spots. The TNT Users' Guide describes the use of TNT at this level.

The program interface consists of the individual TNT programs along with their individual capabilities. TNT consists of the program Shift, which handles all the minimization calculations, a program for each module (restraints that fall into a common class, e.g. diffraction data, ideal stereochemistry and noncrystallographic symmetry) and a number of utility programs of which the most important member is the program Convert, which reads and writes coordinate files in many formats. The user can write shell scripts (or modify those supplied with TNT) to perform a great many tasks that cannot be accessed with the standard set of scripts. The TNT Reference Manual describes the operation of each program.

If the programs in TNT do not perform the calculation wanted, the source code can be modified. The source code of TNT is supplied with the standard distribution. In order to make the code more manageable and understandable, it is divided into half a dozen libraries. All TNT programs use the lowest-level library to ensure consistency of the `look and feel' and use the basic data structures for storage of the model's parameters and the vital crystal data. To add new functionality, one can either modify an existing program, write a new program using the TNT libraries as a start, or write a new program from scratch ignoring the TNT libraries. As long as a program can read and write files of the same format as the rest of TNT, it will work well with TNT, even if it does not share any code. A library, which is not copyrighted, contains subroutines to read and write the crystallographic file formats used by the rest of TNT.

25.2.4.5. Innovations first introduced in TNT


TNT was designed not only as an easy-to-use tool for the refinement of macromolecular models, but also as a tool for testing new ideas in refinement. Since its source code is designed to allow easy reordering of tasks and simple modifications, a number of innovations in refinement made their first appearance in TNT. These features include the following.

25.2.4.5.1. Identifying and restraining symmetry-related contacts (1982)


Without a search for symmetry-related bad contacts, it was quite common to build atoms into the same density from two different sides of the molecule. A number of models in the PDB contain these types of errors because neither the refinement nor the graphics programs available at that time would indicate this type of error.

25.2.4.5.2. The ability of a single package to perform both individual atom and rigid-body refinement (1982)


Prior to TNT, one often started a refinement with rigid-body refinement using CORELS and then switched to another program. TNT was the first refinement package to allow both styles of refinement. One was not required to learn about two different packages when running TNT.

25.2.4.5.3. Space-group optimized FFTs for all space groups (1989)


This innovation allowed TNT to run efficiently in all space groups available to macromolecular crystals.

25.2.4.5.4. Modelling bulk solvent scattering via local scaling (∼1989)


With a simple and quick model of the scattering of the bulk solvent in the crystal (Tronrud, 1997[link]), the low-resolution data could be used in refinement for the first time. The inclusion of these data in the calculation of maps greatly improved their appearance.

25.2.4.5.5. Preconditioned conjugate-gradient minimization (1990)


This method of minimization (Axelsson & Barker, 1984[link]; Tronrud, 1992[link]) allows the direct inclusion of the diagonal elements of the second-derivative matrix and the indirect inclusion of its off-diagonal elements. An additional benefit is that it allows both positional parameters and B factors to be optimized in each cycle. Previously, one was required to hold one class of parameter fixed while the other was optimized. It is much more efficient and simpler for the user to optimize all parameters at once. This method, because it incorporates the diagonal elements directly, produces sets of B factors that agree with the diffraction data better than those from the simple conjugate-gradient method.
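Preconditioned conjugate gradients is standard numerical machinery; the sketch below applies it with a diagonal (Jacobi) preconditioner to a tiny symmetric positive-definite system. In refinement the matrix would be the second-derivative (normal) matrix, whose diagonal supplies the very different curvature scales of positions and B factors:

```python
def preconditioned_cg(A, b, x, m_inv, iters=50, tol=1e-10):
    """Solve A x = b (A symmetric positive definite) by conjugate
    gradients, preconditioned with m_inv, the inverse diagonal of A."""
    n = len(b)
    matvec = lambda v: [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
    r = [bi - yi for bi, yi in zip(b, matvec(x))]
    z = [m_inv[i] * r[i] for i in range(n)]
    p = z[:]
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rz / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if sum(ri * ri for ri in r) < tol:
            break
        z = [m_inv[i] * r[i] for i in range(n)]
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = preconditioned_cg(A, b, [0.0, 0.0], [1.0 / 4.0, 1.0 / 3.0])
print(x)
```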

25.2.4.5.6. Restraining stereochemistry of chemical links to symmetry-related molecules (∼1992)


It is not uncommon for crystallization enhancers to be found on a special position in the crystal. In addition, cross-linking the molecules in a crystal is often done for various reasons. In both cases, the model contains chemical bonds to a molecule or atoms in another asymmetric unit of the crystal. In order for the stereochemistry of these links to be properly restrained, it must be possible to describe such a link to the refinement program.

25.2.4.5.7. Knowledge-based B-factor restraints (∼1994)


When the resolution of the diffraction data set is less than about 2 Å, the individual B factors of a refined model are observed to vary wildly from atom to atom, even when the atoms are bonded to one another. This pattern is not reasonable if one interprets the B factor as a measure of the vibrational motion of the atom. Traditionally, one applies an additional restraint on the B factors of the model, where the ideal value for the difference in B factor for two bonded atoms is zero.

Since it is clear from examinations of higher-resolution models that the B factors generally increase from one side of a bond to the other (e.g. moving from the main chain to the end of a side chain), the traditional restraint is flawed. A restraint library was generated (Tronrud, 1996[link]) where each bond in a residue is assigned a preferred increment in B factor and a confidence (standard deviation) in that increment.
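Such a restraint is naturally written as a sum of squared, sigma-weighted differences between the actual and preferred B-factor increments along each bond. A Python sketch (the bond list and the numerical values are invented for illustration and are not taken from the Tronrud (1996) library):

```python
def b_restraint_residual(b_factors, bonds):
    """Knowledge-based B-factor restraint: each bond (i, j, dpref, sigma)
    contributes ((B_j - B_i - dpref) / sigma)^2, pushing B factors to
    *increase* by a preferred amount along the bond (e.g. outwards along
    a side chain) rather than to be equal, as the traditional restraint
    assumes."""
    return sum(((b_factors[j] - b_factors[i] - dpref) / sigma) ** 2
               for i, j, dpref, sigma in bonds)

# Two bonds running out a side chain, each preferring a +2 A^2 increment:
bonds = [(0, 1, 2.0, 1.0), (1, 2, 2.0, 1.0)]
print(b_restraint_residual([10.0, 12.0, 14.0], bonds))  # matches the preference
print(b_restraint_residual([10.0, 10.0, 10.0], bonds))  # flat profile penalized
```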

25.2.4.5.8. Block-diagonal preconditioned conjugate-gradient minimization with pseudoinverses (1998)


With this enhancement, TNT's minimizer treats the second-derivative matrix as a collection of 5 × 5 element blocks along its diagonal, one block for each atom. While this method improves the rate of convergence for noncrystallographic symmetry restraints, its most significant feature is that it allows the refinement of atoms located on special positions without special handling by the user.
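The role of the pseudoinverse is easiest to see in the diagonal case: a direction with zero curvature (e.g. a coordinate fixed by a special position) simply receives no shift. The sketch below therefore treats only the diagonal of each block; TNT itself works with full symmetric 5 × 5 blocks per atom (x, y, z, B, occupancy):

```python
def diag_pseudoinverse(curvatures, eps=1e-8):
    """Pseudoinverse of a diagonal curvature block: invert non-zero
    entries, map (near-)zero ones to zero."""
    return [0.0 if abs(c) < eps else 1.0 / c for c in curvatures]

def preconditioned_shift(gradient, curvatures):
    """Parameter shift = -pinv(H_block) . gradient for one atom's block.
    Zero-curvature directions get a zero shift, so atoms on special
    positions need no special handling by the user."""
    return [-p * g for p, g in zip(diag_pseudoinverse(curvatures), gradient)]

# Gradient and curvature for one atom's five parameters (x, y, z, B, occ);
# zero curvature along y and occupancy mimics a special-position atom.
grad = [2.0, 5.0, 1.0, 4.0, 3.0]
curv = [4.0, 0.0, 2.0, 8.0, 0.0]
print(preconditioned_shift(grad, curv))
```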

25.2.4.5.9. Generalization of noncrystallographic symmetry operators to include shifts in the average B factor (1998)


It is rather common in crystals containing multiple copies of a molecule in the asymmetric unit for one or more molecules to have a higher B factor than the others. If the transformation that generates each copy of the molecule consists only of a rotation and translation of the positions of the atoms, the difference in B factors cannot be modelled. The transformations used in TNT now consist of a rotation, translation, a B-factor shift and an occupancy shift.
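Such a generalized operator amounts to an ordinary rotation and translation of the coordinates, augmented with additive shifts in B factor and occupancy. A minimal illustration (the tuple layout is ours, not TNT's internal representation):

```python
def apply_generalized_ncs(atom, R, t, dB, d_occ):
    """Generalized NCS operator: rotate and translate the position, and
    shift the B factor and occupancy, so a uniformly 'hotter' copy of
    the molecule can still be related to the reference copy.
    atom = ((x, y, z), B, occupancy)."""
    (x, y, z), B, occ = atom
    xyz = tuple(R[i][0] * x + R[i][1] * y + R[i][2] * z + t[i]
                for i in range(3))
    return xyz, B + dB, occ + d_occ

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
atom = ((1.0, 2.0, 3.0), 20.0, 1.0)
# Second copy: translated along x and 5 A^2 'hotter' on average.
print(apply_generalized_ncs(atom, identity, (10.0, 0.0, 0.0), 5.0, 0.0))
```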

25.2.4.6. TNT as a research tool


TNT was intended not only as a tool for performing refinement, but as a tool for developing new ideas in refinement. While most of the latter has been done by TNT's authors, several others have made good use of TNT in this fashion. If one has an idea to test, the overhead of writing an entire refinement package to perform that test is overwhelming. TNT allows modification at a number of levels, so one can choose to work at the level that allows the easiest implementation of the idea. Several examples follow.

25.2.4.6.1. Michael Chapman's real-space refinement package


At Florida State University, Chapman has implemented a real-space refinement package, principally intended for the refinement of virus models, using TNT. He was able to use TNT's minimizer and stereochemical restraints unchanged along with programs he developed to implement his method. More information about this package can be found at http://www.sb.fsu.edu/~rsref .

25.2.4.6.2. Gerard Bricogne's Buster refinement package


Bricogne & Irwin (1996)[link] have developed a maximum-likelihood refinement package using TNT. Not only are TNT's minimizer and stereochemical restraints used, but many of the calculations of the maximum-likelihood residual's derivatives are performed using TNT programs. While Bricogne and co-workers have not needed to modify TNT programs to implement their ideas, there is ongoing collaboration between them and TNT's authors on the development of commands that allow access to some previously internal calculations. More information about Buster can be found at http://babinet.globalphasing.com/buster/ .

25.2.4.6.3. Randy Read's maximum-likelihood function

When Navraj Pannu wanted to implement Read's maximum-likelihood refinement functions (Pannu & Read, 1996b[link]) in TNT, he chose not to implement them as a separate program, but instead modified TNT's source code to create a new version of the program Rfactor, named Maxfactor.

25.2.4.6.4. J. P. Abrahams' likelihood-weighted noncrystallographic symmetry restraints

Abrahams (1996)[link] conceived the idea that because some amino-acid side chains can be expected to violate the noncrystallographic symmetry (NCS) of the crystal more than others, one could develop a library of the relative strength with which each atom of each residue type would be held by the NCS restraint. He chose to determine these strengths from the average of the current agreement to the NCS of all residues of the same type. For example, if the lysine side chains do not agree well with their NCS mates, the NCS will be loosely enforced for those side chains. On the other hand, if almost all the valine side chains agree well with their mates, then the NCS will be strongly enforced for the few that do not agree well.

He chose to implement this idea by modifying the source code for the TNT program NCS. Since the calculations involved in implementing this idea are simple, the extent of the modifications was not large.
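A toy version of such a weight library can be written in a few lines. The inverse-quadratic dependence on the mean deviation below is an assumed illustrative form, not the functional form Abrahams actually used:

```python
import numpy as np
from collections import defaultdict

def ncs_weights_by_residue_type(deviations, base_weight=1.0):
    """deviations: iterable of (residue_type, deviation_from_NCS_mate in A).
    Returns a restraint weight per residue type that shrinks as the mean
    deviation for that type grows, so residue types that agree well with
    their NCS mates are restrained strongly and poorly agreeing types
    (e.g. flexible lysine side chains) are restrained loosely."""
    per_type = defaultdict(list)
    for rtype, dev in deviations:
        per_type[rtype].append(dev)
    return {rtype: base_weight / (1.0 + np.mean(devs) ** 2)
            for rtype, devs in per_type.items()}
```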

25.2.5. The ARP/wARP suite for automated construction and refinement of protein models

V. S. Lamzin,n* A. Perrakiso and K. S. Wilsonp

25.2.5.1. Refinement and model building are two sides of modelling a structure

The conventional view of crystallographic refinement of macromolecules is the optimization of the parameters of a model to fit both the experimental data and a set of a priori stereochemical observations. The user provides the model and, although the values of its parameters are allowed to vary during the minimization cycles, the presence of the atoms is fixed, i.e. the addition or removal of parts of the model is not allowed. As a result, users are often faced with a situation where several atoms lie in one place, while the density maps suggest an entirely different location. Manual intervention, consisting of moving atoms to a more appropriate place using molecular graphics, density maps and geometrical assumptions, can solve the problem and allow refinement to proceed further.

The Automated Refinement Procedure (ARP; Fig. 25.2.5.1[link]) (Lamzin & Wilson, 1993[link], 1997[link]; Perrakis et al., 1999[link]) challenges this classical view by adding real-space manipulation of the model, mimicking user intervention in silico. Adding and/or deleting atoms (model update) and complete re-evaluation of the model to create a new one that better describes the electron density (model reconstruction) can achieve this aim.

Figure 25.2.5.1

A flow chart of the Automated Refinement Procedure.

25.2.5.1.1. Model update

The quickest way to change the position of an atom substantially is not to move it, but to remove it from its current (probably wrong) site and to add a new atom at a new (hopefully right) position. Such updating of the model does not imply that all rejected atoms are immediately repositioned in a new site, so the number of atoms to be added does not have to equal the number rejected.

Atom rejection in ARP is primarily based on the interpolated [2mF_{o} - DF_{c}] or [3F_{o} - 2F_{c}] electron density at the atomic centre and on the agreement of the atomic density distribution with a target shape. Applied together, these criteria offer a powerful means of identifying incorrectly placed atoms, but they can also reject correctly located atoms; such an atom should then be selected again and put back in the model by the addition step. Further, perhaps more elegant, criteria may be expected as the technique develops.

Atom addition uses the difference [mF_{o} - DF_{c}] or [F_{o} - F_{c}] Fourier synthesis. The selection is based on grid points rather than peaks, as the latter are often poorly defined and may overlap with neighbouring peaks or existing atoms, especially if the resolution and phases are poor. The map grid point with the highest electron density satisfying the defined distance constraints is selected as a new atom, grid points within a defined radius around this atom are rejected and the next highest grid point is selected. This is iterated until the desired number of new atoms is found, and reciprocal-space minimization is then used to optimize the new atomic parameters.
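The grid-point selection just described can be sketched as follows (an illustrative function, not ARP's actual code; distances are in Å and `grid_to_xyz` is an assumed helper mapping grid indices to Cartesian coordinates):

```python
import numpy as np

def pick_new_atoms(density, grid_to_xyz, existing_xyz, n_atoms,
                   min_dist=1.1, max_dist=3.3, exclusion=1.0):
    """Select up to n_atoms grid points in order of decreasing difference
    density, keeping only points that are at least min_dist from every
    current atom but within max_dist of at least one, and suppressing
    grid points within `exclusion` of an already selected point."""
    order = np.argsort(density, axis=None)[::-1]  # densest points first
    picked = []
    for idx in order:
        xyz = grid_to_xyz(np.unravel_index(idx, density.shape))
        d_model = np.linalg.norm(existing_xyz - xyz, axis=1)
        if not (min_dist <= d_model.min() <= max_dist):
            continue  # violates the distance constraints
        if picked and np.linalg.norm(np.array(picked) - xyz, axis=1).min() < exclusion:
            continue  # too close to a point already selected
        picked.append(xyz)
        if len(picked) == n_atoms:
            break
    return np.array(picked)
```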

Real-space refinement based on density shape analysis around an atom can be used for the definition of the optimum atomic position. Atoms are moved to the centre of the peak using a target function that differs from that employed in reciprocal-space minimization. The function used is the sphericity of the site, which keeps an atom in the centre of the density cloud but has little influence on the R factor and phase quality. It is only applicable for well separated atoms and is mainly used for solvent atoms at high resolution.
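As an illustration of the centring idea only (the sphericity target function itself is not reproduced here), a well separated atom can be moved towards the density-weighted centroid of the map values around it:

```python
import numpy as np

def centre_atom(map_points, map_values, atom_xyz, radius=1.0):
    """Move an atom to the density-weighted centroid of the map points
    within `radius` of it: a crude stand-in for keeping the atom at the
    centre of its density cloud. map_points is (N, 3), map_values (N,)."""
    d = np.linalg.norm(map_points - atom_xyz, axis=1)
    sel = d <= radius
    w = map_values[sel]
    return (map_points[sel] * w[:, None]).sum(axis=0) / w.sum()
```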

Geometrical constraints are based on a priori chemical knowledge of the distances between covalently linked carbon, nitrogen and oxygen atoms (1.2 to 1.6 Å) and hydrogen-bonded atoms (2.2 to 3.3 Å). Such constraints are applied in rejection and addition of atoms.

25.2.5.1.2. Model reconstruction

The main problem in automatically reconstructing a protein model from electron-density maps is in achieving an initial tracing of the polypeptide chain, even if the result is only partially complete. Subsequent building of side chains and filling of possible gaps is a relatively straightforward task. The complexity of the autotracing can be nicely illustrated by analogy with the well known travelling-salesman problem. Suppose one is faced with 100 trial peptide units possessing two incoming and two outgoing connections on average, which is close to what happens in a typical ARP refinement of a 10 kDa protein. Assuming that one of the chain ends is known and that it is possible to connect all the points regardless of the chosen route, one is faced with the problem of choosing the best chain out of 2^98. In practice, the situation is even more complex, as not all trial peptides are necessarily correctly identified in the first iteration and some may be missing – analogous to the correctness or incorrectness of the atomic positions described above.

If the connections can be assigned a probability of the peptide being correct, then only the path that visits each node exactly once and maximizes the total probability remains to be identified. Automatic density-map interpretation is based on the location of the atoms in the current model and consists of several steps. Firstly, each atom of the free-atom model is assigned a probability of being correct. Secondly, these weighted atoms are used for identification of patterns typical for a protein. The method utilizes the fact that all residues that comprise a protein, with the exception of cis peptides, have chemically identical main-chain fragments which are close to planar: the structurally identical Cα—C—O—N—Cα trans peptide units.

The problem of searching for possible peptide units and their connections thus becomes straightforward. The most crucial factor is that proteins are composed of linear non-branching polypeptide chains, allowing sets of connected peptides to be obtained from an initial list of all possible tracings. Choosing the direction of a chain path is carried out on the basis of the electron density and observed backbone conformations. The set of peptide units and the list of how they are interconnected do not, however, allow unambiguous tracing of a full-length chain in most cases.
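With per-connection probabilities in hand, selecting a chain becomes a search over once-visited paths. The brute-force sketch below (hypothetical data layout, workable only for small graphs; the combinatorial estimate above shows why the real procedure must prune aggressively) maximizes the total connection probability, as described in the text:

```python
def best_chain(start, connections, probs):
    """Depth-first enumeration of non-branching paths through a
    peptide-connection graph, visiting each peptide at most once.
    connections: dict node -> list of candidate next nodes.
    probs: dict (node, next) -> probability the connection is correct.
    Returns the path maximizing the summed connection probability."""
    best = ([start], 0.0)

    def dfs(node, path, score):
        nonlocal best
        if score > best[1]:
            best = (path[:], score)
        for nxt in connections.get(node, []):
            if nxt not in path:  # each peptide visited at most once
                path.append(nxt)
                dfs(nxt, path, score + probs[(node, nxt)])
                path.pop()

    dfs(start, [start], 0.0)
    return best
```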

Taken together, the probabilistic identification of the peptide units, the naturally high conformational flexibility of the connections of the peptide units and the limited quality of the X-ray data and/or phases introduce large enough errors to cause density breaks in the middle of the chains or result in density overlaps. Thus, the result of such a tracing is usually a set of several main-chain fragments. The less accurate the starting maps (i.e. initial phases) and the lower the resolution and quality of the X-ray data, the more breaks there will be in the tracing and the greater the number of peptide units which will be difficult to identify.

Residues are differentiated only as glycine, alanine, serine and valine, and complete side chains are not built at this stage. For every polypeptide fragment, a side-chain type can be assigned with a defined probability, using connectivity criteria from the free-atom models and the α-carbon positions of the main-chain fragments. Given these guesses for the side chains and provided the sequence is known, the next step employs docking of the polypeptide fragments into the sequence. Each possible docking position is assigned a score, which allows automated inspection of the side-chain densities, search for expected patterns and building of the most probable side-chain conformations.
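Docking a fragment into the sequence can be sketched as a sliding-window score (the data layout is hypothetical and the actual ARP/wARP scoring is more elaborate):

```python
def dock_fragment(sequence, fragment_probs):
    """Slide a traced fragment along the known sequence.
    fragment_probs[i] is a dict mapping residue type -> probability
    guessed for fragment position i (e.g. from the crude G/A/S/V
    classification).  Each docking offset is scored by summing the
    probability assigned to the actual residue at that position;
    the best-scoring offset is returned."""
    n, m = len(sequence), len(fragment_probs)
    scores = [sum(p.get(sequence[off + i], 0.0)
                  for i, p in enumerate(fragment_probs))
              for off in range(n - m + 1)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```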

25.2.5.1.3. Representation of a map by free-atom models

An electron-density map can be used to create a free-atom atomic model, with equal atoms placed in regions of high density (Perrakis et al., 1997[link]). To build this model, only the molecular weight of the protein is required, without any sequence information. In brief, a map covering a crystallographic asymmetric unit on a fine grid of about 0.25 Å is constructed. The model is slowly expanded from a random seed by the stepwise addition of atoms in significant electron density and at bonding distances from existing atoms. All atoms in this model and in all subsequent steps are considered to be of the same type. As ARP proceeds, the geometrical criteria remain the same, but the density threshold is gradually reduced, allowing positioning of atoms in lower-density areas of the map. The procedure continues until the number of atoms is about three times that expected. This number is then reduced to about 20% more than the expected number of atoms by removing those in weak density. This method of map parameterization has the advantage that it puts atoms at protein-like distances while covering the whole volume of the protein.
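The growth-and-prune loop might look like this in outline (illustrative only: the 3× overshoot and ~20% excess come from the text, while the bonding-distance window, the indexing scheme and the function itself are assumptions):

```python
import numpy as np

def grow_free_atom_model(grid_xyz, density, n_expected, seed_idx,
                         bond_min=1.1, bond_max=1.8):
    """Grow a free-atom model from a random seed.  Grid points are taken
    in order of decreasing density (so the acceptance threshold falls
    implicitly) and added when they lie at bonding distance from an
    existing atom.  Growth overshoots to 3*n_expected atoms, then the
    model is pruned back to ~1.2*n_expected, keeping the strongest
    density.  grid_xyz is (N, 3); density is (N,); returns kept indices."""
    order = np.argsort(density)[::-1]
    atoms = [seed_idx]
    placed = {seed_idx}
    changed = True
    while changed and len(atoms) < 3 * n_expected:
        changed = False
        for idx in order:
            if len(atoms) >= 3 * n_expected:
                break
            if idx in placed:
                continue
            d = np.linalg.norm(grid_xyz[atoms] - grid_xyz[idx], axis=1)
            if bond_min <= d.min() <= bond_max:
                atoms.append(int(idx))
                placed.add(int(idx))
                changed = True
    atoms.sort(key=lambda i: density[i], reverse=True)
    return atoms[:int(round(1.2 * n_expected))]
```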

25.2.5.1.4. Hybrid models

A free-atom model can describe almost every feature of an electron-density map, but this interpretation rarely resembles a conventional conception of a protein. Nevertheless, information from parts of the improved map and the free-atom model can be automatically recognized as containing elements of protein structure by applying the algorithms briefly described for model reconstruction, and at least a partial atomic protein model can be built. Combination of this partial protein model with a free-atom set (a hybrid model) allows a considerably better description of the current map. The protein model provides additional information (in the form of stereochemical restraints), while prominent features in the electron density (unaccounted for by the current model) are described by free atoms.

25.2.5.1.5. Real-space manipulation coupled with reciprocal-space refinement

The procedure of real-space manipulation is coupled to least-squares or maximum-likelihood optimization of the model's parameters against the X-ray data. This is the scheme that we generally refer to as ARP refinement, though there are two distinct modes of ARP. In the unrestrained mode, all atoms in reciprocal-space refinement are treated as free atoms with unknown connectivity and are refined against the experimental data alone. This mode has a higher radius of convergence but needs high-resolution diffraction data to perform effectively. In the restrained mode, a model or a hybrid model is required, i.e. the atoms must belong to groups of known stereochemistry. This stereochemical information, in the form of restraints, can then be utilized during the reciprocal-space minimization, allowing it to proceed with less data, presuming that the connectivity of the input atoms is basically correct.

25.2.5.2. ARP/wARP applications

25.2.5.2.1. Model building from initial phases

The hybrid models described above are used as the main tool for obtaining as full a protein model as possible from the map calculated with the initial phases.

Given the information contained in the hybrid model in the traditional form of stereochemical restraints, reciprocal-space refinement can work more efficiently, new improved phases can be obtained and a more accurate and complete protein model can be constructed. The new hybrid model can be re-input to refinement and these steps can be iterated so that improved phases result in construction of ever larger parts of the protein. An almost complete protein model can be obtained in a fully automated way.

25.2.5.2.2. Refinement of molecular-replacement solutions

Starting from a molecular-replacement solution implies that a search model positioned in the new lattice is already available. The model can be directly incorporated in restrained ARP refinement. If the starting model is very incomplete or different, its atoms can be regarded as free atoms and the solution can be treated as starting from just initial phases. This increases the radius of convergence and minimizes the bias introduced by the search model.

25.2.5.2.3. Density modification via averaging of multiple refinements

Slightly varying the protocol described for generating models from maps results in a set of slightly different free-atom models. Each model is then submitted to ARP. In protein crystallography, there are generally insufficient data for convergence of free-atom refinement to a global minimum and different starting models result in final models with small differences, i.e. containing different errors. Averaging of these models can be utilized to minimize the overall error. The procedure in effect imposes a random noise, small enough to be eliminated during the subsequent averaging, but large enough to overcome at least some of the systematic errors.

Structure factors are calculated for all the refined models and a vector average of the calculated structure factors is derived. The phase of the vector average is more accurate than that from any of the individual models. A weight, WwARP, is assigned to each structure factor on the basis of the variance of the two-dimensional distribution of the individual structure factors around the mean. The mean value of WwARP over all reflections and the R factor after averaging can be used to judge the progress of the averaging procedure.
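Vector averaging of complex structure factors is simple to sketch. The weight below merely decreases as the individual structure factors scatter more widely about the mean; it is a stand-in for the WwARP weight, whose exact formula is not reproduced here:

```python
import numpy as np

def average_structure_factors(f_models):
    """f_models: (n_models, n_refl) complex array of structure factors
    calculated from independently refined models.  Returns the vector
    (complex) average, whose phase is more accurate than that of any
    single model, and a per-reflection weight in (0, 1] that shrinks
    with the spread of the individual values about the mean."""
    f_mean = f_models.mean(axis=0)
    spread = np.abs(f_models - f_mean).mean(axis=0)
    weight = 1.0 / (1.0 + (spread / (np.abs(f_mean) + 1e-9)) ** 2)
    return f_mean, weight
```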

25.2.5.2.4. Ab initio solution of metalloproteins

If the coordinates of one or a few heavy atoms are known, initial phases can be calculated. The problem of solving the structure of such a metalloprotein from the sites of the metal alone can be considered in the same framework as for heavy-atom-replacement solutions. Maps calculated from the phases of heavy atoms alone often have the best defined features within a defined radius of the heavy atom(s). Thus protocols that do not place all atoms at the start but instead perform a slow building while extending the model in a growing sphere around the heavy atom are preferred. When such a model is essentially complete, it can be used for automated tracing and completion of the model.

25.2.5.2.5. Solvent building

In this application, the protein (or nucleic acid) model is not rebuilt during refinement, and only the solvent structure is continuously updated, allowing the construction of a solvent model without iterative manual map inspection.

25.2.5.3. Applicability and requirements

Density-based atom selection for the whole structure is only possible if the X-ray data extend to a resolution where atomic positions can be estimated from the Fourier syntheses with sufficient accuracy for them to refine to the correct position. If the structural model is of reasonable quality, at 2.5 Å or better, at least a part of the solvent structure or a small missing or badly placed part of the protein can be located. This provides indirect improvement of the whole structure. For automated model rebuilding, or for refining poor molecular-replacement solutions, higher resolution is essential. The general requirement is that the number of X-ray reflections should be at least six to eight times the number of atoms in the model, which roughly corresponds to a resolution of 2.3 Å for a crystal with 50% solvent. However, the method can work at lower resolution or fail at higher resolution, depending less on the quality of the initial phases and more on the internal quality of the data and on the inherent disorder of the molecule.
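The six-to-eight observations per atom figure can be reproduced with a rough counting argument: unique reflections to d_min from the reciprocal-space sphere, halved for Friedel pairs, over an estimated atom count. The ~17.5 Å³ average volume per ordered non-H protein atom is an assumed value, so this is an order-of-magnitude sketch only:

```python
import numpy as np

def observations_per_atom(v_cell, n_sym, d_min, solvent_frac=0.5,
                          atom_volume=17.5):
    """Rough data-to-atom ratio for a crystal.
    v_cell: unit-cell volume (A^3); n_sym: symmetry operators;
    d_min: resolution limit (A); atom_volume: assumed A^3 per
    ordered non-H atom of protein."""
    # reflections inside the 1/d_min sphere, per asymmetric unit,
    # halved because Friedel mates are not independent
    n_unique = (4.0 * np.pi / 3.0) * v_cell / d_min ** 3 / (2 * n_sym)
    n_atoms = (1.0 - solvent_frac) * (v_cell / n_sym) / atom_volume
    return n_unique / n_atoms
```

With 61% solvent the same formula gives roughly 7.7, in line with the `seven observations per atom' quoted for the chitinase example later in this section.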

The X-ray data should be complete. If strong low-resolution data (e.g. 4 to 10 Å) are systematically missing, e.g. due to detector saturation, the electron density even for good models is often discontinuous. Because ARP involves updating on the basis of density maps, such discontinuity will lead to incorrect interpretation of the density and slow convergence or even uninterpretable output.

25.2.5.4. An example

The structure of chitinase A from Serratia marcescens (Perrakis et al., 1994[link]) was initially solved by multiple isomorphous replacement with anomalous signal (MIRAS), with only a single derivative contributing to resolution higher than 5.0 Å. The MIRAS map (2.5 Å) was solvent-flattened. Model building was not straightforward and much time was spent in tracing the protein chain.

As an experiment, the solvent-flattened map was used to initiate building of free-atom models, using least-squares minimization against the native 2.3 Å data combined with ARP. This resulted in crystallographic R factors ranging between 20.1 and 22.4%. Each ARP model gave phases marginally worse than those available by solvent flattening alone, due to the limited resolution of the native data. However, the wARP averaging procedure resulted in a reduction of 11.2° in the weighted mean phase error. The map correlation coefficient between the final map and the wARP map was 81.2%, better by 12.8% compared with the solvent-flattened map.

The wARP model with the lowest R factor was used to initiate model building. In the initial tracing, 75 residues were identified, belonging to more than 20 different main-chain fragments. After autobuilding, ten cycles of restrained ARP were run according to the standard protocol. One REFMAC cycle of conjugate-gradient minimization was executed to optimize a maximum-likelihood residual and bulk solvent scaling. [\sigma_{A}]-weighted maps were calculated and ARP was used to update the model. All atoms (main-chain, side-chain and free atoms) were allowed to be removed and new atoms were added where appropriate. After ten iterations, a new building cycle was invoked. After every `big' cycle, a more complete model was obtained. This `big' cycle was iterated 20 times. Finally, 515 residues were traced in nine chains, all of which were docked unambiguously into the sequence. This is the lowest-resolution application to date: 2.3 Å was the real resolution limit of the data measured from these crystals, but the high solvent content (61%) provided on average seven observations per atom, and an almost complete trace was easily accomplished.

25.2.6. PROCHECK: validation of protein-structure coordinates

R. A. Laskowski,w M. W. MacArthurq and J. M. Thorntonx*

25.2.6.1. Introduction

As in all scientific measurements, the parameters that result from a macromolecular structure determination by X-ray crystallography (e.g. atomic coordinates and B factors) will have associated uncertainties. These arise not only from systematic and random errors in the experimental data but also from the interpretation of those data. Currently, the uncertainties cannot easily be estimated for macromolecular structures due to the computer- and memory-intensive nature of the calculations required (Tickle et al., 1998[link]). Thus, more indirect methods are necessary to assess the reliability of different parts of the model, as well as the reliability of the model as a whole. Among these methods are those which rely on checking only the stereochemical and geometrical properties of the model itself, without reference to the experimental data (MacArthur et al., 1994[link]; Laskowski et al., 1998[link]). Here we describe PROCHECK (Laskowski et al., 1993[link]), which is one of these structure-validation methods.

The PROCHECK program computes a number of stereochemical parameters for the given protein model and compares them with `ideal' values obtained from a database of well refined high-resolution protein structures in the Protein Data Bank (PDB; Bernstein et al., 1977[link]). The results of these checks are output in easy-to-understand coloured plots in PostScript format (Adobe Systems Inc., 1985[link]). Significant deviations from the derived standards of normality are highlighted as being `unusual'.

The program's primary use is during the refinement of a protein structure; the highlighted regions can direct the crystallographer to parts of the structure that may have problems and which may need attention. It should be noted that outliers may just be outliers; they are not necessarily errors. Unusual features may have a reasonable explanation, such as distortions due to ligand binding in the protein's active site. However, if there are many oddities throughout the model, this could signify that there is something wrong with it as a whole. Conversely, if a model has good stereochemistry, this alone is not proof that it is a good model of the protein structure.

Because the program requires only the 3D atomic coordinates of the structure, it can check the overall `quality' of any model structure: whether derived experimentally by crystallography or NMR, or built by homology modelling. In the case of NMR-derived structures, it is useful to compare the protein geometry across the whole ensemble. An extended version of PROCHECK, called PROCHECK-NMR, is available for this purpose (Laskowski et al., 1996[link]), but will not be described here.

Note that PROCHECK only examines the geometrical properties of protein molecules; it ignores DNA/RNA and other non-protein molecules in the structure, except in so far as checking that the non-bonded contacts these make with the protein do not violate a fixed distance criterion.

25.2.6.2. The program

PROCHECK is in fact a suite of separate Fortran and C programs which are run successively via a shell script. The programs first `clean up' the input PDB file, relabelling certain side-chain atoms according to the IUPAC naming conventions (IUPAC–IUB Commission on Biochemical Nomenclature, 1970[link]), then calculate all the protein's stereochemical parameters and compare them against the norms, and finally generate the PostScript output and a detailed residue-by-residue listing. Hydrogen atoms and atoms with zero occupancy are omitted from the analyses and, where atoms are found in alternate conformations, only the highest-occupancy conformation is retained.
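The clean-up step can be sketched as a filter over fixed-column PDB records. This is a simplification of what the text describes, not PROCHECK's code: the element is crudely inferred from the atom name (which would misclassify, say, mercury HETATMs), and altloc `A' is kept rather than the true highest-occupancy conformer:

```python
def clean_pdb_atoms(lines):
    """Filter PDB coordinate records: drop hydrogens, zero-occupancy
    atoms and all but one alternate conformation (altloc 'A' kept here
    as a simplification).  Column slices follow the fixed-width PDB
    format: name 13-16, altLoc 17, occupancy 55-60 (1-based)."""
    kept = []
    for ln in lines:
        if not ln.startswith(('ATOM', 'HETATM')):
            continue
        name = ln[12:16].strip()
        altloc = ln[16]
        occ = float(ln[54:60])
        if name.startswith('H') or occ == 0.0 or altloc not in (' ', 'A'):
            continue
        kept.append(ln)
    return kept
```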

The source code for all the programs is available at http://www.biochem.ucl.ac.uk/∼roman/procheck/procheck.html . It has also been incorporated into the CCP4 suite of programs (Collaborative Computational Project, Number 4, 1994[link]) at http://www.dl.ac.uk/CCP/CCP4/main.html , and can be run directly via the web from the Biotech Validation Server at http://biotech.embl-ebi.ac.uk:8400/ .

25.2.6.3. The parameters

Table 25.2.6.1[link] shows the principal stereochemical parameters used by PROCHECK, based on the analysis of Morris et al. (1992)[link], who looked for measures that are good indicators of protein quality. The table shows the original parameters together with a more up-to-date set derived from a more recent data set including a number of atomic resolution structures (i.e. those solved to 1.4 Å resolution or better).

Table 25.2.6.1
Summary of expected values for stereochemical parameters in well resolved structures

Parameter | Old | New
% ϕ, ψ in core | >90.0% | >90.0%
χ1 gauche− | +64.1 ± 15.7° | +63.2 ± 11.4°
χ1 trans | +183.6 ± 16.8° | +182.7 ± 13.1°
χ1 gauche+ | −66.7 ± 15.0° | −66.0 ± 11.2°
χ1 pooled standard deviation | ±15.7° | ±11.8°
χ2 trans | +177.4 ± 18.5° | +177.2 ± 15.1°
χ3 S—S bridge (left-handed) | −85.8 ± 10.7° | −84.8 ± 8.5°
χ3 S—S bridge (right-handed) | +96.8 ± 14.8° | +92.2 ± 10.8°
Proline ϕ | −65.4 ± 11.2° | −64.6 ± 10.2°
α-Helix ϕ | −65.3 ± 11.9° | −65.5 ± 11.1°
α-Helix ψ | −39.4 ± 11.3° | −39.0 ± 9.8°
ω trans | +179.6 ± 4.7° | +179.5 ± 6.0°
Cα—N—C′—Cβ (ζ) virtual torsion angle | +33.9 ± 3.5° | +34.2 ± 2.6°

For the most part, the parameters given in Table 25.2.6.1[link] are not included in standard refinement procedures and so are less likely to be biased by them. They can thus provide a largely independent and unbiased validation check on the geometry of each residue and hence point to regions of the protein structure that are genuinely unusual.

As more atomic resolution structures become available (Dauter et al., 1997[link]), these parameters will be improved. Because of their high data-to-parameter ratio, such structures can be refined using less strict restraints, and hence contain a smaller degree of bias in their geometrical properties – at least for the well ordered parts of the model. Such information moves us a step closer to an understanding of the `true' geometrical and conformational properties of proteins in general and, one day, the target parameters will be derived exclusively from such structures.

PROCHECK also checks main-chain bond lengths and bond angles against the `ideal' values given by the Engh & Huber (1991)[link] analysis of small-molecule structures in the Cambridge Structural Database (CSD) (Allen et al., 1979[link]). Unlike the above parameters, these geometrical properties are usually restrained during refinement, and, furthermore, the Engh & Huber (1991)[link] targets are the ones most commonly applied. Thus analyses of these values merely reflect the refinement protocol used and do not provide meaningful indicators of local or overall errors. However, the plots clearly show any wayward outliers which can nevertheless indicate problem regions in the structure.

25.2.6.4. Which parameters are best?

Possibly the most telling and useful of the `quality' indicators for a protein model is the Ramachandran plot of residue ϕ–ψ torsion angles. This can often detect gross errors in the structure (Kleywegt & Jones, 1996a[link],b[link]). In the original Ramachandran plot (Ramachandran et al., 1963[link]; Ramakrishnan & Ramachandran, 1965[link]), the `allowed' regions were defined on the basis of simulations of dipeptides. In the PROCHECK version, the different regions of the plot are defined on the basis of how densely they are populated with data points taken from a database of well refined protein structures. The regions are: core, allowed, generously allowed and disallowed.

The `core' regions are particularly important; the points on the plot tend to converge towards these regions, and to cluster more tightly within them, as one goes from structures solved at low resolution to those solved at high resolution (Morris et al., 1992[link]). This trend has recently been confirmed by Wilson et al. (1998)[link], who looked at the case of atomic resolution structures. It has also been analysed in terms of `attractors' at the most favourable regions of the plot; as the resolution improves, so the points are drawn towards these attractors (Walther & Cohen, 1999[link]).

Fig. 25.2.6.1[link] shows the original PROCHECK Ramachandran plot and a more up-to-date version. The original was based on all 462 structures known at that time (1989/90), while the more recent one, generated in 1998, is based on 1128 non-identical structures (i.e. having a sequence identity <95%). It can be seen that the second plot has core regions which are much tighter than the original, and this is primarily due to the increase in the number of very high resolution structures giving a more accurate representation of the tight clustering in the most favourable regions.
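Defining the regions empirically from a database of ϕ–ψ pairs can be sketched with a 2D histogram. The cut-off fractions below are illustrative assumptions, not PROCHECK's actual definitions:

```python
import numpy as np

def ramachandran_regions(phis, psis, bins=36, core_frac=0.5,
                         allowed_frac=0.9):
    """Label cells of a phi/psi grid by how densely they are populated
    with database points: the densest cells holding `core_frac` of the
    data become 'core', the next tier up to `allowed_frac` 'allowed',
    remaining populated cells 'generous' and empty cells 'disallowed'."""
    h, _, _ = np.histogram2d(phis, psis, bins=bins,
                             range=[[-180, 180], [-180, 180]])
    order = np.argsort(h, axis=None)[::-1]          # densest cells first
    cum = np.cumsum(h.flat[order]) / h.sum()
    labels = np.full(h.size, 'disallowed', dtype=object)
    prev = 0.0
    for rank, cell in enumerate(order):
        if h.flat[cell] == 0:
            break                                   # only empty cells remain
        if prev < core_frac:
            labels[cell] = 'core'
        elif prev < allowed_frac:
            labels[cell] = 'allowed'
        else:
            labels[cell] = 'generous'
        prev = cum[rank]
    return labels.reshape(h.shape)
```

As the database grows and the high-resolution points cluster more tightly, the cells labelled `core' shrink, which is exactly the trend described for the 1998 update.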

Figure 25.2.6.1

PROCHECK Ramachandran plots showing the different regions, shaded according to how `favourable' the ϕ–ψ combinations are, for (a) the original version of the program (1992) and (b) an updated version based on a more recent data set (1998) including more high-resolution structures. The `core' and other favourable regions of the plot are more tightly compressed in the new version, with the white, disfavoured regions occupying more of the space.

Another parameter that seems to be a particularly sensitive measure of quality is the standard uncertainty (s.u.) of the χ torsion angles. Morris et al. (1992)[link] found that the average values of a protein's χ1 and χ2 torsion angles are well correlated with the resolution at which the protein structure was solved. Although the data set was a fairly small one, the conclusion was borne out when tested on a larger set of more recent structures, including some solved to atomic resolution (Wilson et al., 1998[link]). This measure, however, cannot be relied on where side-chain conformations are either restrained or heavily influenced by the use of rotamer libraries.
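A sketch of such a measure: assign each χ1 angle to the nearest staggered-rotamer mean (the values below are taken from Table 25.2.6.1; the pooling itself is a simplified illustration, not the Morris et al. procedure):

```python
import numpy as np

def chi1_pooled_sd(chi1_values, means=(-66.0, 63.2, 182.7)):
    """Pooled standard deviation of chi1 angles about the nearest of the
    three staggered-rotamer means (gauche+, gauche-, trans), with
    wrap-around handled on the -180..180 circle.  Larger values suggest
    a less precisely determined (lower-resolution) structure."""
    devs = []
    for chi in chi1_values:
        d = [(chi - m + 180.0) % 360.0 - 180.0 for m in means]
        devs.append(min(d, key=abs))
    return float(np.sqrt(np.mean(np.square(devs))))
```

Note that this indicator is meaningless when side-chain torsions were restrained or rotamer libraries heavily used, as the text cautions.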

25.2.6.5. Input

The primary input to PROCHECK is the file containing the 3D coordinates of the protein structure to be processed. The file is required to be in PDB format. An additional input file is the parameter file that governs which plots are to be generated and deals with certain aspects of their appearance.

25.2.6.6. Output produced

The output of the program consists of a number of PostScript plots, together with a full listing of the individual parameter values for each residue, with any unusual geometrical properties highlighted. The listing also provides summaries for the protein as a whole. Figs. 25.2.6.2[link] and 25.2.6.3[link] show parts of one of the PostScript plots generated, showing the variation of various residue properties along the length of the protein chain. Unusual regions, which are highlighted on these plots, may require further investigation by the crystallographer.

Figure 25.2.6.2

Two of the residue-property plots generated by PROCHECK. The plots shown here are (a) the absolute deviation from the mean of the χ1 torsion angle (excluding prolines) and (b) the absolute deviation from the mean of the ω torsion angle. Usually, three such plots are shown per page and can be selected from a set of 14 possible plots. On each graph, unusual values (usually those more than 2.0 standard deviations away from the `ideal' mean value) are highlighted.

Figure 25.2.6.3

Schematic plots of various residue-by-residue properties, showing (d) the protein secondary structure, with the shading behind it giving an approximation to each residue's accessibility, the darker the shading the more buried the residue; (e) the protein sequence plus markers identifying the region of the Ramachandran plot in which the residue is located; (f) a histogram of asterisks and plus signs showing each residue's maximum deviation from one of the ideal values, as shown on the residue-by-residue listing; and (g) the residue `G factor' values for various properties, where the darker the square the more `unusual' the property.

25.2.6.7. Other validation tools


PROCHECK is merely one of a number of validation tools that are freely available, some of which are mentioned elsewhere in this volume. The best known are WHATCHECK (Hooft et al., 1996[link]), PROVE (Pontius et al., 1996[link]), SQUID (Oldfield, 1992[link]) and VERIFY3D (Eisenberg et al., 1997[link]). Tools such as OOPS (Kleywegt & Jones, 1996b[link]) or the X-build validation in QUANTA (MSI, 1997[link]) apply standard tests to the geometry of a structure and list residues with unexpected features, making it easy to check electron-density maps at suspect points.

25.2.7. MolScript

P. J. Kraulisr*

25.2.7.1. Introduction


Visualization of the atomic coordinate data obtained from a crystallographic study is a necessary step in the analysis and interpretation of the structure. The scientist may use visualization for different purposes, such as obtaining an overview of the structure as a whole, or studying particular spatial relationships in detail. Different levels of graphical abstraction are therefore required. In some cases, the atomic details need to be visualized, while in other cases, high-level structural features must be displayed.

In the study of protein 3D structures in particular, there is an obvious need to visualize structural features at a level higher than atomic. A common graphical `symbolic language' has evolved to represent schematically hydrogen-bonded repetitive structures (secondary structure) in proteins. Cylinders or helical ribbons are used for α-helices, while arrows or ribbons show strands in β-sheets.

The domain of the MolScript program is the production of publication-quality images of molecular structures, in particular protein structures. The implementation of MolScript is based on two design principles: First, the program must allow both schematic and detailed graphical representations to be used in the same image. Second, the user must be able to control the precise visual appearance of the various graphics objects in as much detail as possible.

The original version of MolScript was written in Fortran77 and produced only PostScript output (Kraulis, 1991[link]). The current version (v2.1.2, as of January 1999) has been completely rewritten in the C programming language. The new version is almost completely compatible with previous versions. The main new features in version 2 are several new output formats, the interactive OpenGL mode (see below) and dynamic memory allocation for all operations.

This section reviews the basic features of MolScript. Detailed information about the program, including instructions on how to obtain the software, can be found at the official MolScript web site http://www.avatar.se/molscript/ , where the online manual is also available.

25.2.7.2. Input


The input to MolScript consists of the coordinate file(s) in standard PDB format and a script file which describes the orientation of the structures, the graphics objects to display and the graphics state parameters that control the visual appearance of the objects.

The script may be created automatically by the utility program MolAuto (see below), or manually by the user in a standard text editor. The script may invoke other external script files or command macros, and it may also contain in-line atomic coordinate data.
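
As a sketch of what such a script might look like, the fragment below follows the general command style described in the MolScript online manual; the filename, rotation angle and residue ranges are invented for illustration, and the parameter names should be checked against the manual.

```
! Hypothetical MolScript script (filename, angle and residue
! ranges are invented for illustration).
plot
  read mol "example.pdb";

  ! Orient the structure: centre it, then rotate about x.
  transform atom *
    by centre position atom *
    by rotation x 90.0;

  ! Graphics state settings apply to the commands below them.
  set segments 8;

  ! Graphics objects: schematic secondary structure.
  strand from 2 to 8;
  coil from 8 to 12;
  helix from 12 to 24;
end_plot
```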

25.2.7.3. Graphics


The basic model for the execution of MolScript is that of a non-interactive image-creating script processor. There are two stages in the execution. First, the script is parsed and the graphics objects are created according to the commands. This stage is essentially independent of the output format. Second, when the end of the script has been reached, the image is rendered from the graphics objects according to the chosen output format.

25.2.7.3.1. The coordinate system


The viewpoint in the MolScript right-handed coordinate system is always located on the positive z axis, looking towards the origin, with the positive x axis to the right. The user obtains the desired view of the structure by specifying rotations and translations of the atomic coordinates; it is not possible to change the location or the direction of the viewpoint.

There are two main benefits to this scheme. The first is that it is similar to the way we handle objects in everyday life: we do not normally fly around an object, but rather move it about with our hands. The second is that, together with the coordinate copy feature, it can be used to compose an image containing several geometrically related subunits.

The disadvantage is that the atomic coordinates must be transformed before the creation of the graphics objects. This may complicate the composition of an image where another structure or geometric object is to be included. For example, if two separate structures have been aligned structurally by some external procedure, then the user must take care not to destroy the alignment in the process of setting the viewing transformation.

25.2.7.3.2. The graphics state


The graphics state consists of the parameters that determine the exact visual appearance of the graphics objects. The default values of the graphics state parameters are reasonable, so that an image of acceptable quality can be produced quickly. However, to obtain high-quality images which emphasize the relevant structural features, the user must usually fine-tune the rendering by modifying the graphics state parameters appropriately for the various graphics objects.

The graphics state may be modified by using the `set' command at any point in the script file. The change affects only graphics commands that follow that point in the script. When a graphics command is processed, the object is created according to the values of the graphics state parameters at that exact point in the script file. It is this property of the graphics state that gives the user a very high degree of control over the composition and appearance of the graphics objects.
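
A small hypothetical fragment illustrates this scoping; the parameter names (`coilradius', `planecolour') are quoted from memory of the manual and should be verified there before use.

```
! The `set' command affects only commands that follow it.
set coilradius 0.1;
coil from 1 to 20;      ! drawn thin

set coilradius 0.3;
set planecolour green;
coil from 20 to 40;     ! drawn thicker and green
```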

A new feature in version 2 of MolScript is the ability to set the colour of residues on a residue-by-residue basis in schematic secondary-structure representations. It is also possible to set the colour of atoms and residues according to a linear function of the B factors.

25.2.7.3.3. Graphics commands


The graphics commands create the graphics objects to be rendered in the final image. Each command takes an atom or residue selection (see below) as an argument. The visual attributes, and in some cases the dimensions, of the objects are determined by the graphics state parameters.

The graphics objects include the most common ways to represent atoms, such as simple line drawings, ball-and-stick models and CPK (Corey–Pauling–Koltun) spheres approximating the van der Waals radii of the atoms. The graphics objects representing high-level structures are mainly designed for protein structures, and comprise arrows for β-sheet strands, cylinders or helical ribbons for α-helices, and coils for non-repetitive peptide chain structures. The coil object can also be used to represent oligonucleotide backbone structures.

25.2.7.3.4. Atom and residue selection


A set of basic atom and residue selections is provided in MolScript for use as arguments to the graphics commands. Arbitrary subsets of atoms or residues can be specified by joining together the basic selections with Boolean operators. Unfortunately, the Boolean expression feature may sometimes be difficult for non-expert users to understand. One should consider the entire expression as a test to be applied to every atom or residue. Any atom or residue for which the Boolean expression evaluates to `true' is selected as an argument for the command.
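
For example, a command might select the side-chain atoms of one residue range in one chain. The fragment below is hypothetical; the exact selection keywords should be checked against the online manual.

```
! Hypothetical selection: the Boolean test is applied to every
! atom; only atoms for which it evaluates to true are drawn.
ball-and-stick in chain A and from 40 to 60
               and not backbone and not hydrogens;
```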

25.2.7.3.5. External objects


Externally defined objects described by points, lines or triangular surfaces may be included in the image. The objects may optionally be transformed by the most recent transformation applied to the coordinate data. This feature allows import of arbitrary geometry created by some external software, e.g. molecular surfaces, electron-density representations or electrostatic field lines. The graphics state parameters apply to the rendering of the external objects in the image.

25.2.7.4. Output


The current implementation of the MolScript source code makes it possible to add new output formats. The intention is that all output formats should produce visually identical images given the same input. Unfortunately, this goal is hard to achieve due to various technical issues, such as the different formalisms used to describe lighting and material properties.

25.2.7.4.1. PostScript


PostScript (Adobe Systems Inc., 1985[link]) is a page description language for controlling high-quality printers. More information can be found at the Adobe Inc. web site, http://www.adobe.com/print/ .

The PostScript output mode relies on the painter's algorithm for hidden-surface removal. The most distant graphics segments are output first, continuing with the segments closer to the viewpoint, which may obliterate previously rendered segments. The implementation of this procedure is straightforward, and gives good results provided that the graphics objects are subdivided into sufficiently small segments. The PostScript mode allows more than one plot (image) to be rendered on a single page.
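
The ordering step of the painter's algorithm can be sketched in a few lines of Python (an illustrative sketch, not MolScript source code; the segment representation is invented):

```python
# Painter's algorithm ordering: with the viewer on the positive
# z axis, smaller z means farther away, so sorting by ascending z
# gives the far-to-near drawing order in which nearer segments
# overwrite farther ones.

def painters_order(segments):
    """Sort (z_midpoint, payload) segments most-distant first."""
    return sorted(segments, key=lambda seg: seg[0])

segments = [(5.0, "near stick"), (-3.0, "far stick"), (1.0, "mid stick")]
ordered = [name for z, name in painters_order(segments)]
# ordered == ["far stick", "mid stick", "near stick"]
```

As the text notes, this works well only when each graphics object is subdivided into segments small enough that a single depth value per segment is a fair approximation.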

25.2.7.4.2. Raster3D


The Raster3D suite of programs (Merritt & Bacon, 1997[link]) produces high-quality images using a ray-tracing algorithm. MolScript can produce the input file required for the `render' program, which is the core program of the Raster3D suite. The web site for Raster3D is http://www.bmsc.washington.edu/raster3d/ .

The Raster3D mode supports highlights, transparency and shadows, allowing MolScript images of very high visual quality to be produced.

25.2.7.4.3. VRML97


The VRML97 standard (Virtual Reality Modeling Language, formerly VRML 2.0) allows storage and transmission of 3D scenes in a system-independent manner over the web. Software to view VRML97 files is typically included in any modern web browser. A web site containing more information on VRML97 is http://www.web3d.org/x3d/specifications/vrml/ .

The VRML97 mode allows hyperlinking of objects. The MolScript implementation is optimized to produce output files that are as small as possible, but the file size is strongly dependent on the value of the `segments' parameter in the graphics state.

25.2.7.4.4. OpenGL


OpenGL is a standard API (Applications Programming Interface) for interactive 3D graphics. It is available on most current computer systems. For more information, see the web site http://www.opengl.org/ .

The OpenGL output mode allows a certain degree of interactivity, in contrast to the other output modes. It is possible to initiate execution of the MolScript program in OpenGL mode in one window on the screen, while keeping the script file in a separate text-editor window. The image is rotatable in 3D in the OpenGL window. The script can be edited in its window, and the modified script can be re-read and displayed directly by the MolScript program in its OpenGL window. This simplifies to some extent the iterative fine-tuning of the script.

25.2.7.4.5. Image files


Raster image files in several different formats can be created by MolScript. Currently these include SGI RGB, encapsulated PostScript (EPS), JPEG, PNG and GIF image formats. The JPEG, PNG and GIF formats require that external software libraries are available during the compilation and linking of the MolScript program. Software libraries for several of these image formats are available on the web; links are given at the official MolScript web site http://www.avatar.se/molscript/ .

The image file formats essentially capture the raster image created by the OpenGL implementation. The EPS format was a variant of the PostScript output mode in version 1 of MolScript, but for various reasons this has changed in version 2 to an encoding of the OpenGL raster image.

25.2.7.5. Utilities


A utility program called MolAuto is included in the MolScript software distribution. It reads a standard-format PDB coordinate file to produce a first-approximation script file for MolScript. This is a simple way to produce a starting point for further manual editing.

The MolAuto and MolScript programs have been designed to work well as software tools in the UNIX environment. This allows the programs to be embedded in more comprehensive software systems for automated creation and/or storage of images. An example of such a system is the web interface to the RCSB Protein Data Bank (PDB, http://www.rcsb.org/pdb/ ), which employs MolScript (among other tools) for visualization of the coordinate data sets.

25.2.8. MAGE, PROBE and kinemages

D. C. Richardsons and J. S. Richardsons*

25.2.8.1. Introduction to aims and concepts


MAGE and the kinemages it displays (Richardson & Richardson, 1992[link], 1994[link]) provide molecular graphics, organized in an unusual way, that are of interest to crystallographers for uses that range from interactive illustrations for teaching to a representation of all-atom van der Waals contacts, calculated by PROBE (Word, Lovell, LaBean et al., 1999[link]), to help guide model-to-map fitting.

A kinemage (`kinetic image') is an authored interactive 3D illustration that allows open-ended exploration but has viewpoint, explanation and emphasis built in. A kinemage is stored as a human-readable flat ASCII text file that embodies the data structure and 3D plotting information chosen by its author or user. MAGE is a pure graphics display program designed to show and edit kinemages, while PREKIN constructs molecular kinemages from PDB (Protein Data Bank; Research Collaboratory for Structural Bioinformatics, 2000[link]) files. The latest versions (currently 5.7) of MAGE and PREKIN are available free for Macintosh, PC, Linux, or UNIX from the kinemage web site (Richardson Laboratory, 2000[link]). The programs operate very nearly equivalently on different platforms and, by policy, later versions of MAGE can display all older kinemages. A Java `Magelet' can show small kinemages directly in suitable web browsers, with their first-level interactive capabilities of rotation, identification, measurement, views and animation.

MAGE has no internal knowledge of molecular structure. A collaboration between the author and the authoring program (e.g. PREKIN) builds data organization into the kinemage itself. This two-layer approach has great advantages in flexibility, since an author can show things the programmer never imagined, including non-molecular 3D relationships. Overall, kinemages demand less work and less expertise from the reader or viewer than do traditional graphics programs, but that ease of use depends on the effort involved in thoughtful authoring choices, aided by the extensive on-screen editing capabilities described below.

MAGE has been designed to optimize visual comprehension: the understanding and communication of specific 3D relationships inside complex molecules. Display speed has been given priority, to ensure good depth perception from smooth real-time rotation. The interface is extremely simple and transparent, and the colour palette is tuned for comparisons, contrasts and depth cueing. Immediate identification and measurement are always active; views, animations, or bond rotations can be built in by the kinemage author. Text and caption windows explain the intentions of the author, while a simple hypertext capability allows the reader to jump to the specific view and display objects being described; however, most kinemages can also be successfully understood just by exploring what is available within the graphics window.

Kinemages are suitable for structure browsing or producing static 2D presentation graphics, but those aspects have been kept secondary to effectiveness for interactive visualization and flexibility of author specification. Features and representations have deliberately been chosen to be fast, simple and informative rather than either showy or traditional, as illustrated by the following examples and their rationales. Mouse-controlled rotation in MAGE depends only on the direction of drag, so that the behaviour of the image is independent of absolute cursor position within the window. Labels are available but seldom needed, since the data structure builds in a `pointID' that is displayed whenever the point is picked. Instead of using half-bond colouring, which tends to chop up the image, PREKIN provides separate colours and button controls for main chain versus side chains, and it can prepare a partial `ball-and-stick' representation with colour-coded balls on non-carbon, non-hydrogen atoms (see Fig. 25.2.8.1[link]). Hydrogen atoms are crucial for some research uses, but to minimize the clutter from twice as many atoms, PREKIN sets up their display under button control; in addition, a `lens' parameter can be specified for the list, allowing display only within a radius of the last picked centre point. For effective perception of conformational change, while avoiding either the confusion of overlays or the potential misrepresentation of computed interpolation, MAGE features simple animation switching between known conformations. Very importantly, since molecular information resides mostly in chemical bonds and spatial proximity, kinemages emphasize fully 3D representations, such as vectors, dots, or `ball-and-stick' models, rather than surface graphics that obscure internal structure. A space-filling representation (the `spherelist') is available, but it is suggested that it be used very sparingly – for example, to show the size and shape of a small-molecule ligand.
If an extensive surface is needed, a dot surface is more informative, since the underlying atoms and bonds can be seen at the same time. Nothing matches a well rendered ribbon for conveying overall `fold'; PREKIN calculates and MAGE displays simple ribbon schematics (see Fig. 25.2.8.2[link]) which can be rendered by Raster3D (Merritt, 2000[link]) or POV-Ray (POV-Ray Team, 2000[link]) for a static 2D illustration, but for interactive use they serve mainly as introduction and context for more detailed `ball-and-stick', vector and dot representations.

Figure 25.2.8.1

A typical macromolecular kinemage, combining details with context in the interactive display, for a glucocorticoid receptor–DNA complex (PDB file 1GLU). This view looks down the recognition helix, with one of the 4-Cys Zn sites on the right. Two sequence-specific binding interactions are shown with partial `ball-and-stick' representation: the Arg–guanine double hydrogen bond and the hydrophobic packing of Val to thymine methyl. DNA bases are in gold and protein side chains in pink, while atom balls are colour-coded as N blue, O red, C green, S yellow and Zn grey. Context is provided by the Cα backbone for the protein and a virtual backbone for the DNA (using P, C4′ and C1′), with lines symbolizing the rest of the base pairs.

Figure 25.2.8.2

A ribbon-schematic kinemage of ribonuclease A (PDB file 7RSA), with β strands as sea-green arrows, helices as gold spirals and loops as single splines in white (produced from a built-in script in PREKIN and rotatable in MAGE). Ribbons have edges to give them some thickness and are shaded rather than depth-cued; Cα positions for the active-site His side chains were moved slightly to lie in the ribbon plane.

For kinemages, the representation style is not a global choice that applies to everything shown, but rather is a set of local options (varied across space or sequence) chosen to provide appropriate emphasis and comprehensible detail within context.

25.2.8.2. Use as a reader of existing kinemages


Viewing a pre-existing kinemage file requires almost no learning process: the interface is sufficiently `transparent' that interaction is mainly with the molecule rather than with the program. Six simple operations cover all basic functionalities: (1) drag with the mouse to rotate the displayed object; (2) click on a point to identify it; (3) turn things on or off, or animate if that option is present, with labelled buttons; (4) choose preset views from the Views pull-down menu; (5) read the author's explanations in the text and caption windows; (6) change to the next kinemage in the file with the Kinemage pull-down menu. At a slightly more complex level, one can recentre, zoom the scale, move the clipping planes and save a view; measure distances, angles and dihedrals or `Find' by point name (from the Tools pull-down menu); change Display menu options such as stereo or perspective; or consult the Help menu. There are keyboard shortcuts for convenience (such as `a' to animate or `c' for cross-eye versus wall-eye stereo), but they are never the only method and they are defined on the menus. Demo5_4a.kin (Richardson Laboratory, 2000[link]) provides a brief guided introduction to using kinemages.

25.2.8.3. Use for teaching


Simplicity of interface, attention to presentation issues and free cross-platform availability make MAGE and kinemages especially well suited for teaching and learning about macromolecular structure or about crystallographic concepts such as handedness and symmetry. Suggestions can be found in Richardson & Richardson (1992[link], 1994[link]) and in file KinTeach.txt (Richardson Laboratory, 2000[link]). A large body of teaching material is available in kinemage form, including supplements for textbooks on protein structure and general biochemistry, the Protein Tourist files and kinemages for specific papers in Protein Science, and a great many web sites involving kinemages, some of which contain course materials (e.g. Bateman, 2000[link]). References, links and examples can be found on the kinemage web site (Richardson Laboratory, 2000[link]).

25.2.8.4. Use for research


For general molecular-structure studies, kinemages act as a 3D laboratory notebook where author and reader are the same person. These kinemages keep a visual record of the research process with selections, views, labels, measurements, superpositions etc., plus a descriptive record in the text and caption windows. Setting up an animation between conformations or between related structures is an easy and very sensitive way of seeing changes, including correlated motions. Completely new display objects and organizations can be added to kinemages, such as 3D plots of related non-molecular data. Kinemages are an easy and platform-independent way of sharing ideas with collaborators, either side-by-side or at a distance with simultaneous discussion, or just by sending a kinemage with its preset views and notes. Later, the working research kinemages can be used to produce either static 2D or interactive illustrations for lectures or publication.

In addition, MAGE and PROBE incorporate research tools not yet available in other display systems. PROBE analyses molecular interactions by calculating small-probe contact dots wherever two atoms are within 0.5 Å of van der Waals contact (Word, Lovell, LaBean et al., 1999[link]), for numerical scoring or for display in MAGE (see Fig. 25.2.8.3[link]), where the three types of contacts (hydrogen bonds, favourable van der Waals contacts and unfavourable `clash' overlaps) are under separate control. Contact-dot analysis requires all hydrogen atoms; they are added by REDUCE (Word, Lovell, Richardson & Richardson, 1999[link]), which optimizes the positions of OH, SH, NH3 and Met CH3 hydrogen atoms and possible 180° flips of Asn, Gln, or His, considering both van der Waals clashes and hydrogen bonds analysed combinatorially in local networks. These contact-surface tools have research uses that fall into two distinct categories: one is study of the patterns and causes of particular structural features in molecules (best done on atomic resolution structures); the other is sensitive testing, validating and adjusting of an individual molecular model, either computational or experimental. As an example of the latter type, MAGE can call PROBE interactively for a real-time update of the all-atom contacts as bonds are rotated to find the best predicted position for a mutated side chain (Word et al., 2000[link]).
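
The gap computation underlying contact dots can be sketched as follows. This is an illustrative Python sketch: the class thresholds are invented for illustration and are not PROBE's actual parameters, and hydrogen-bond detection, which requires donor/acceptor chemistry, is omitted.

```python
import math

def contact_class(p1, r1, p2, r2, dot_cutoff=0.5):
    """Classify an atom pair by its van der Waals gap.

    p1, p2 are (x, y, z) centres; r1, r2 are van der Waals radii.
    A negative gap means the atoms overlap.  Thresholds here are
    illustrative, not PROBE's actual parameters.
    """
    gap = math.dist(p1, p2) - (r1 + r2)
    if gap > dot_cutoff:
        return None                # too far apart: no dots drawn
    if gap >= 0.0:
        return "favourable contact"
    if gap > -0.4:
        return "slight overlap"    # still favourable
    return "clash"                 # serious, unfavourable overlap
```

This mirrors the colour coding described for Fig. 25.2.8.3, where dots are shaded by the gap between atoms, from wide gaps through close fits to overlaps.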

Figure 25.2.8.3

A thin slice through an all-atom contact kinemage showing the van der Waals interactions of Pro203 and neighbouring atoms in Zn elastase at 1.5 Å resolution (PDB file 1EZM). Contact dots are colour-coded by the gap between atoms: blue for wider gaps (up to 0.5 Å), green for closer fit and yellow for slight overlap (but still favourable). Note the extensive contact and the interdigitation of hydrogen atoms. White markers show the last two points picked (Pro Hg1 and one of its contact dots), while the distance between them and the identity of the last one are shown at bottom left. Hydrogen atoms are from REDUCE, contact dots are from PROBE and the display is done in MAGE.

25.2.8.5. Contact dots in crystallographic rebuilding


In crystallography, the most important use of contact dots is for quickly finding, and frequently for fixing, problems with molecular geometry during fitting and refinement. All-atom contact dots add independent new information to that process, since the hydrogen atoms make almost no contribution either to the electron density or to the energetic component of refinement as presently done, yet they are undeniably present and cannot significantly overlap other atoms. The steric constraints implied by all-atom contacts are significantly more stringent than those based only on non-hydrogen atoms, yet they are obeyed almost perfectly by low-B regions of structures at resolutions near 1 Å, even when hydrogen atoms were not used in refinement.

At any stage of a structure determination, contact dots for the entire molecule or molecules can be calculated by PROBE and examined in MAGE, or a list of the more severe clashes can be generated. However, it is most effective to use contact-dot information directly in the process of fitting and rebuilding. Therefore, for use with O (Jones et al., 1991[link]), there are macros that script a call to REDUCE to add hydrogen atoms on the fly, then to PROBE, and that show the resulting dots along with the map and model (Fig. 25.2.8.4[link]). XtalView (McRee, 1993[link], 1999[link]) has been modified to handle hydrogen atoms and to call PROBE interactively for update of the contact dots as the model is re-fitted. In either system, conformational choices often become unambiguous, even when the electron density alone does not distinguish them. This criterion can locate a Met methyl (Fig. 25.2.8.4[link]), find the correct orientation for the final branch of an Asn or Gln, or a Thr, Val or Leu, improve the backbone conformation of a Gly, disentangle alternate conformations, or show which direction a ligand binds, all at a lower resolution than otherwise possible. A much-improved rotamer library has been constructed using all-atom contact analysis on a high-resolution B-factor-edited database (Lovell et al., 2000[link]); it is available as drop-in files for O or XtalView, and also improves the fitting process.

Figure 25.2.8.4

All-atom contact dots being used in O for model rebuilding during refinement of a Trp tRNA synthetase at 2.3 Å resolution (courtesy of Charles W. Carter, University of North Carolina, Chapel Hill). The Met side-chain density is fairly round and accommodates the original fitting reasonably, but the contact dots show serious clashes (red spikes in the left panel). Rebuilding relieved all clashes (right panel), with equal or better fit to the map and to rotamer preferences. In this case, the electron density can be contoured lower to show a small but definite bulge in the direction of the rebuilt methyl.

25.2.8.6. Making kinemages


Setting up a research kinemage is very simple, since most decisions will be made later, during its use. For instance, to make a contact-dot kinemage one would run REDUCE on the PDB-format coordinate file to add hydrogen atoms, then run the `lots' script in PREKIN (which produces vectors for the main chain, side chains, hydrogen atoms and non-water heterogens, balls for waters, and pointID's that include B factors), and then run PROBE, appending its contact-dot output to the kinemage file. In UNIX, these three steps can be combined in a command-line script.
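
Such a script might look like the sketch below. The option names are placeholders that have not been checked against the actual REDUCE, PREKIN and PROBE command-line documentation; consult each program's help text before use.

```shell
# Hypothetical three-step pipeline; option names are placeholders.
reduce mystruct.pdb > mystructH.pdb        # 1. add hydrogen atoms
prekin -lots mystructH.pdb mystruct.kin    # 2. build the kinemage
probe mystructH.pdb >> mystruct.kin        # 3. append contact dots
```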

Making a kinemage for teaching, publication, or distribution is a more iterative and deliberate process. Since MAGE and PREKIN continue to evolve, it is an advantage to use the latest version (Richardson Laboratory, 2000[link]). For a first look at what is in the PDB file, accept the default `backbone browser' option in PREKIN, which will produce a Cα backbone (or a virtual backbone for nucleic acids, as in Fig. 25.2.8.1[link]), disulfides and non-water heterogen groups for all subunits in the file and will automatically launch MAGE, where one can decide what else to add. In a second PREKIN run, one can choose from a menu of built-in scripts such as main chain plus hydrogen bonds, `lots', ribbons (as in Fig. 25.2.8.2[link]) or Cα's plus all side chains grouped and coloured by type. One can also ask for specified items in a `focus' around a chosen residue. Alternatively, in the `New Ranges' dialogue box, one can specify combinations of main chain, side chains, hydrogen bonds, hydrogen atoms, waters, non-water heterogens, balls, ribbons, rotatable or mutated side chains etc. for any set of residue ranges. Subunits (or models, if NMR) are chosen in a final dialogue. The resulting kinemage file will then be displayed and modified in MAGE.

On-screen editing of a kinemage in MAGE usually begins with setting up a few good views: rotate, pick centre, zoom and clip to optimize each one, and save it with `Keep Current View' on the Edit pull-down menu; it then shows up on the Views pull-down menu with the given label. Turning on `Change Color' (Edit menu) then picking any atom in an object allows selection of a new colour from the scrollable choices. Demo5_4b.kin shows the palette of colours with their names and gives some guidelines for choosing effective colours. Context is important for a kinemage (usually at least overall Cα's), but as much as possible should be deleted that is not directly relevant, while the features of current interest are emphasized (e.g. Fig. 25.2.8.1[link]). This selection process is like the simplification and emphasis needed for a good 2D illustration, but in this case it applies to the fully interactive 3D form. For a kinemage, however, it is both possible and advantageous to include some additional details for further exploration, controlled by a button which can start out turned off. For deleting things, `Prune' on the Edit menu activates four new buttons on the right-hand panel: `punch' removes the vectors on either side of a picked point, `prune' removes an entire connected line segment, `auger' removes everything within a marked circle on the screen and `undo p' recovers from a mistake, back for ten steps. If, for example, side chains are being shown in a focus around the active site, one could prune away those that don't interact at all, and then move the second-shell side chains to a separate list with the word `off' in its first line. `Text Editable' (Edit menu) enables writing explanations in the text and caption windows, while the graphics window is still active for reference. `Save As' (File menu) will save the whole edited kinemage file and reload to show the revised kinemage in its startup view. 
As well as a bitmap screen capture or files for rendering, a PostScript file can also be written to print out a 2D picture of the current graphics window, either in colour or `black on white'.

At this stage, a word processor can be used to look at the plain ASCII kinemage file, with its text, its views and the hierarchy of group, subgroup and list display objects in human-readable and clearly identified forms. Lists (e.g. @vectorlist {name}) can be of vectors, dots, labels, words, balls, spheres, triangles or ribbons. Any part of the file can be edited, using its existing format as a guide or looking at another kinemage file that provides a desired template. Among the few operations that currently must be performed outside rather than inside MAGE are moving things between different lists or groups (for instance, setting up a new list of just active-site side chains in a different colour and controlled by their own button) and adding `master' buttons that control object display independent of the group hierarchy (e.g. side chains can be turned off and on together for all subunits or models if `master= {side ch}' is added to the first line of each of those lists). The kinemage should be saved without formatting, as a plain text file.
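To make the hierarchy concrete, a kinemage file fragment might look like the sketch below. This is an invented illustration: the @group, @subgroup and @vectorlist keywords and the master= option are those described above, but the point-line syntax and other details should be checked against KinFmt54.txt or an existing kinemage file.

```
@kinemage 1
@group {mol A}
@subgroup {backbone}
@vectorlist {ca trace} color= white
{ca ala 1}P 12.5, 3.1, -8.2
{ca gly 2} 14.1, 4.0, -6.7
@subgroup {active site}
@vectorlist {side ch A} color= green master= {side ch}
{cb his 57}P 15.0, 2.2, -5.9
{cg his 57} 15.8, 1.4, -5.1
```

Here `master= {side ch}' on each such list would let a single button toggle the side chains of all subunits at once, as described above.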

More complex modifications are possible in MAGE, using advanced on-screen editing and construction features from the Edit menu. `Draw new' activates tools that can add labels, draw hydrogen bonds (with shortened, unselectable lines) and make a variety of geometrical constructs by building out from the original atoms (e.g. add a Cβ to a Gly, or draw helix axes and measure their distance and angle). `Show Object Properties' lets one see, and edit, the names and parameters of the object hierarchy for any point picked, which allows renaming buttons, simplifying the button panel, adding animation, editing labels, or deleting entire display objects. `Remote Update', on the Tools menu, can call PREKIN to set up rotations for the last-picked side chain or a mutation of it, and can then call PROBE to update all-atom contacts interactively as the angles are changed. On the kinemage web site (Richardson Laboratory, 2000[link]), Demo5_4a.kin includes an introduction to the drawing tools and Demo5_4b.kin to the format and to editing. Make_kin.txt is a more complete tutorial on the process of constructing kinemages. Mage5_4.txt and Pkin5_4.txt document the features of the MAGE and PREKIN programs. File KinFmt54.txt (which also constitutes the MIME standard chemical/x-kinemage) is a formal description of the kinemage format for 3D display.

All in all, making a simple kinemage is trivial, but making really good ones for use by others is much like making a good web page. There are tools that make the individual steps easy, but one needs to exercise restraint to keep it simple enough to be both fast and comprehensible, patience to keep looking at the result and modifying it where needed, and judgment about both content and aesthetics.

25.2.8.7. Software notes

MAGE and PREKIN were written in C for Macintosh, PC, Linux, SGI and other UNIX platforms by David C. Richardson, who also maintains and extends them (with the help of Brent K. Presley for the Windows 95/98/NT port). PROBE (in C) and REDUCE (in C++) were written by J. Michael Word for SGI UNIX, Linux and PC Windows, but can be compiled on other platforms. The contact-dot additions to O and XtalView were written by Simon C. Lovell, J. Michael Word and Duncan E. McRee. For the modified XtalView (version 4.0), see http://www.sdsc.edu/CCMS/Packages/XTALVIEW/xtalview.html ; for O scripts and files, see http://xray.bmc.uu.se/usf/ ; the rest of the software, plus source and documentation files, is available free from the kinemage web or ftp site (Richardson Laboratory, 2000[link]).

25.2.9. XDS

W. Kabsch

25.2.9.1. Functional specification

The program package XDS (Kabsch, 1988a[link],b[link], 1993[link]) has been developed for the reduction of single-crystal diffraction data recorded on a planar detector by the rotation method using monochromatic X-rays. It includes a set of five programs:

  • (1) XDS accepts a sequence of adjacent, non-overlapping rotation images from a variety of imaging plate, CCD and multiwire area detectors and produces a list of corrected integrated intensities of the reflections occurring in the images. The program assumes that each image covers the same positive amount of crystal rotation and that rotation axis, incident beam and crystal intersect at one point, but otherwise imposes no limitations on detector position, or directions of rotation axis and incident beam, or on the oscillation range covered by each image.

  • (2) XPLAN provides information for identifying the optimal rotation range for collecting data. Based on detector position and unit-cell orientation obtained from evaluating one or a few rotation images using XDS, it reports the expected completeness of the data by simulating measurements at various rotation ranges specified by the user, thereby taking into account already-measured reflections.

  • (3) XSCALE places several data sets on a common scale, optionally merges them into one or several sets of unique reflections, and reports their completeness and quality of integrated intensities.

  • (4) VIEW displays rotation-data images as well as control images produced by XDS. It is used for checking the correctness of data processing and for deriving suitable values for some of the input parameters required by XDS. This program was coded in the computer language C by Werner Gebhard at the Max-Planck-Institut für medizinische Forschung in Heidelberg. The other programs are written in Fortran77, with the exception of a few C subroutines provided by Abrahams (1993)[link] for handling compressed images.

  • (5) XDSCONV converts reflection data files as obtained from XDS or XSCALE into various formats required by software packages for crystal structure determination. Test reflections previously selected for monitoring the progress of structure refinement may be inherited by the new output file, which simplifies the use of new data or switching between different structure-determination packages.

25.2.9.2. Components of the package

25.2.9.2.1. XDS

XDS is organized into eight steps (major subroutines) which are called in succession by the main program. Information is exchanged between the steps by files (see Table 25.2.9.1)[link], which allows repetition of selected steps with a different set of input parameters without rerunning the whole program. ASCII files can be inspected and modified using a text editor, whereas types DIR and BIN indicate binary random access and unformatted sequential access files, respectively. All files have a fixed name defined by XDS, which makes it mandatory to process each data set in a newly created directory. Clearly, one should not run more than one XDS job at a time in any given directory. Output files affected by rerunning selected steps (see Table 25.2.9.1)[link] should also first be given another name if their original contents are meant to be saved.

Table 25.2.9.1
Information exchange between program steps of XDS

Program step   Input files               Output files
               Name          Type        Name          Type
------------------------------------------------------------
XYCORR         XDS.INP       ASCII       XYCORR.LP     ASCII
                                         XYCORR.TBL    DIR
                                         FRAME.pck     BIN
INIT           XDS.INP       ASCII       INIT.LP       ASCII
               XYCORR.TBL    DIR         BKGPIX.TBL    DIR
                                         BLANK.TBL     DIR
                                         BKGPIX.IMG    DIR
COLSPOT        XDS.INP       ASCII       COLSPOT.LP    ASCII
               BKGPIX.TBL    DIR         SPOT.XDS      ASCII
               BLANK.TBL     DIR         BKGPIX.IMG    DIR
               XYCORR.TBL    DIR         FRAME.pck     BIN
IDXREF         XDS.INP       ASCII       IDXREF.LP     ASCII
               SPOT.XDS      ASCII       SPOT.XDS      ASCII
                                         XPARM.XDS     ASCII
COLPROF        XDS.INP       ASCII       COLPROF.LP    ASCII
               XPARM.XDS     ASCII       XREC.XDS      BIN
               BKGPIX.TBL    DIR         BKGPIX.IMG    DIR
               BLANK.TBL     DIR         FRAME.pck     BIN
               XYCORR.TBL    DIR
PROFIT         XDS.INP       ASCII       PROFIT.LP     ASCII
               XREC.XDS      BIN         PROFIT.HKL    DIR
CORRECT        XDS.INP       ASCII       CORRECT.LP    ASCII
               PROFIT.HKL    DIR         NORMAL.HKL    ASCII
                                         ANOMAL.HKL    ASCII
                                         XDS.HKL       DIR
                                         MISFITS       ASCII
GLOREF         XDS.INP       ASCII       GLOREF.LP    ASCII
               PROFIT.HKL    DIR         GXPARM.XDS    ASCII

Data processing begins by copying an appropriate input file into the new directory. Input-file templates are provided with the XDS package for a number of frequently used data-collection facilities. The copied input file must be renamed XDS.INP and edited to provide the correct parameter values for the actual data-collection experiment. All parameters in XDS.INP are named by keywords containing an equal sign as the last character, and many of them will be mentioned here in context to clarify their meaning. Execution of XDS (JOB= XDS) invokes each of the eight program steps as described below. Results and diagnostics from each step are saved in files with the extension LP attached to the program step name. These files should always be studied carefully to see whether processing was satisfactory or – in case of failure – to find out what could have gone wrong.
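A fragment of XDS.INP might then look like the sketch below. This is an invented illustration using only keywords mentioned in this section, with made-up values; a real file, best derived from one of the distributed templates, must also contain the detector and beam geometry parameters of the actual experiment.

```
JOB= XDS
DATA_RANGE= 1 90
SPOT_RANGE= 1 5
SPACE_GROUP_NUMBER= 19
UNIT_CELL_CONSTANTS= 51.2 67.8 88.3 90.0 90.0 90.0
BEAM_DIVERGENCE= 0.6
BEAM_DIVERGENCE_E.S.D.= 0.07
REFLECTING_RANGE= 1.2
REFLECTING_RANGE_E.S.D.= 0.15
```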

XYCORR calculates a lookup table of additive spatial corrections at each detector pixel and stores it in the file XYCORR.TBL. The data images are often already corrected for geometrical distortions, in which case XYCORR produces a table of zeros or – as for spiral read-out imaging plate detectors – computes the small corrections resulting from radial (ROFF=) and tangential (TOFF=) offset errors of the scanner. For some multiwire and CCD detectors that deliver geometrically distorted images, corrections are derived from a calibration image (BRASS_PLATE_IMAGE= file name). This image displays the response to a brass plate containing a regular grid of holes which is mounted in front of the detector and illuminated by an X-ray point source, e.g. 55Fe. Clearly, the source must be placed exactly at the location to be occupied by the crystal during the actual data collection, as photons emanating from the calibration source are meant to simulate all possible diffracted beam directions. For visual control using the VIEW program, spots that have been located and accepted from the brass-plate image by XYCORR are marked in the file FRAME.pck.

Problems: (a) A misplaced calibration source leads to an incorrect lookup table, impairing the correct prediction of the observed diffraction pattern in subsequent program steps. (b) Underexposure of the calibration image results in an incomplete and unreliable list of calibration spots.

INIT estimates the initial background at each pixel and determines the trusted region of the detector surface. The total background at each pixel is the sum of the detector noise and the X-ray background. The detector noise, saved in the lookup table BLANK.TBL, is determined from a specific image recorded in the absence of X-rays (DARK_CURRENT_IMAGE=) or is assumed to be a constant derived from the mean recorded value in each corner of the data images. A lookup table of the X-ray background, saved on the file BKGPIX.TBL, is obtained from the first few data images by the following two-pass procedure. To exclude diffraction spots in the data image, the minimum of the five values at (x, y), (x ± dx, y) and (x, y ± dy) is used as a lower background estimate at pixel (x, y) in the first pass. In the second pass, the background is taken as the maximum of the lower estimates at these five locations. Ideally, the parameters SPOT_WIDTH_ALONG_X= 2*dx + 1, SPOT_WIDTH_ALONG_Y= 2*dy + 1 are chosen to match the extent of a spot. The lookup table is obtained by adding the X-ray background from each image. Shaded regions on the detector (i.e. from the beam stop), pixels outside a user-defined circular region (RMAX=) or pixels with an undefined spatial correction value are classified as untrustworthy and marked by −3. The table should be inspected using the VIEW program.
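The two-pass procedure can be sketched in a few lines. The following is an illustrative reimplementation in Python/NumPy, not XDS code (which is written in Fortran77); edge pixels are handled here by simple replication.

```python
import numpy as np

def xray_background(image, dx, dy):
    """Two-pass background estimate in the spirit of INIT (sketch only).

    Pass 1 takes, at each pixel (x, y), the minimum of the five values at
    (x, y), (x +/- dx, y) and (x, y +/- dy); this lower estimate excludes
    diffraction spots.  Pass 2 takes the maximum of these lower estimates
    over the same five locations, restoring the smooth background level.
    """
    h, w = image.shape

    def cross_filter(arr, reduce_fn):
        # pad by edge replication so the cross fits at the borders
        p = np.pad(arr, ((dx, dx), (dy, dy)), mode="edge")
        stack = np.stack([
            p[dx:dx + h, dy:dy + w],   # (x, y)
            p[0:h,       dy:dy + w],   # (x - dx, y)
            p[2 * dx:,   dy:dy + w],   # (x + dx, y)
            p[dx:dx + h, 0:w],         # (x, y - dy)
            p[dx:dx + h, 2 * dy:],     # (x, y + dy)
        ])
        return reduce_fn(stack, axis=0)

    lower = cross_filter(image, np.min)   # first pass
    return cross_filter(lower, np.max)    # second pass
```

On a flat background containing an isolated strong spot, the minimum pass suppresses the spot and the maximum pass restores the background level around it.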

Problems: (a) The addition of background from too many data images may exceed 262 144 at some pixels; because of internal number overflow, these pixels are then removed from the trusted detector region. (b) Some detectors with insufficient protection from electromagnetic pulses may generate badly spoiled images whose inclusion leads to a completely wrong X-ray background table. These images can be identified in INIT.LP by their unexpectedly high mean pixel contents, and this step should be repeated with a different set of images.

COLSPOT locates, at most, 500 000 strong diffraction spots occurring in a subset of the data images and saves their centroids on the file SPOT.XDS. Up to ten ranges of contiguous images (SPOT_RANGE=) may be specified explicitly; otherwise, spots are taken from the first few data images, covering a total rotation range of 5°. Spots are located automatically by comparing each pixel value with the mean value and standard deviation of surrounding pixels, as described in Chapter 11.3[link] . A lower threshold for accepting pixels and a minimum required number of such pixels within a spot can be defined in XDS.INP by the parameters MINIMUM_SIGNAL_TO_NOISE_FOR_LOCATING_SPOTS= and MINIMUM_NUMBER_OF_PIXELS_IN_A_SPOT=, respectively.
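The essence of the spot search can be illustrated with a small Python sketch (again, not the package's Fortran77 code, and using a global rather than a locally estimated background). The parameter names mirror MINIMUM_SIGNAL_TO_NOISE_FOR_LOCATING_SPOTS= and MINIMUM_NUMBER_OF_PIXELS_IN_A_SPOT=.

```python
import numpy as np

def locate_spots(image, bg_mean, bg_sigma, min_snr=3.0, min_pixels=4):
    """Sketch of the COLSPOT idea: classify pixels exceeding the background
    by min_snr standard deviations as strong, group connected strong pixels
    into spots, and return the intensity-weighted centroid of each spot
    containing at least min_pixels pixels."""
    strong = image > bg_mean + min_snr * bg_sigma
    visited = np.zeros_like(strong, dtype=bool)
    spots = []
    h, w = image.shape
    for sx in range(h):
        for sy in range(w):
            if not strong[sx, sy] or visited[sx, sy]:
                continue
            stack, members = [(sx, sy)], []
            visited[sx, sy] = True
            while stack:  # flood fill over 4-connected strong pixels
                x, y = stack.pop()
                members.append((x, y))
                for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)):
                    if 0 <= nx < h and 0 <= ny < w and strong[nx, ny] \
                            and not visited[nx, ny]:
                        visited[nx, ny] = True
                        stack.append((nx, ny))
            if len(members) >= min_pixels:
                weights = np.array([image[m] for m in members], dtype=float)
                coords = np.array(members, dtype=float)
                spots.append(tuple(weights @ coords / weights.sum()))
    return spots
```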

Problem: Sharp edges like ice rings in the images can lead to an excessive number of pixels erroneously classified as contributing to a diffraction spot which extends over many adjacent images, thereby causing a hash-table overflow. The problem can be avoided by specifying non-adjacent images for spot search.

IDXREF uses the initial parameters describing the diffraction experiment as provided by XDS.INP and the observed centroids of the spots occurring in the file SPOT.XDS to find the orientation, metric and symmetry of the crystal lattice, and it refines all or a specified subset of these parameters. On return, the complete set of parameters is saved in the file XPARM.XDS, and the original file SPOT.XDS is replaced by a file of identical name – now with indices attached to each observed spot. Spots not belonging to the crystal lattice are given indices 0, 0, 0. XDS considers the run successful if at least 70% of the given spots can be explained with reasonable accuracy; otherwise, XDS will stop with an error message. Alien spots often arise because of the presence of ice or small satellite crystals, and continuation of data processing may still be meaningful. In this case, XDS is called again with an explicit list of the subsequent steps specified in XDS.INP.

Using and understanding the results reported in IDXREF.LP requires a knowledge of the concepts employed by this step, as described in Chapter 11.3[link] . First, a reciprocal-lattice vector, referring to the unrotated crystal, is computed from each observed spot centroid. Differences between any two reciprocal-lattice vectors that are above a specified minimal length (SEPMIN=) are accumulated in a three-dimensional histogram. These difference vectors will form clusters in the histogram, since there are many different pairs of reciprocal-lattice vectors of nearly identical vector difference. The clusters are found as maxima in the smoothed histogram (CLUSTER_RADIUS=), and a basis of three linearly independent cluster vectors is selected that allows all other cluster vectors to be expressed as nearly integral multiples of small magnitude with respect to this basis. The basis vectors and the 60 most populated clusters with attached indices are listed in IDXREF.LP. If many of the indices deviate significantly from integral values, the program is unable to find a reasonable lattice basis and all further processing will be meaningless.
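The reasoning behind the difference-vector histogram can be checked numerically: for spots belonging to a single lattice, every difference vector is itself a lattice vector and therefore indexes as near-integers with respect to the true basis. The following Python sketch (illustrative only, with a hypothetical basis and noise level) demonstrates this.

```python
import numpy as np

def max_index_deviation(basis, n_spots=100, noise=0.005, seed=0):
    """Simulate noisy reciprocal-lattice spots and return the largest
    deviation from integral indices among all pairwise difference vectors,
    indexed with respect to the (here, known) basis.  Sketch of the
    principle exploited by IDXREF, not XDS code."""
    rng = np.random.default_rng(seed)
    hkl = rng.integers(-5, 6, size=(n_spots, 3))
    spots = hkl @ basis + rng.normal(scale=noise, size=(n_spots, 3))
    diffs = (spots[:, None, :] - spots[None, :, :]).reshape(-1, 3)
    idx = diffs @ np.linalg.inv(basis)   # indices w.r.t. the basis
    return float(np.abs(idx - np.round(idx)).max())
```

With moderate measurement noise all difference vectors index as near-integers; in IDXREF the basis is of course not known in advance but is selected from the cluster vectors so that exactly this property holds.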

If the space group and cell constants are specified, a reduced cell is derived, and the reciprocal-basis vectors found above are reinterpreted accordingly; otherwise, a reduced cell is determined directly from the reciprocal basis. Parameters of the reduced cell, coordinates of the reciprocal-basis vectors and their indices with respect to the reduced cell are reported.

Based on the orientation and metric of the reduced cell now available, IDXREF indexes up to 3000 of the strongest spots by the local-indexing method. This method considers each spot as a node of a tree and identifies the largest subtree of nodes which can be assigned reliable indices. The number of reflections in the ten largest subtrees is reported and usually shows a dominant first tree corresponding to a single lattice, whereas alien spots are found in small subtrees. Reflections in the largest subtree are used for initial refinement of the basis vectors of the reduced cell, the incident beam wave vector and the origin of the detector, which is the point in the detector plane nearest to the crystal. Experience has shown that the detector origin and the direction of the incident beam are often specified with insufficient accuracy, which could easily lead to a misindexing of the reflections by a constant offset. For this reason, IDXREF considers alternative choices for the index origin and reports their likelihood for being correct. Parameters controlling the local indexing are INDEX_ERROR=, INDEX_MAGNITUDE=, INDEX_QUALITY= (corresponding to ε, δ and 1 − ℓmin in Chapter 11.3[link] ) and INDEX_ORIGIN= h0, k0, l0, which is added to the indices of all reflections in the tree. After initial refinement based on the reflections in the largest subtree, all spots that can now be indexed are included. Usually, the detector distance and the direction of the rotation axis are not refined, but if the spots were extracted from images covering a large range of total crystal rotation, better results are obtained by including these parameters in the refinement (REFINE=).

The refined metric parameters of the reduced cell are used for testing each of the 44 possible lattice types, as described in Chapter 11.3[link] . For each lattice type, IDXREF reports the likelihood of being correct, the conventional cell parameters and the linear transformation relating original indices to the new indices with respect to the conventional cell. However, no automatic decisions for space-group assignment are made by XDS. If the space group and cell constants are provided by the user, the reduced-cell vectors are reinterpreted accordingly; otherwise, data processing continues with the crystal being described by its reduced-cell basis vectors and triclinic symmetry. On completion, when integrated intensities are available, the user chooses any plausible space group according to the rated list of the 44 possible lattice types and repeats only the CORRECT and GLOREF steps with the appropriate conventional cell parameters and reindexing transformation (see below).
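As an illustration of such a linear index transformation (the matrix here is invented; the actual REIDX= input convention should be taken from the XDS documentation and from the table in IDXREF.LP), new indices are obtained from old ones by an integral matrix, e.g. one exchanging the a and b axes while inverting c:

```python
import numpy as np

# hypothetical reindexing matrix: swap h and k, negate l
T = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, -1]])
hkl_old = np.array([2, -1, 3])
hkl_new = T @ hkl_old
```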

Problems: (a) Indices of many difference-vector clusters deviate significantly from integral values. This can be caused by incorrect input parameters, such as rotation axis, oscillation angle or detector position, by a large fraction of alien spots in SPOT.XDS, by placing the detector too close to the crystal, or by inappropriate choice of parameters SEPMIN= and CLUSTER_RADIUS= in densely populated images. (b) Indexing and refinement is unsatisfactory despite well indexed difference-vector clusters. This is probably caused by selection of an incorrect index origin, and IDXREF should be rerun with plausible alternatives for INDEX_ORIGIN= after a visual check of a data image with the VIEW program.

COLPROF extracts the three-dimensional profile of each reflection predicted to occur in the rotation images within the trusted region of the detector surface and saves all profiles on the file XREC.XDS. A scaling factor is determined for each image, derived by comparing its background region (after subtraction of the detector noise) with the current X-ray background table. This table, initially obtained from the file BKGPIX.TBL, is updated by the background from each data image at a rate defined by the input parameter BFRAC=. For visual control, the contents of the updated X-ray background table are saved on the file BKGPIX.IMG at the end of this program step. Information for predicting reflection positions is initially provided by the file XPARM.XDS. These parameters are either kept constant or refined periodically using centroids of the most recently found strong diffraction spots as data reduction proceeds (REFINE=, NUMBER_OF_FRAMES_BETWEEN_REFINEMENT_IN_COLPROF=, NUMBER_OF_REFLECTIONS_USED_FOR_REFINEMENT_IN_COLPROF=, WEAK=).

In order to include all pixels contributing to the intensity of a spot, approximate values describing their extension and form must be specified, as defined in Chapter 11.3[link] by the parameters δM, σM, δD, σD. The value for BEAM_DIVERGENCE= δD = arctan(spot diameter/detector distance) is found by measuring the diameter of a strong spot in a data image displayed by the VIEW program and should include a few adjacent background pixels. The form of a spot is roughly described as a Gaussian and its standard deviation is specified by the parameter BEAM_DIVERGENCE_E.S.D.= σD, which is usually about one-sixth to one-tenth of δD. Similarly, REFLECTING_RANGE= δM is the approximate rotation angle required for a strong spot recorded perpendicular to the rotation axis to pass completely through the Ewald sphere. The standard deviation of the intensity distribution is given by the mosaicity REFLECTING_RANGE_E.S.D.= σM. Thus, a three-dimensional domain of pixels belonging to each reflection is defined by the above parameters, and the program automatically removes pixels contaminated by neighbouring reflections. It determines and subtracts the background, corrects for spatial distortions, and maps each pixel content into a reflection-specific coordinate system centred on the Ewald sphere (see Chapter 11.3[link] ). The form of these profiles is then similar for all reflections, and their mean obtained from superimposition of strong reflections is reported at regular intervals. On return from this step, the data image last processed with all expected spots encircled is saved in the file FRAME.pck for inspection using the VIEW program.
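With invented numbers, the recipe above gives, for example:

```python
import math

# hypothetical measurements taken from a data image with VIEW
spot_diameter_mm = 1.2        # strong spot plus a few background pixels
detector_distance_mm = 150.0

# BEAM_DIVERGENCE= (delta_D), in degrees
delta_D = math.degrees(math.atan(spot_diameter_mm / detector_distance_mm))

# BEAM_DIVERGENCE_E.S.D.= (sigma_D): about one-sixth to one-tenth of delta_D
sigma_D = delta_D / 8.0
```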

Problems: (a) Off-centred profiles indicate that reflection positions are incorrectly predicted by the parameters provided by the file XPARM.XDS (e.g. misindexing due to a wrong origin of the indices), that the crystal has slipped, or that the incident beam direction has changed. (b) Profiles extending to the borders of the box indicate too-small values for BEAM_DIVERGENCE= or REFLECTING_RANGE=. This leads to incorrect integrated intensities because of truncated reflection profiles and unreliable background determination. (c) The display of the file FRAME.pck shows spots that are not encircled. If these unexpected reflections are not close to the spindle and are not ice reflections, it is likely that the parameters provided by the file XPARM.XDS are wrong.

PROFIT estimates intensities from the three-dimensional profiles of the reflections stored in the input file XREC.XDS and saves the results in the file PROFIT.HKL. In the first pass, templates are generated by superimposing profiles of fully recorded strong reflections, and all grid points with a value above a minimum percentage of the maximum in the template (CUT=) are defined as elements of the integration domain. To allow for variations of their shape, profile templates are generated from reflections located at nine regions of equal size covering the detector surface and additional sets of nine to cover equally sized batches of images. Standard deviations, REFLECTING_RANGE_E.S.D.= and BEAM_DIVERGENCE_E.S.D.=, observed for each profile template are reported and – in the case of large discrepancies – could be used for rerunning COLPROF with better values for these parameters. In the second pass, intensities and their standard deviations are estimated by fitting the reflection profile to its template, as described in Chapter 11.3[link] . Overloaded (OVERLOAD=) or incomplete reflections covering less than a minimum percentage of the template volume (MINPK=) or reflections with unreliable background are excluded from further processing.
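The second-pass fit can be illustrated in one dimension: for a normalized template profile p and a background-subtracted observed profile c, the least-squares estimate minimizing Σ(cᵢ − I·pᵢ)² is I = Σpᵢcᵢ/Σpᵢ². This is the standard profile-fitting result, shown here as a sketch rather than the actual XDS formulation (which, among other things, also involves the estimated standard deviations).

```python
import numpy as np

template = np.array([0.05, 0.20, 0.50, 0.20, 0.05])  # learned profile, sums to 1
true_intensity = 400.0
rng = np.random.default_rng(1)
observed = true_intensity * template + rng.normal(scale=1.0, size=5)

# least-squares scale of the template fitted to the observation
I_fit = float(template @ observed / (template @ template))
```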

Problem: The program stops because there are no strong spots for learning profile templates. It is likely that the parameters REFLECTING_RANGE=, BEAM_DIVERGENCE= etc., which define the box dimensions, have been incorrectly chosen. After correction, both the COLPROF and PROFIT steps should be repeated.

CORRECT applies Lorentz and polarization correction factors as well as factors that partially compensate for radiation damage and absorption effects to intensities and standard deviations of all reflections found in the file PROFIT.HKL, and saves the results on the files XDS.HKL and either NORMAL.HKL or ANOMAL.HKL (if Friedel's law is broken, as specified by a positive value for the input parameter DELFRM=). These factors are determined from many symmetry-equivalent reflections usually found in the data images such that their integrated intensities become as similar as possible. The residual scatter of these intensities is a more realistic measure of their errors and is used to determine a correction factor for the standard deviations previously estimated from profile fitting. An initial guess for this factor (WFAC1=) is provided in XDS.INP and is used to identify outliers, which are collected in the file MISFITS for separate analysis.

Data quality as a function of resolution is described by the agreement of the intensities of symmetry-related reflections and quantified by the R factors Rsym and the more robust indicator Rmeas (Diederichs & Karplus, 1997[link]). These R factors as well as the intensities of all reflections with indices of type h00, 0k0 and 00l and those expected to be systematically absent are important indicators for identification of the correct space group. Clearly, large R factors or many rejected reflections (MISFITS) or large observed intensities for systematically absent reflections suggest that the assumed space group or the indexing is incorrect. It is easy to test other possible space groups (SPACE_GROUP_NUMBER=) by simply repeating the CORRECT and GLOREF steps after copying the appropriate reindexing transformation (REIDX=) and conventional cell constants (UNIT_CELL_CONSTANTS=) found in the rated table of the 44 possible lattice types in IDXREF.LP to XDS.INP. One should remember, however, that the final choice to be kept should be run last, as XDS overwrites earlier versions of the output files.
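The two indicators can be written down directly. The following sketch (not XDS code) merges a list of (hkl, intensity) observations; Rmeas applies the multiplicity-dependent factor [n/(n − 1)]^1/2 of Diederichs & Karplus (1997) to each group of symmetry-equivalent measurements.

```python
import math
from collections import defaultdict

def r_sym_and_r_meas(observations):
    """Merging R factors from (hkl, intensity) observations (sketch only).

    Rsym  = sum_h sum_i |I_hi - <I_h>| / sum_h sum_i I_hi
    Rmeas = sum_h [n_h/(n_h - 1)]**0.5 sum_i |I_hi - <I_h>| / sum_h sum_i I_hi
    """
    groups = defaultdict(list)
    for hkl, intensity in observations:
        groups[hkl].append(intensity)
    num_sym = num_meas = den = 0.0
    for ints in groups.values():
        n = len(ints)
        if n < 2:
            continue  # multiplicity-1 reflections carry no merging information
        mean = sum(ints) / n
        dev = sum(abs(i - mean) for i in ints)
        num_sym += dev
        num_meas += math.sqrt(n / (n - 1)) * dev
        den += sum(ints)
    return num_sym / den, num_meas / den
```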

Another useful feature is the possibility of comparing the new data with those from a previously measured crystal (REFERENCE_DATA_SET= file name). For some space groups, like P42, possessing an ambiguity in the choice of axes, comparison with the reference data set allows one to identify the consistent solution from the complete set of alternatives already listed in IDXREF.LP together with their required index transformation. Reference data are also quite useful for recognizing misindexing or for testing potential heavy-atom derivatives.

Problems: (a) Incomplete data sets may lead to wrong conclusions about the space group, as some of its symmetry operators might not be involved in the R-factor calculations. (b) Conventional cell parameters, as listed in IDXREF.LP, often violate constraints imposed by the space group and must be edited accordingly after copying to XDS.INP.

GLOREF refines the diffraction parameters by using the observed positions of all strong spots contained in the file PROFIT.HKL. It reports the root-mean-square error between calculated and observed positions along with the refined unit-cell constants. Again, for testing possible space groups, the crystallographer consults the table printed by the IDXREF step and selects the appropriate reindexing transformation and starting values for the conventional cell constants. The refined diffraction parameters (after possible reindexing) are saved on the file GXPARM.XDS, which is identical in format to XPARM.XDS. Replacing XPARM.XDS with the new file offers a convenient way for repeating COLPROF, now with a better set of parameters.

Problem: GLOREF will fail if the crystal slips during data collection.

25.2.9.2.2. XPLAN

XPLAN supports the planning of data collection. It is based upon information provided by XPLAN.INP and the input files XPARM.XDS and BKGPIX.TBL, both of which become available on processing a few test images with XDS. XPLAN estimates the completeness of new reflection data, expected to be collected for each given starting angle and total crystal rotation, and reports the results for a number of selected resolution shells in the file XPLAN.LP. To minimize recollection of data, the name of a file containing already measured reflections can be specified in XPLAN.INP.

Problems: (a) Incorrect results may occur for some space groups, e.g. P42, if the unit cell determined by XDS from processing a few test images implies reflection indices inconsistent with those from the already-measured data. The correct cell choice can be found, however, by using the old data as a reference and repeating CORRECT and GLOREF with the appropriate reindexing transformation, followed by copying GXPARM.XDS to XPARM.XDS. The same applies if IDXREF was run for an unknown space group and the data were then reindexed in CORRECT and GLOREF. (b) XPLAN ignores potential reflection overlap due to the finite oscillation range covered by each image.

25.2.9.2.3. XSCALE

XSCALE accepts one or more files of type XDS.HKL, NORMAL.HKL or ANOMAL.HKL, as obtained from data processing with XDS or merged output files from DENZO (see Chapter 11.4[link] ), determines scaling factors for the reflection intensities, merges symmetry-equivalent observations into a unique set, and reports data completeness and quality in the file XSCALE.LP. The desired program action is specified in the file XSCALE.INP. It consists of a definition of shells used for analysing the resolution dependency of data quality and completeness, space-group number and cell constants, and one line for each set of input reflections. Each set has a file name, type identifier, a resolution window for accepting data, a weighting factor for the standard deviation of the reflection intensity, a decision constant for accepting Bijvoet pairs, a number controlling the degree of smoothness of the scaling function and an optional file name for specifying the output file.

Problem: All reflections beyond the highest shell specified for analysing the resolution dependency of data quality and completeness will be ignored regardless of the resolution window given for each data set.

25.2.9.2.4. VIEW

VIEW is used for visualizing data images as well as control images produced by XDS. It responds to navigation commands entered by movements of a mouse, and reports the corresponding image coordinates and their pixel contents upon activation of the mouse buttons. VIEW also allows magnification of selected image portions and changes in colour.

Problem: For many detector image formats, as well as for control images produced by XDS, the true pixel value is stored in a coded form that VIEW interprets as a signed integer. Numbers less than −4095 displayed by VIEW correspond to large positive pixel values.

25.2.9.2.5. XDSCONV

XDSCONV accepts reflection-intensity data files as produced by XSCALE or CORRECT and converts them into a format required by software packages for structure determination. XDSCONV estimates structure-factor moduli based on the assumption that the intensity data set obeys Wilson's distribution and uses a Bayesian approach to statistical inference as described by French & Wilson (1978)[link]. For anomalous intensity data, both structure-factor amplitudes F(hkl) and F(−h, −k, −l) are simultaneously estimated from the Bijvoet intensity pair by a method similar to that described by Lewis & Rees (1983)[link], which accounts for the correlation between I(hkl) and I(−h, −k, −l). The output file generated may inherit test reflections previously used for calculating a free R factor (Brünger, 1992b[link]) or may contain new test reflections selected by XDSCONV.

25.2.9.3. Remarks

XDS is not an interactive program. It communicates through the input file XDS.INP and, during the run, accepts only a change in the specification of the last image to be included in the data set (DATA_RANGE=) – a useful option when processing overlaps with data collection. To prevent the program from overtaking the measurements, a maximum delay should be set (MINUTE=) slightly longer than the time needed to generate the next image.

Experience has shown that the most frequent obstacle in using the package is the indexing and accurate prediction of the reflections occurring in the images. Usually, the problems arise from incorrect specifications of the rotation axis, beam direction, detector position and orientation, oscillation range, or wavelength. The occurrence of gross errors can be reduced by using file templates of XDS.INP specifically tailored to the actual experimental set-up, which then require only small adjustments to the geometrical parameters.

However, even small errors in the specification of the incident beam direction or the detector position may lead to indices which are all offset by one reciprocal-lattice point, particularly if the initial list of diffraction spots was obtained from a few images covering a small range of crystal rotation. For this reason, IDXREF tests a few alternatives for the index origin and reports its results, such as the expected coordinates of the incident beam in the image, which can be checked by looking at a data image with the VIEW program. The user may then repeat the IDXREF step, thereby forcing the program to use a plausible alternative for the index origin.

It is recommended that all program steps be run on a few images to establish whether the indexing is correct and also to find reasonable values describing crystal mosaicity and spot size. Incorrect indexing may be apparent from large values of the symmetry R factors or from comparison with a reference data set, both reported in the CORRECT step. Also, viewing the file FRAME.pck with the VIEW program should show the last data image processed with most of the observed diffraction spots circled. More accurate estimates of the parameters describing spot dimensions are reported in PROFIT.LP and should be used to update these values in XDS.INP before starting data processing for all images.

Refinement of the parameters controlling the predicted spot positions is carried out in the IDXREF, COLPROF and GLOREF steps, which allows the user to adopt a variety of strategies. If all data images are available, spots should be extracted by COLSPOT from images equally distributed through the data set. If IDXREF is able to explain most of the spots, the refined parameters will be sufficiently accurate for the complete data processing, and refinements in the COLPROF step are unnecessary. In other cases, if processing overlaps with data collection or the first strategy was unsuccessful, IDXREF is based on spots extracted from the first few images and provides an initial parameter set that is periodically refined during COLPROF. This allows correction for slow crystal slippage or minor changes in the incident beam direction. Finally, if refinement in GLOREF was successful, the new values may be used to repeat COLPROF (without parameter refinements) and the subsequent steps.

25.2.10. Macromolecular applications of SHELX

G. M. Sheldrick

25.2.10.1. Historical introduction to SHELX

The first version of SHELX was written around 1970 for the solution and refinement of small-molecule and inorganic structures. Since then, it has become widely distributed and is used at some stage in well over half of current crystal structure determinations. Because small-molecule direct methods and Patterson interpretation algorithms can be used to locate a small number of heavy atoms or anomalous scatterers, the structure-solving program SHELXS has been used by macromolecular crystallographers for a number of years. More recently, improvements in cryocrystallography, area detectors and synchrotron data collection have led to a rapid increase in the number of high-resolution (<2 Å) macromolecular data sets. The enormous increase in available computer power makes it feasible to refine these structures using algorithms incorporated in SHELXL that were originally designed for small molecules. These algorithms are generally slower but make fewer approximations [e.g. conventional structure-factor summation rather than the fast Fourier transform (FFT)] and include features – such as anisotropic refinement, modelling of complicated disorder and twinning, and estimation of standard uncertainties by inverting the normal matrix – that are routine in small-molecule crystallography but, for reasons of efficiency, are not widely implemented in programs written for macromolecular structure refinement. This account is restricted to features of SHELX of potential interest to macromolecular crystallographers.
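The difference between conventional summation and FFT-based evaluation can be made concrete with a toy direct summation (hypothetical point atoms; real scattering factors, symmetry and displacement parameters are omitted): each structure factor is accumulated atom by atom, which is slower than an FFT but involves no gridding approximation.

```python
import cmath

def structure_factor(hkl, atoms):
    """Direct structure-factor summation F(h) = sum_j f_j exp(2*pi*i h.x_j).

    atoms: list of (f_j, (x, y, z)) with fractional coordinates.
    No symmetry, displacement parameters or dispersion -- a bare sketch.
    """
    h, k, l = hkl
    F = 0j
    for f, (x, y, z) in atoms:
        F += f * cmath.exp(2j * cmath.pi * (h * x + k * y + l * z))
    return F

# Two equal point atoms related by a translation of 1/2 along x:
atoms = [(1.0, (0.0, 0.0, 0.0)), (1.0, (0.5, 0.0, 0.0))]
F_100 = structure_factor((1, 0, 0), atoms)  # destructive interference, ~0
F_200 = structure_factor((2, 0, 0), atoms)  # constructive interference, ~2
```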

25.2.10.2. Program organization and philosophy

SHELX is written in a simple subset of Fortran77 that has proved to be extremely portable. The programs SHELXS (structure solution) and SHELXL (refinement) both require only two input files: a reflection file (name.hkl) and a file (name.ins) that contains crystal data, atoms (if any) and instructions in the form of keywords followed by free-format numbers etc. These programs write a file, name.res, that can be renamed or edited to name.ins for the next refinement and can output details of the calculations to name.lst. Although originally designed for punched cards, this arrangement is still quite convenient and has retained upwards compatibility for the last 30 years. The common first part of the filename is read from the command line by typing, e.g., `SHELXL name'. The programs are executed independently without the use of any hidden files, environment variables etc.

The programs are general for all space groups in conventional settings or otherwise and make extensive use of default settings to keep user input and confusion to a minimum. Particular care has been taken to test the programs thoroughly on as many computer systems and crystallographic problems as possible before they were released, a process that often required several years!

25.2.10.3. Heavy-atom location using SHELXS and SHELXD

One might expect that a small-molecule direct-methods program such as SHELXS (Sheldrick, 1990), which routinely solves structures with 20–100 unique atoms in a few minutes or even seconds of computer time, would have no difficulty in locating a handful of heavy-atom sites from isomorphous or anomalous ΔF data. However, such data can be very noisy, and a single seriously aberrant reflection can invalidate a large number of probabilistic phase relations. The most important direct-methods formula is still the tangent formula of Karle & Hauptman (1956). Most modern direct-methods programs (e.g. Busetta et al., 1980; Debaerdemaeker et al., 1985; Sheldrick, 1990) use versions of the tangent formula that have been modified to incorporate information from weak as well as strong reflections, which helps to avoid pseudo-solutions with translationally displaced molecules or a single dominant peak (the so-called uranium-atom solution). Isomorphous and anomalous ΔF values are lower limits on the structure factors of the heavy-atom substructure and so do not give reliable estimates of weak reflections; the improvements that weak reflections brought to direct methods are therefore largely irrelevant when these methods are applied to ΔF data. This does not apply when FA values are derived from a MAD experiment, since these are true estimates of the heavy-atom structure factors; however, aberrant large and small FA estimates are difficult to avoid and often upset the phase-determination process. A further problem in applying direct methods to ΔF data is that it is not always clear what the effective number of atoms in the cell should be in the probability formulae, especially when the number of heavy-atom sites is not known in advance.
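In terms of normalized structure factors E and phases φ, the tangent formula referred to above takes the standard textbook form (not a SHELXS-specific variant):

```latex
\tan\varphi_{\mathbf h} \simeq
  \frac{\sum_{\mathbf k} |E_{\mathbf k}\,E_{\mathbf h-\mathbf k}|
        \sin(\varphi_{\mathbf k} + \varphi_{\mathbf h-\mathbf k})}
       {\sum_{\mathbf k} |E_{\mathbf k}\,E_{\mathbf h-\mathbf k}|
        \cos(\varphi_{\mathbf k} + \varphi_{\mathbf h-\mathbf k})}
```

The sums run over pairs of reflections whose indices add to h, which is why a single aberrant large E (or ΔF treated as an E) can corrupt many phase estimates at once.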

25.2.10.3.1. The Patterson map interpretation algorithm in SHELXS

Space-group-general automatic Patterson map interpretation was introduced in the program SHELXS86 (Sheldrick, 1985); completely different algorithms are employed in the current version of SHELXS, based on the Patterson superposition minimum function (Buerger, 1959, 1964; Richardson & Jacobson, 1987; Sheldrick, 1991, 1998a; Sheldrick et al., 1993). The algorithm used in SHELXS is as follows:

  • (1) A single Patterson peak, v, is selected automatically (or input by the user) and used as a superposition vector. A sharpened Patterson map [with coefficients (E^3 F)^{1/2} instead of F^2, where E is a normalized structure factor] is calculated twice, once with the origin shifted to −v/2 and once with the origin shifted to +v/2. At each grid point, the minimum of the two Patterson function values is stored, and this superposition minimum function is searched for peaks. If a true single-weight heavy atom-to-heavy atom vector has been chosen as the superposition vector, this function should ideally consist of one image of the heavy-atom structure and one inverted image, with two atoms (the ones corresponding to the superposition vector) in common. There are thus about 2N peaks in the map, compared with N^2 in the original Patterson map, a considerable simplification. The only symmetry element of the superposition function is the inversion centre at the origin relating the two images.

  • (2) Possible origin shifts are found so that the full space-group symmetry is obeyed by one of the two images, i.e., for about half the peaks, most of the symmetry equivalents are present in the map. This enables the peaks belonging to the other image to be eliminated and, in principle, solves the heavy-atom substructure. In the space group P1, the double image cannot be resolved in this way.

  • (3) For each plausible origin shift, the potential atoms are displayed as a triangular table that gives the minimum distance and the Patterson superposition minimum function value for all vectors linking each pair of atoms, taking all symmetry equivalents into account. This table enables spurious atoms to be eliminated and occupancies to be estimated, and also in some cases reveals the presence of noncrystallographic symmetry.

  • (4) The whole procedure is then repeated for further superposition vectors as required. The program gives preference to general vectors (multiple vectors will lead to multiple images), and it is advisable to specify a minimum distance of (say) 8 Å for the superposition vector (3.5 Å for selenomethionine MAD data) to increase the chance of finding a true heavy atom-to-heavy atom vector.
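Step (1) can be sketched in one dimension (a toy map in pure Python, not the SHELXS implementation): the Patterson function is sampled on a grid over one cell, the two origin-shifted copies are generated by periodic (unit-cell) wrapping, and the pointwise minimum is taken.

```python
def superposition_minimum(patterson, v):
    """1-D Patterson superposition minimum function.

    patterson: list of map values on an N-point grid over one cell.
    v: superposition vector in grid units (even here, so that +/- v/2
       falls on grid points).
    Returns min(P(x - v/2), P(x + v/2)) with periodic wrapping.
    """
    N = len(patterson)
    half = v // 2
    return [min(patterson[(i - half) % N], patterson[(i + half) % N])
            for i in range(N)]

# A toy map with peaks at grid points 2 and 6; with superposition vector
# v = 4 (the 2 -> 6 vector), peaks survive only where the two shifted
# copies of the map overlap -- here at grid points 0 and 4.
P = [0, 0, 5, 0, 0, 0, 5, 0]
M = superposition_minimum(P, 4)
```

The surviving peaks in M are far fewer than the pairwise vectors in P, which is the simplification from N^2 Patterson peaks to about 2N image peaks described in step (1).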

25.2.10.3.2. Integrated Patterson and direct methods: SHELXD

The program SHELXD (Sheldrick & Gould, 1995; Sheldrick, 1997, 1998b) is now part of the SHELX system. It is designed both for the ab initio solution of macromolecular structures from atomic resolution native data alone and for the location of heavy-atom sites from ΔF or FA values at much lower resolution, in particular for the location of larger numbers of anomalous scatterers from MAD data. The dual-space approach of SHELXD was inspired by the Shake and Bake philosophy of Miller et al. (1993, 1994) but differs in many details, in particular in the extensive use it makes of the Patterson function, which proves very effective in applications involving ΔF or FA data. The ab initio applications of SHELXD have been described in Chapter 16.1, so only the location of heavy atoms will be described here. An advantage of the Patterson function is that it provides a good noise filter for the ΔF or FA data: negative regions of the Patterson function can simply be ignored. On the other hand, the direct-methods approach is efficient at handling a large number of sites, whereas the number of Patterson peaks to analyse increases with the square of the number of atoms. Thus, for reasons of efficiency, the Patterson function is employed at two stages in SHELXD: at the beginning, to obtain starting atom positions (otherwise random starting atoms would be employed), and at the end, in the form of the triangular table described above, to recognize which atoms are correct. In between, several cycles of real/reciprocal-space alternation are employed as in ab initio structure solution, alternating between tangent refinement, E-map calculation and peak search, and possibly random omit maps, in which a specified fraction of the potential atoms are left out at random.

25.2.10.3.3. Practical considerations

Since the input files for the direct and Patterson methods in SHELXS and the integrated method in SHELXD are almost identical (usually only one instruction needs to be changed), it is easy to try all three methods for difficult problems. The Patterson map interpretation in SHELXS is a good choice if the heavy atoms have variable occupancies and it is not known how many heavy-atom sites need to be found; the direct-methods approaches work best with equal atoms. In general, the conventional direct methods in SHELXS will tend to perform best in a non-polar space group that does not possess special positions; however, for more than about a dozen sites, only the integrated approach in SHELXD is likely to prove effective; the SHELXD algorithm works best when the number of sites is known. Especially for the MAD method, the quality of the data is decisive; it is essential to collect data with a high redundancy to optimize the signal-to-noise ratio and eliminate outliers. In general, a resolution of 3.5 Å is adequate for the location of heavy-atom sites. At the time of writing, SHELXD does not include facilities for the further calculations necessary to obtain maps. Experience indicates that it is only necessary to refine the B values of the heavy atoms using other programs; their coordinates are already rather precise.

Excellent accounts of the theory of direct and Patterson methods, with extensive literature references, have been presented in IT B Chapter 2.2 by Giacovazzo (2001) and Chapter 2.3 by Rossmann & Arnold (2001).

25.2.10.4. Macromolecular refinement using SHELXL

SHELXL is a very general refinement program that is equally suitable for the refinement of minerals, organometallic structures, oligonucleotides or proteins (or any mixture thereof) against X-ray or neutron single- (or twinned!) crystal data. It has even been used with diffraction data from powders, fibres and two-dimensional crystals. For refinement against Laue data, it is possible to specify a different wavelength, and hence different dispersion terms, for each reflection. The price of this generality is that SHELXL is somewhat slower than programs written specifically for protein structure refinement. Any protein- (or DNA-)specific information must be supplied to SHELXL by the user in the form of refinement restraints etc. Refinement of macromolecules using SHELXL has been discussed by Sheldrick & Schneider (1997).

25.2.10.4.1. Constraints and restraints

In refining macromolecular structures, it is almost always necessary to supplement the diffraction data with chemical information in the form of restraints. A typical restraint is the condition that a bond length should approximate to a target value with a given estimated standard deviation; restraints are treated as extra experimental data items. Even if the crystal diffracts to 1.0 Å, there may well be poorly defined disordered regions for which restraints are essential to obtain a chemically sensible model (the same can be true of small molecules too!). SHELXL is generally not suitable for refinements at resolutions lower than about 2.5 Å because it cannot handle general potential-energy functions, e.g. for torsion angles or hydrogen bonds; if noncrystallographic symmetry restraints can be employed, this limit can be relaxed a little.

For some purposes (e.g. riding hydrogen atoms, rigid-group refinement, or occupancies of atoms in disordered side chains), constraints, exact conditions that lead to a reduction in the number of variable parameters, may be more appropriate than restraints; SHELXL allows such constraints and restraints to be mixed freely. Riding hydrogen atoms are defined such that the C—H vector remains constant in magnitude and direction, but the carbon atom is free to move; the same shifts are applied to both atoms, and both atoms contribute to the least-squares derivative sums. This model may be combined with anti-bumping restraints that involve hydrogen atoms, which helps to avoid unfavourable side-chain conformations. SHELXL also provides, e.g., methyl groups that can rotate about their local threefold axes; the initial torsion angle may be found using a difference-electron-density synthesis calculated around the circle of possible hydrogen-atom positions.

25.2.10.4.2. Least-squares refinement algebra

The original SHELX refinement algorithms were modelled closely on those described by Cruickshank (1970). For macromolecular refinement, an alternative to (blocked) full-matrix refinement is provided by the conjugate-gradient solution of the least-squares normal equations as described by Hendrickson & Konnert (1980), including preconditioning of the normal matrix, which enables positional and displacement parameters to be refined in the same cycle. The structure-factor derivatives contribute only to the diagonal elements of the normal matrix, but all restraints contribute fully to both the diagonal and off-diagonal elements, although neither the Jacobian nor the normal matrix itself is ever generated by SHELXL. The parameter shifts are modified by comparison with those in the previous cycle to accelerate convergence whilst reducing oscillations: a larger shift is applied to a parameter when the current shift is similar to the previous shift, and a smaller shift is applied when the current and previous shifts have opposite signs.
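The conjugate-gradient approach can be sketched on a tiny dense problem (illustrative only; SHELXL never forms the normal matrix explicitly but accumulates the required products): the normal equations A x = b are solved with Jacobi (diagonal) preconditioning, the diagonal terms playing the role of the retained diagonal of the normal matrix.

```python
def pcg(A, b, tol=1e-10, max_iter=100):
    """Jacobi-preconditioned conjugate gradients for A x = b.

    A: symmetric positive-definite matrix given as a list of rows.
    The preconditioner is the inverse of diag(A), the analogue of the
    diagonal normal-matrix terms used in Konnert-Hendrickson refinement.
    """
    n = len(b)
    x = [0.0] * n
    r = b[:]                                  # residual b - A x for x = 0
    Minv = [1.0 / A[i][i] for i in range(n)]  # Jacobi preconditioner
    z = [Minv[i] * r[i] for i in range(n)]
    p = z[:]
    rz = sum(r[i] * z[i] for i in range(n))
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if sum(ri * ri for ri in r) < tol:
            break
        z = [Minv[i] * r[i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = pcg(A, b)   # exact solution of this 2x2 system: [1/11, 7/11]
```

For refinement, the matrix-vector product Ap is what matters: it can be accumulated from derivatives and restraint contributions on the fly, so the full matrix need never be stored.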

SHELXL refines against F² rather than F, which enables all data to be used in the refinement with weights that include contributions from the experimental uncertainties, rather than having to reject F values below a preset threshold; a choice of appropriate weighting schemes is provided. Provided that reasonable estimates of σ(F²) are available, this allows more experimental information to be employed in the refinement; it also allows refinement against data from twinned crystals.

25.2.10.4.3. Full-matrix estimates of standard uncertainties

Inversion of the full normal matrix (or of large matrix blocks, e.g. for all positional parameters) enables the precision of individual parameters to be estimated (Rollett, 1970), either with or without the inclusion of the restraints in the matrix. The standard uncertainties in dependent quantities (e.g. torsion angles or distances from mean planes) are calculated in SHELXL using the full least-squares correlation matrix. These standard uncertainties reflect the data-to-parameter ratio, i.e. the resolution and completeness of the data and the percentage of solvent, and the quality of the agreement between the observed and calculated F² values (and the agreement of restrained quantities with their target values when restraints are included).

Full-matrix refinement is also useful when domains are refined as rigid groups in the early stages of refinement (e.g. after structure solution by molecular replacement), since the total number of parameters is small and the correlation between parameters may be large.

25.2.10.4.4. Refinement of anisotropic displacement parameters

The motion of macromolecules is clearly anisotropic, but the data-to-parameter ratio rarely permits the refinement of the six independent anisotropic displacement parameters (ADPs) per atom; even for small molecules and data to atomic resolution, the anisotropic refinement of disordered regions requires the use of restraints. SHELXL employs three types of ADP restraint (Sheldrick, 1993; Sheldrick & Schneider, 1997). The rigid-bond restraint, first suggested by Rollett (1970), assumes that the components of the ADPs of two atoms connected via one (or two) chemical bonds are equal along the bond direction within a specified standard deviation. This has been shown to hold accurately (Hirshfeld, 1976; Trueblood & Dunitz, 1983) for precise structures of small molecules, so it can be applied as a `hard' restraint with a small estimated standard deviation. The similar-ADP restraint assumes that atoms that are spatially close (but not necessarily bonded, because they may be different components of a disordered group) have similar Uij components. An approximately isotropic restraint is useful for isolated solvent molecules. These last two restraints are only approximate and so should be applied with low weights, i.e. high estimated standard deviations.
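The rigid-bond criterion can be checked numerically (toy Uij values, all hypothetical): the mean-square displacement of each atom along the bond direction, z·U·z for a unit bond vector z, should agree for the two bonded atoms, even though the components perpendicular to the bond may differ because of libration.

```python
import math

def msd_along(U, z):
    """Mean-square displacement z^T U z along the unit vector z.

    U: symmetric 3x3 ADP tensor
       [[U11, U12, U13], [U12, U22, U23], [U13, U23, U33]].
    """
    return sum(z[i] * U[i][j] * z[j] for i in range(3) for j in range(3))

def unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

# Hypothetical ADPs for two bonded atoms; the bond runs along x.
U1 = [[0.020, 0.002, 0.000],
      [0.002, 0.030, 0.001],
      [0.000, 0.001, 0.025]]
U2 = [[0.021, 0.002, 0.000],
      [0.002, 0.045, 0.001],
      [0.000, 0.001, 0.040]]
z = unit([1.0, 0.0, 0.0])

delta = abs(msd_along(U1, z) - msd_along(U2, z))  # rigid-bond difference
```

Here delta is small (0.001 Å²) although the transverse Uij components differ considerably, so this hypothetical pair would satisfy a hard rigid-bond restraint.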

The transition from isotropic to anisotropic roughly doubles the number of parameters and almost always results in an appreciable reduction in the R factor. However, this represents an improvement in the model only when it is accompanied by a significant reduction in the free R factor (Brünger, 1992b). Since the free R factor is itself subject to uncertainty because of the small sample used, a drop of at least 1% is needed to justify anisotropic refinement. There should also be a reduction in the goodness of fit, and the resulting displacement ellipsoids should make chemical sense and not be `non-positive-definite'!

25.2.10.4.5. Similar geometry and NCS restraints

When there are several identical chemical moieties in the asymmetric unit, a very effective restraint is to assume that the chemically equivalent 1,2 and 1,3 distances are the same, but unknown. This technique is easy to apply using SHELXL and is often employed for small-molecule structures and, in particular, for oligosaccharides. Similarly, the terminal P—O bond lengths in DNA structures can be assumed to be the same (but without a target value), i.e. it is assumed that the whole crystal is at the same pH. For proteins, the method is less suitable because of the different abundance of the different amino acids, and, in any case, good target distances are available (Engh & Huber, 1991).

Local noncrystallographic symmetry (NCS) restraints (Usón et al., 1999) may be applied to restrain corresponding 1,4 distances and isotropic displacement parameters to be the same when there are several identical macromolecular domains in the asymmetric unit; usually, the 1,2 and 1,3 distances are restrained to standard values in such cases and so do not require NCS restraints. Such local NCS restraints are more flexible than global NCS constraints and – unlike the latter – do not require the specification of a transformation matrix and mask.

25.2.10.4.6. Modelling disorder and solvent

There are many ways of modelling disorder using SHELXL, but for macromolecules the most convenient is to retain the same atom and residue names for the two or more components and assign a different `part number' (analogous to the PDB alternative site flag) to each component. With this technique, no change is required to the input restraints etc. Atoms in the same component will normally have a common occupancy that is assigned to a `free variable'. If there are only two components, the sum of their occupancies can be constrained to be unity; if there are more than two components, the sum of their free variables may be restrained to be unity. Since any linear restraint may be applied to the free variables, they are very flexible, e.g. for modelling complicated disorder. By restraining distances to be equal to a free variable, a standard deviation of the mean distance may be calculated rigorously using full-matrix least-squares algebra.

Babinet's principle is used to define a bulk-solvent model with two refinable parameters (Moews & Kretsinger, 1975), and global anisotropic scaling (Usón et al., 1999) may be applied using a parameterization proposed by Parkin et al. (1995). An auxiliary program, SHELXWAT, allows automatic water divining by iterative least-squares refinement, rejection of waters with high displacement parameters, difference-electron-density calculation, and a peak search for potential water molecules that make at least one good hydrogen bond and no bad contacts; this is a simplified version of the ARP procedure of Lamzin & Wilson (1993).
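A common two-parameter Babinet-type bulk-solvent correction takes the following textbook form (an illustrative parameterization; SHELXL's exact implementation may differ in detail):

```latex
F_{\mathrm{corr}}(\mathbf h) = F_{\mathrm{calc}}(\mathbf h)
  \left[\,1 - k_{\mathrm{sol}}
  \exp\!\left(-B_{\mathrm{sol}}\,\frac{\sin^{2}\theta}{\lambda^{2}}\right)\right]
```

where k_sol and B_sol are the two refinable parameters: by Babinet's principle the flat solvent contributes the negative of the macromolecular transform at low resolution, so the correction damps the calculated amplitudes most strongly for the lowest-resolution reflections.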

25.2.10.4.7. Twinned crystals

SHELXL provides facilities for refining against data from merohedral, pseudo-merohedral and non-merohedral twins (Herbst-Irmer & Sheldrick, 1998). Refinement against data from merohedrally twinned crystals is particularly straightforward, requiring only the twin law (a 3 × 3 matrix) and starting values for the volume fractions of the twin components. Failure to recognize such twinning not only results in high R factors and poor-quality maps, it can also lead to incorrect biochemical conclusions (Luecke et al., 1998). Twinning can often be detected by statistical tests (Yeates & Fam, 1999), and it is probably much more widespread in macromolecular crystals than is generally appreciated!
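For a two-component merohedral twin, the model that such refinement fits can be sketched as follows (toy data, hypothetical twin law and intensities): each observed intensity is the fraction-weighted sum of the calculated intensities of the twin-related reflections.

```python
def twinned_intensity(hkl, I_calc, twin_law, alpha):
    """Observed intensity from a two-component merohedral twin.

    I_obs(h) = (1 - alpha) * I(h) + alpha * I(T h), where T is the
    3x3 twin law and alpha the minor-component volume fraction.
    I_calc: dict mapping (h, k, l) -> calculated intensity.
    """
    h2 = tuple(sum(twin_law[i][j] * hkl[j] for j in range(3))
               for i in range(3))
    return (1.0 - alpha) * I_calc[hkl] + alpha * I_calc[h2]

# A toy twin law exchanging h and k (with l -> -l) and a 30% minor twin:
T = [[0, 1, 0], [1, 0, 0], [0, 0, -1]]
I_calc = {(1, 2, 3): 100.0, (2, 1, -3): 20.0}
I_obs = twinned_intensity((1, 2, 3), I_calc, T, alpha=0.3)
```

In refinement, alpha is simply another least-squares parameter, which is why only the twin law and a starting volume fraction are needed.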

25.2.10.4.8. The radius of convergence

Least-squares refinement as implemented in SHELXL and other programs is appropriate for structural models that are relatively complete, but when an appreciable fraction of the structure is still to be located, maximum-likelihood refinement (Bricogne, 1991; Pannu & Read, 1996a; Murshudov et al., 1997) is likely to be more effective, especially when experimental phase information can be incorporated (Pannu et al., 1998). Within the least-squares framework, there are still several possible ways of improving the radius of convergence. SHELXL provides the option of gradually extending the resolution of the data during the refinement; a similar effect may be achieved by a resolution-dependent weighting scheme (Terwilliger & Berendzen, 1996). Unimodal restraints, such as target distances, are less likely to result in local minima than are multimodal restraints, such as torsion angles; multimodal functions are better used as validation criteria. It is fortunate that validation programs, such as PROCHECK (Laskowski et al., 1993), make good use of multimodal functions such as torsion angles and hydrogen-bonding patterns that are not employed as restraints in SHELXL refinements.

25.2.10.5. SHELXPRO – protein interface to SHELX

The SHELX system includes several auxiliary programs, the most important of which for macromolecular users is SHELXPRO. SHELXPRO provides an interface between SHELXS, SHELXL and other programs commonly used by protein crystallographers, particularly graphics programs; for example, it can write map files for O (Jones et al., 1991) or (Turbo)Frodo (Jones, 1978). For XtalView (McRee, 1992), this is not necessary, because XtalView can read the CIF-format reflection data files written by SHELXL directly. XtalView is generally the interactive macromolecular graphics program of choice for use with SHELX because it can interpret and display anisotropic displacement parameters and multiple conformations.

Often, SHELXL will be used only for the final stages of refinement, in which case SHELXPRO is used to generate the name.ins file from a PDB-format file, inserting the necessary restraints and other instructions. The geometric restraints for standard amino acids are based on those of Engh & Huber (1991). SHELXPRO is also used to prepare the name.ins file for a new refinement job based on the results of the previous refinement (possibly modified by an interactive graphics program such as XtalView) and to prepare data for PDB deposition. In addition, the refinement results can be summarized graphically in the form of PostScript plots.

25.2.10.6. Distribution and support of SHELX

The SHELX system is available free to academics and, for a small licence fee, to commercial users. The programs are supplied as Fortran77 sources and as precompiled versions for Linux and some other widely used operating systems. The programs, examples and extensive documentation may be downloaded by ftp or (if necessary) supplied on CD ROM. Details of new developments, answers to frequently asked questions, and information about obtaining and installing the programs are available from the SHELX homepage, http://shelx.uni-ac.gwdg.de/SHELX/ . The author is always interested to receive reports of problems and suggestions for improving the programs and their documentation by e-mail (gsheldr@shelx.uni-ac.gwdg.de ).

Acknowledgements

KDC (Section 25.2.2) acknowledges the support of the UK BBSRC (grant No. 87/B03785). KYJZ (Section 25.2.2) acknowledges the National Institutes of Health for grant support (GM55663).

For Section 25.2.3, support by the Howard Hughes Medical Institute and the National Science Foundation to ATB (DBI-9514819 and ASC 93-181159), the Natural Sciences and Engineering Research Council of Canada to NSP, the Howard Hughes Medical Institute and the Medical Research Council of Canada to RJR (MT11000), the Netherlands Foundation for Chemical Research (SON–NWO) to PG and the Howard Hughes Medical Institute to LMR is gratefully acknowledged.

Significant contributors to the programs in the PROCHECK suite (Section 25.2.6) include David K. Smith, E. Gail Hutchinson, David T. Jones, J. Antoon C. Rullmann, A. Louise Morris and Dorica Naylor. Part of the development work was funded by a grant from the EU Framework IV Biotechnology programme, contract CT96–0189.

Development of the programs described in Section 25.2.8 for research use was supported by NIH grant GM 15000 and by an educational leave from Glaxo Wellcome Inc. for J. Michael Word; development for teaching use was supported by NSF grant DUE-9980935.

References

Abrahams, J. P. (1993). Compression of X-ray images. Jt CCP4 ESF–EACBM Newsl. Protein Crystallogr. 28, 3–4.
Abrahams, J. P. (1996). Likelihood-weighted real space restraints for refinement at low resolution. In Proceedings of the CCP4 study weekend. Macromolecular refinement, edited by E. Dodson, M. Moore, A. Ralph & S. Bailey. Warrington: Daresbury Laboratory.
Abrahams, J. P. (1997). Bias reduction in phase refinement by modified interference functions: introducing the γ correction. Acta Cryst. D53, 371–376.
Abrahams, J. P. & Leslie, A. G. W. (1996). Methods used in the structure determination of bovine mitochondrial F1 ATPase. Acta Cryst. D52, 30–42.
Adams, P. D., Pannu, N. S., Read, R. J. & Brünger, A. T. (1997). Cross-validated maximum likelihood enhances crystallographic simulated annealing refinement. Proc. Natl Acad. Sci. USA, 94, 5018–5023.
Adobe Systems Inc. (1985). PostScript language reference manual. Reading, MA: Addison-Wesley.
Agarwal, R. C. (1978). A new least-squares technique based on the fast Fourier transform algorithm. Acta Cryst. A34, 791–809.
Agarwal, R. C., Lifchitz, A. & Dodson, E. (1981). Block diagonal least squares refinement using fast Fourier techniques. In Refinement of protein structures, edited by P. A. Machin, J. W. Campbell & M. Elder. Warrington: Daresbury Laboratory.
Allen, F. H., Bellard, S., Brice, M. D., Cartwright, B. A., Doubleday, A., Higgs, H., Hummelink, T., Hummelink-Peters, B. G., Kennard, O., Motherwell, W. D. S., Rodgers, J. R. & Watson, D. G. (1979). The Cambridge Crystallographic Data Centre: computer-based search, retrieval, analysis and display of information. Acta Cryst. B35, 2331–2339.
Axelsson, O. & Barker, V. (1984). Finite element solution of boundary value problems, ch. 1, pp. 1–63. Orlando: Academic Press.
Bateman, R. C. (2000). Undergraduate kinemage authorship homepage. http://orca.st.usm.edu/~rbateman/kinemage/.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F. Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535–542.
Blow, D. M. & Crick, F. H. C. (1959). The treatment of errors in the isomorphous replacement method. Acta Cryst. 12, 794–802.
Blundell, T. L. & Johnson, L. N. (1976). Protein crystallography, pp. 375–377. London: Academic Press.
Bolin, J. T., Smith, J. L. & Muchmore, S. W. (1993). Considerations in phase refinement and extension: experiments with a rapid and automatic procedure. Am. Crystallogr. Assoc. Meet. Abstracts, Vol. 21, V001, 51.
Bricogne, G. (1974). Geometric sources of redundancy in intensity data and their use for phase determination. Acta Cryst. A30, 395–405.
Bricogne, G. (1976). Methods and programs for direct-space exploitation of geometric redundancies. Acta Cryst. A32, 832–846.
Bricogne, G. (1984). Maximum entropy and the foundations of direct methods. Acta Cryst. A40, 410–445.
Bricogne, G. (1991). A multisolution method of phase determination by combined maximization of entropy and likelihood. III. Extension to powder diffraction data. Acta Cryst. A47, 803–829.
Bricogne, G. & Irwin, J. J. (1996). Maximum-likelihood refinement of incomplete models with BUSTER–TNT. In Proceedings of the macromolecular crystallographic computing school, edited by P. Bourne & K. Watenpaugh. http://www.iucr.org/iucr-top/comm/ccom/School96/pdf/gb1.pdf.
Brünger, A. T. (1988). Crystallographic refinement by simulated annealing: application to a 2.8 Å resolution structure of aspartate aminotransferase. J. Mol. Biol. 203, 803–816.
Brünger, A. T. (1992a). X-PLOR. Version 3.1. A system for X-ray crystallography and NMR. New Haven: Yale University Press.
Brünger, A. T. (1992b). Free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature (London), 355, 472–475.
Brünger, A. T., Adams, P. D., Clore, G. M., DeLano, W. L., Gros, P., Grosse-Kunstleve, R. W., Jiang, J.-S., Kuszewski, J., Nilges, M., Pannu, N. S., Read, R. J., Rice, L. M., Simonson, T. & Warren, G. L. (1998). Crystallography & NMR System (CNS): a new software suite for macromolecular structure determination. Acta Cryst. D54, 905–921.
Brünger, A. T., Adams, P. D. & Rice, L. M. (1997). New applications of simulated annealing in X-ray crystallography and solution NMR. Structure, 5, 325–336.
Brünger, A. T., Karplus, M. & Petsko, G. A. (1989). Crystallographic refinement by simulated annealing: application to crambin. Acta Cryst. A45, 50–61.
Brünger, A. T., Krukowski, A. & Erickson, J. W. (1990). Slow-cooling protocols for crystallographic refinement by simulated annealing. Acta Cryst. A46, 585–593.
Brünger, A. T., Kuriyan, J. & Karplus, M. (1987). Crystallographic R factor refinement by molecular dynamics. Science, 235, 458–460.
Buerger, M. J. (1959). Vector space and its application in crystal structure investigation. New York: Wiley.
Buerger, M. J. (1964). Image methods in crystal structure analysis. In Advanced methods of crystallography, edited by G. N. Ramachandran, pp. 1–24. Orlando, Florida: Academic Press.
Burling, F. T. & Brünger, A. T. (1994). Thermal motion and conformational disorder in protein crystal structures: comparison of multi-conformer and time-averaging models. Isr. J. Chem. 34, 165–175.
Burling, F. T., Weis, W. I., Flaherty, K. M. & Brünger, A. T. (1996). Direct observation of protein solvation and discrete disorder with experimental crystallographic phases. Science, 271, 72–77.
Busetta, B., Giacovazzo, C., Burla, M. C., Nunzi, A., Polidori, G. & Viterbo, D. (1980). The SIR program. I. Use of negative quartets. Acta Cryst. A36, 68–74.
Chapman, M. S., Tsao, J. & Rossmann, M. G. (1992). Ab initio phase determination for spherical viruses: parameter determination for spherical-shell models. Acta Cryst. A48, 301–312.
Collaborative Computational Project, Number 4 (1994). The CCP4 suite: programs for protein crystallography. Acta Cryst. D50, 760–763.
Cowtan, K. D. & Main, P. (1996). Phase combination and cross validation in iterated density-modification calculations. Acta Cryst. D52, 43–48.
Cowtan, K. D. & Main, P. (1998). Miscellaneous algorithms for density modification. Acta Cryst. D54, 487–493.
Cruickshank, D. W. J. (1970). Least-squares refinement of atomic parameters. In Crystallographic computing, edited by F. R. Ahmed, S. R. Hall & C. P. Huber, pp. 187–197. Copenhagen: Munksgaard.
Dauter, Z., Lamzin, V. S. & Wilson, K. S. (1997). The benefits of atomic resolution. Curr. Opin. Struct. Biol. 7, 681–688.
Debaerdemaeker, T., Tate, C. & Woolfson, M. M. (1985). On the application of phase relationships to complex structures. XXIV. The Sayre tangent formula. Acta Cryst. A41, 286–290.
Diederichs, K. & Karplus, P. A. (1997). Improved R-factors for diffraction data analysis in macromolecular crystallography. Nature Struct. Biol. 4, 269–274.
Eisenberg, D., Lüthy, R. & Bowie, J. U. (1997). VERIFY3D: assessment of protein models with three-dimensional profiles. Methods Enzymol. 277, 396–404.
Engh, R. A. & Huber, R. (1991). Accurate bond and angle parameters for X-ray protein structure refinement. Acta Cryst. A47, 392–400.
French, S. & Wilson, K. (1978). On the treatment of negative intensity observations. Acta Cryst. A34, 517–525.
Furey, W. & Swaminathan, S. (1990). PHASES – a program package for the processing and analysis of diffraction data from macromolecules. Am. Crystallogr. Assoc. Meet. Abstracts, Vol. 18, PA33, 73.
Furey, W. & Swaminathan, S. (1997). PHASES-95: a program package for processing and analyzing diffraction data from macromolecules. Methods Enzymol. 277, 590–620.
Giacovazzo, C. (2001). Direct methods. In International tables for crystallography, Vol. B. Reciprocal space, edited by U. Shmueli, ch. 2.2. Dordrecht: Kluwer Academic Publishers.
Graham, I. S. (1995). The HTML sourcebook. New York: John Wiley and Sons.
Green, D. W., Ingram, V. M. & Perutz, M. F. (1954). The structure of haemoglobin. IV. Sign determination by the isomorphous replacement method. Proc. R. Soc. London Ser. A, 225, 287–307.
Greer, J. (1974). Three-dimensional pattern recognition: an approach to automated interpretation of electron density maps of proteins. J. Mol. Biol. 82, 279–301.
Hendrickson, W. A. (1979). Phase information from anomalous-scattering measurements. Acta Cryst. A35, 245–247.
Hendrickson, W. A. (1991). Determination of macromolecular structures from anomalous diffraction of synchrotron radiation. Science, 254, 51–58.
Hendrickson, W. A. & Konnert, J. H. (1980). Incorporation of stereochemical information into crystallographic refinement. In Computing in crystallography, edited by R. Diamond, S. Ramaseshan & K. Venkatesan, pp. 13.01–13.23. Bangalore: Indian Academy of Sciences.
Hendrickson, W. A. & Lattman, E. E. (1970). Representation of phase probability distributions for simplified combination of independent phase information. Acta Cryst. B26, 136–143.
Herbst-Irmer, R. & Sheldrick, G. M. (1998). Refinement of twinned structures with SHELXL97. Acta Cryst. B54, 443–449.
Hirshfeld, F. L. (1976). Can X-ray data distinguish bonding effects from vibrational smearing? Acta Cryst. A32, 239–244.
Holmes, M. A. & Matthews, B. W. (1981). Binding of hydroxamic acid inhibitors to crystalline thermolysin suggests a pentacoordinate zinc intermediate in catalysis. Biochemistry, 20, 6912–6920.
Hooft, R. W. W., Sander, C., Vriend, G. & Abola, E. E. (1996). Errors in protein structures. Nature (London), 381, 272.
IUPAC–IUB Commission on Biochemical Nomenclature (1970). Abbreviations and symbols for the description of the conformation of polypeptide chains. J. Mol. Biol. 52, 1–17.
Jack, A. & Levitt, M. (1978). Refinement of large structures by simultaneous minimization of energy and R factor. Acta Cryst. A34, 931–935.
Jiang, J.-S. & Brünger, A. T. (1994). Protein hydration observed by X-ray diffraction: solvation properties of penicillopepsin and neuraminidase crystal structures. J. Mol. Biol. 243, 100–115.
Jones, T. A. (1978). A graphics model building and refinement system for macromolecules. J. Appl. Cryst. 11, 268–272.
Jones, T. A., Zou, J.-Y., Cowan, S. W. & Kjeldgaard, M. (1991). Improved methods for building protein models in electron density maps and the location of errors in these models. Acta Cryst. A47, 110–119.
Kabsch, W. (1976). A solution for the best rotation to relate two sets of vectors. Acta Cryst. A32, 922–923.
Kabsch, W. (1988a). Automatic indexing of rotation diffraction patterns. J. Appl. Cryst. 21, 67–72.
Kabsch, W. (1988b). Evaluation of single-crystal X-ray diffraction data from a position-sensitive detector. J. Appl. Cryst. 21, 916–924.
Kabsch, W. (1993). Automatic processing of rotation diffraction data from crystals of initially unknown symmetry and cell constants. J. Appl. Cryst. 26, 795–800.
Karle, J. & Hauptman, H. (1956). A theory of phase determination for the four types of non-centrosymmetric space groups 1P222, 2P22, 3P12, 3P22. Acta Cryst. 9, 635–651.
Kleywegt, G. J. & Brünger, A. T. (1996). Checking your imagination: applications of the free R value. Structure, 4, 897–904.
Kleywegt, G. J. & Jones, T. A. (1996a). Phi/psi-chology: Ramachandran revisited. Structure, 4, 1395–1400.
Kleywegt, G. J. & Jones, T. A. (1996b). Efficient rebuilding of protein structures. Acta Cryst. D52, 829–832.
Kraulis, P. J. (1991). MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J. Appl. Cryst. 24, 946–950.
Lamzin, V. S. & Wilson, K. S. (1993). Automated refinement of protein models. Acta Cryst. D49, 129–147.
Lamzin, V. S. & Wilson, K. S. (1997). Automated refinement for protein crystallography. Methods Enzymol. 277, 269–305.
Laskowski, R. A., MacArthur, M. W., Moss, D. S. & Thornton, J. M. (1993). PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Cryst. 26, 283–291.
Laskowski, R. A., MacArthur, M. W. & Thornton, J. M. (1998). Validation of protein models derived from experiment. Curr. Opin. Struct. Biol. 8, 631–639.
Laskowski, R. A., Rullmann, J. A. C., MacArthur, M. W., Kaptein, R. & Thornton, J. M. (1996). AQUA and PROCHECK-NMR: programs for checking the quality of protein structures solved by NMR. J. Biomol. Nucl. Magn. Reson. 8, 477–486.
Leslie, A. G. W. (1987). A reciprocal-space method for calculating a molecular envelope using the algorithm of B. C. Wang. Acta Cryst. A43, 134–136.
Lewis, M. & Rees, D. C. (1983). Statistical modification of anomalous scattering differences. Acta Cryst. A39, 512–515.
Lovell, S. C., Word, J. M., Richardson, J. S. & Richardson, D. C. (2000). The penultimate rotamer library. Proteins Struct. Funct. Genet. 40, 389–408.
Luecke, H., Richter, H. T. & Lanyi, J. K. (1998). Proton transfer pathways in bacteriorhodopsin at 2.3 Å resolution. Science, 280, 1934–1937.
MacArthur, M. W., Laskowski, R. A. & Thornton, J. M. (1994). Knowledge-based validation of protein structure coordinates derived by X-ray crystallography and NMR spectroscopy. Curr. Opin. Struct. Biol. 4, 731–737.
McRee, D. E. (1992). A visual protein crystallographic software system for X11/Xview. J. Mol. Graphics, 10, 44–46.
McRee, D. E. (1993). Practical protein crystallography. San Diego: Academic Press.
McRee, D. E. (1999). XtalView/Xfit – a versatile program for manipulating atomic coordinates and electron density. J. Struct. Biol. 125, 156–165.
Matthews, B. W. (1968). Solvent content of protein crystals. J. Mol. Biol. 33, 491–497.
Matthews, B. W. (1974). Determination of molecular weight from protein crystals. J. Mol. Biol. 82, 513–526.
Merritt, E. A. (2000). Raster3D (photorealistic molecular graphics). http://www.bmsc.washington.edu/raster3d.
Merritt, E. A. & Bacon, D. J. (1997). Raster3D: photorealistic molecular graphics. Methods Enzymol. 277, 505–525.
Miller, R., DeTitta, G. T., Jones, R., Langs, D. A., Weeks, C. M. & Hauptman, H. A. (1993). On the application of the minimal principle to solve unknown structures. Science, 259, 1430–1433.
Miller, R., Gallo, S. M., Khalak, H. G. & Weeks, C. M. (1994). SnB: crystal structure determination via Shake-and-Bake. J. Appl. Cryst. 27, 613–621.
Moews, P. C. & Kretsinger, R. H. (1975). Refinement of carp muscle parvalbumin by model building and difference Fourier analysis. J. Mol. Biol. 91, 201–228.
Morris, A. L., MacArthur, M. W., Hutchinson, E. G. & Thornton, J. M. (1992). Stereochemical quality of protein structure coordinates. Proteins, 12, 345–364.
MSI (1997). QUANTA. MSI, 9685 Scranton Road, San Diego, CA 92121-3752, USA.
Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Refinement of macromolecular structures by the maximum-likelihood method. Acta Cryst. D53, 240–255.
Navaza, J. (1994). AMoRe: an automated package for molecular replacement. Acta Cryst. A50, 157–163.
Oldfield, T. J. (1992). SQUID: a program for the analysis and display of data from crystallography and molecular dynamics. J. Mol. Graphics, 10, 247–252.
Otwinowski, Z. (1991). In Proceedings of the CCP4 study weekend. Isomorphous replacement and anomalous scattering, edited by W. Wolf, P. R. Evans & A. G. W. Leslie, pp. 80–86. Warrington: Daresbury Laboratory.
Pannu, N. S., Murshudov, G. N., Dodson, E. J. & Read, R. J. (1998). Incorporation of prior phase information strengthens maximum-likelihood structure refinement. Acta Cryst. D54, 1285–1294.
Pannu, N. S. & Read, R. J. (1996a). Improved structure refinement through maximum likelihood. Acta Cryst. A52, 659–668.
Pannu, N. S. & Read, R. J. (1996b). Improved structure refinement through maximum likelihood. In Proceedings of the CCP4 study weekend. Macromolecular refinement, edited by E. Dodson, M. Moore, A. Ralph & S. Bailey. Warrington: Daresbury Laboratory.
Parkin, S., Moezzi, B. & Hope, H. (1995). XABS2: an empirical absorption correction program. J. Appl. Cryst. 28, 53–56.
Pepinsky, R. & Okaya, Y. (1956). Determination of crystal structures by means of anomalously scattered X-rays. Proc. Natl Acad. Sci. USA, 42, 286–292.
Perrakis, A., Morris, R. & Lamzin, V. S. (1999). Automated protein model building combined with iterative structure refinement. Nature Struct. Biol. 6, 458–463.
Perrakis, A., Sixma, T. K., Wilson, K. S. & Lamzin, V. S. (1997). wARP: improvement and extension of crystallographic phases by weighted averaging of multiple-refined dummy atomic models. Acta Cryst. D53, 448–455.
Perrakis, A., Tews, I., Dauter, Z., Oppenheim, A., Chet, I., Wilson, K. S. & Vorgias, C. E. (1994). Structure of a bacterial chitinase at 2.3 Å resolution. Structure, 2, 1169–1180.
Pontius, J., Richelle, J. & Wodak, S. (1996). Deviations from standard atomic volumes as a quality measure for protein crystal structures. J. Mol. Biol. 264, 121–136.
POV-Ray Team (2000). POV-Ray – the Persistence of Vision Raytracer. http://www.povray.org.
Ramachandran, G. N., Ramakrishnan, C. & Sasisekharan, V. (1963). Stereochemistry of polypeptide chain configurations. J. Mol. Biol. 7, 95–99.
Ramakrishnan, C. & Ramachandran, G. N. (1965). Stereochemical criteria for polypeptide and protein chain conformations. II. Allowed conformations for a pair of peptide units. Biophys. J. 5, 909–933.
Read, R. J. (1986). Improved Fourier coefficients for maps using phases from partial structures with errors. Acta Cryst. A42, 140–149.
Read, R. J. (1990). Structure-factor probabilities for related structures. Acta Cryst. A46, 900–912.
Read, R. J. (1994). Maximum likelihood refinement of heavy atoms. Lecture notes for a workshop on isomorphous replacement methods in macromolecular crystallography. American Crystallographic Association Annual Meeting, 1994, Atlanta, GA, USA.
Read, R. J. (1997). Model phases: probabilities and bias. Methods Enzymol. 277, 110–128.
Research Collaboratory for Structural Bioinformatics (2000). The RCSB Protein Data Bank. http://www.rcsb.org/pdb.
Rice, L. M. & Brünger, A. T. (1994). Torsion angle dynamics: reduced variable conformational sampling enhances crystallographic structure refinement. Proteins Struct. Funct. Genet. 19, 277–290.
Richardson, D. C. & Richardson, J. S. (1992). The kinemage: a tool for scientific illustration. Protein Sci. 1, 3–9.
Richardson, D. C. & Richardson, J. S. (1994). Kinemages – simple macromolecular graphics for interactive teaching and publication. Trends Biochem. Sci. 19, 135–138.
Richardson, J. W. & Jacobson, R. A. (1987). Computer-aided analysis of multi-solution Patterson superpositions. In Patterson and Pattersons, edited by J. P. Glusker, B. Patterson & M. Rossi, pp. 311–317. Oxford: IUCr and Oxford University Press.
Richardson Laboratory (2000). The Richardsons' 3-D protein structure homepage. http://kinemage.biochem.duke.edu (or ftp://kinemage.biochem.duke.edu).
Rollett, J. S. (1970). Least-squares procedures in crystal structure analysis. In Crystallographic computing, edited by F. R. Ahmed, S. R. Hall & C. P. Huber, pp. 167–181. Copenhagen: Munksgaard.
Rossmann, M. G. & Arnold, E. (2001). Patterson and molecular-replacement techniques. In International tables for crystallography, Vol. B. Reciprocal space, edited by U. Shmueli, ch. 2.3. Dordrecht: Kluwer Academic Publishers.
Rossmann, M. G. & Blow, D. M. (1963). Determination of phases by the conditions of non-crystallographic symmetry. Acta Cryst. 16, 39–44.
Rossmann, M. G., McKenna, R., Tong, L., Xia, D., Dai, J.-B., Wu, H., Choi, H.-K. & Lynch, R. E. (1992). Molecular replacement real-space averaging. J. Appl. Cryst. 25, 166–180.
Sack, J. S. (1988). CHAIN – a crystallographic modeling program. J. Mol. Graphics, 6, 224–225.
Schuller, D. J. (1996). MAGICSQUASH: more versatile non-crystallographic averaging with multiple constraints. Acta Cryst. D52, 425–434.
Sheldrick, G. M. (1985). Computing aspects of crystal structure determination. J. Mol. Struct. 130, 9–16.
Sheldrick, G. M. (1990). Phase annealing in SHELX-90: direct methods for larger structures. Acta Cryst. A46, 467–473.
Sheldrick, G. M. (1991). Tutorial on automated Patterson interpretation to find heavy atoms. In Crystallographic computing 5. From chemistry to biology, edited by D. Moras, A. D. Podjarny & J. C. Thierry, pp. 145–157. Oxford: IUCr and Oxford University Press.
Sheldrick, G. M. (1993). Refinement of large small-molecule structures using SHELXL-92. In Crystallographic computing 6. A window on modern crystallography, edited by H. D. Flack, L. Párkányi & K. Simon, pp. 111–122. Oxford: IUCr and Oxford University Press.
Sheldrick, G. M. (1997). Direct methods based on real/reciprocal space iteration. In Proceedings of the CCP4 study weekend. Recent advances in phasing, edited by K. S. Wilson, G. Davies, A. W. Ashton & S. Bailey, pp. 147–157. Warrington: Daresbury Laboratory.
Sheldrick, G. M. (1998a). Location of heavy atoms by automated Patterson interpretation. In Direct methods for solving macromolecular structures, edited by S. Fortier, pp. 131–141. Dordrecht: Kluwer Academic Publishers.
Sheldrick, G. M. (1998b). SHELX: applications to macromolecules. In Direct methods for solving macromolecular structures, edited by S. Fortier, pp. 401–411. Dordrecht: Kluwer Academic Publishers.
Sheldrick, G. M., Dauter, Z., Wilson, K. S., Hope, H. & Sieker, L. C. (1993). The application of direct methods and Patterson interpretation to high-resolution native protein data. Acta Cryst. D49, 18–23.
Sheldrick, G. M. & Gould, R. O. (1995). Structure solution by iterative peaklist optimization and tangent expansion in space group P1. Acta Cryst. B51, 423–431.
Sheldrick, G. M. & Schneider, T. R. (1997). SHELXL: high resolution refinement. Methods Enzymol. 277, 319–343.
Sim, G. A. (1959). The distribution of phase angles for structures containing heavy atoms. II. A modification of the normal heavy-atom method for non-centrosymmetrical structures. Acta Cryst. 12, 813–814.
Stout, G. H. & Jensen, L. H. (1989). X-ray structure determination, p. 33. New York: Wiley Interscience.
Sussman, J. L., Holbrook, S. R., Church, G. M. & Kim, S.-H. (1977). A structure-factor least-squares refinement procedure for macromolecular structures using constrained and restrained parameters. Acta Cryst. A33, 800–804.
Ten Eyck, L. F. (1973). Crystallographic fast Fourier transforms. Acta Cryst. A29, 183–191.
Ten Eyck, L. F. (1977). Efficient structure-factor calculation for large molecules by the fast Fourier transform. Acta Cryst. A33, 486–492.
Terwilliger, T. C. & Berendzen, J. (1996). Bayesian weighting for macromolecular crystallographic refinement. Acta Cryst. D52, 743–748.
Terwilliger, T. C. & Eisenberg, D. (1987a). Isomorphous replacement: effects of errors on the phase probability distribution. Acta Cryst. A43, 6–13.
Terwilliger, T. C. & Eisenberg, D. (1987b). Isomorphous replacement: effects of errors on the phase probability distribution. Erratum. Acta Cryst. A43, 286.
Tickle, I. J., Laskowski, R. A. & Moss, D. S. (1998). Error estimates of protein structure coordinates and deviations from standard geometry by full-matrix refinement of γB- and βB2-crystallin. Acta Cryst. D54, 243–252.
Tronrud, D. E. (1992). Conjugate-direction minimization: an improved method for the refinement of macromolecules. Acta Cryst. A48, 912–916.
Tronrud, D. E. (1996). Knowledge-based B-factor restraints for the refinement of proteins. J. Appl. Cryst. 29, 100–104.
Tronrud, D. E. (1997). TNT refinement package. Methods Enzymol. 277, 306–319.
Tronrud, D. E., Ten Eyck, L. F. & Matthews, B. W. (1987). An efficient general-purpose least-squares refinement program for macromolecular structures. Acta Cryst. A43, 489–501.
Trueblood, K. N. & Dunitz, J. D. (1983). Internal molecular motions in crystals. The estimation of force constants, frequencies and barriers from diffraction data. A feasibility study. Acta Cryst. B39, 120–133.
Tsao, J., Chapman, M. S. & Rossmann, M. G. (1992). Ab initio phase determination for viruses with high symmetry: a feasibility study. Acta Cryst. A48, 293–301.
Usón, I., Pohl, E., Schneider, T. R., Dauter, Z., Schmidt, A., Fritz, H.-J. & Sheldrick, G. M. (1999). 1.7 Å structure of the stabilised REIv mutant T39K. Application of local NCS restraints. Acta Cryst. D55, 1158–1167.
Vellieux, F. M. D. A. P., Hunt, J. F., Roy, S. & Read, R. J. (1995). DEMON/ANGEL: a suite of programs to carry out density modification. J. Appl. Cryst. 28, 347–351.
Walther, D. & Cohen, F. E. (1999). Conformational attractors on the Ramachandran map. Acta Cryst. D55, 506–517.
Wang, B. C. (1985). Resolution of phase ambiguity in macromolecular crystallography. Methods Enzymol. 115, 90–112.
Watenpaugh, K. D., Sieker, L. C., Herriott, J. R. & Jensen, L. H. (1973). Refinement of the model of a protein: rubredoxin at 1.5 Å resolution. Acta Cryst. B29, 943–956.
Weis, W. I., Brünger, A. T., Skehel, J. J. & Wiley, D. C. (1990). Refinement of the influenza virus haemagglutinin by simulated annealing. J. Mol. Biol. 212, 737–761.
Wilson, A. J. C. (1949). The probability distribution of X-ray intensities. Acta Cryst. 2, 318–321.
Wilson, K. S., Butterworth, S., Dauter, Z., Lamzin, V. S., Walsh, M., Wodak, S., Pontius, J., Richelle, J., Vaguine, A., Sander, C., Hooft, R. W. W., Vriend, G., Thornton, J. M., Laskowski, R. A., MacArthur, M. W., Dodson, E. J., Murshudov, G., Oldfield, T. J., Kaptein, R. & Rullmann, J. A. C. (1998). Who checks the checkers? Four validation tools applied to eight atomic resolution structures. J. Mol. Biol. 276, 417–436.
Word, J. M., Bateman, R. C., Presley, B. K., Lovell, S. C. & Richardson, D. C. (2000). Exploring steric constraints on protein mutations using MAGE/PROBE. Protein Sci. 9, 2251–2259.
Word, J. M., Lovell, S. C., LaBean, T. H., Taylor, H. C., Zalis, M. E., Presley, B. K., Richardson, J. S. & Richardson, D. C. (1999). Visualizing and quantifying molecular goodness-of-fit: small-probe contact dots with explicit hydrogen atoms. J. Mol. Biol. 285, 1711–1733.
Word, J. M., Lovell, S. C., Richardson, J. S. & Richardson, D. C. (1999). Asparagine and glutamine: using hydrogen atom contacts in the choice of side-chain amide orientation. J. Mol. Biol. 285, 1735–1747.
Yeates, T. O. & Fam, B. C. (1999). Protein crystals and their evil twins. Structure, 7, R25–R29.
Zhang, K. Y. J. & Main, P. (1990a). Histogram matching as a new density modification technique for phase refinement and extension of protein molecules. Acta Cryst. A46, 41–46.
Zhang, K. Y. J. & Main, P. (1990b). The use of Sayre's equation with solvent flattening and histogram matching for phase extension and refinement of protein structures. Acta Cryst. A46, 377–381.