International
Tables for
Crystallography
Volume F
Crystallography of biological macromolecules
Edited by E. Arnold, D. M. Himmel and M. G. Rossmann

International Tables for Crystallography (2012). Vol. F, ch. 18.8, pp. 525-528
https://doi.org/10.1107/97809553602060000862

Chapter 18.8. ARP/wARP – automated model building and refinement

V. S. Lamzin,ᵃ* A. Perrakisᵇ and K. S. Wilsonᶜ

ᵃEuropean Molecular Biology Laboratory (EMBL), Hamburg Outstation, c/o DESY, Notkestr. 85, 22603 Hamburg, Germany, ᵇEuropean Molecular Biology Laboratory (EMBL), Grenoble Outstation, c/o ILL, Avenue des Martyrs, BP 156, 38042 Grenoble CEDEX 9, France, and ᶜStructural Biology Laboratory, Department of Chemistry, University of York, Heslington, York YO10 5DD, England
Correspondence e-mail:  victor@embl-hamburg.de

The ARP/wARP suite enables automated model building and refinement for macromolecular crystallography. It is based on the use of hybrid models integrated with model refinement and reconstruction to provide a unified approach. The ARP/wARP protocols are computationally efficient and provide an easy-to-use pipeline for the construction of structures of proteins, polynucleotides, bound ligands and solvent.

A macromolecular model is an interpretation of the end result of an X-ray experiment – the electron-density map – and is aimed to be consistent with both the experimental data and stereochemical knowledge. At the start of the process, the crystallographer must construct an initial molecular model in the electron density. In the early days this was done literally with nuts and bolts, but was subsequently carried out with the aid of interactive modelling software. Over the last decade, this time-demanding and highly subjective step, heavily dependent on the researcher's experience, has been largely substituted by automated procedures. Automated model building programs, of which ARP/wARP was one of the first to be widely used, deliver high-quality molecular models rapidly and efficiently from a sufficiently good set of initial phases.

18.8.1. Refinement and model building are two parts of modelling a structure

The conventional view of crystallographic refinement of macromolecules is the optimization of the parameters of a model to fit both the experimental data and a set of a priori stereochemical observations. The user provides a model, whose parameters are allowed to vary during a set of minimization cycles based on least-squares or maximum-likelihood protocols. In conventional refinement procedures, the presence of the atoms is fixed, i.e. there is no addition or removal of parts of the model. As a result, users are often faced with a situation where several atoms in the model lie in one place, while the density maps suggest an entirely different location. The minimization procedures themselves are unable to make such major modifications to the model. Manual intervention, consisting of moving atoms to a more appropriate place using molecular graphics, density maps and geometrical assumptions, is required to enable improvement of the model to proceed.

ARP/wARP has built on two innovative concepts (Lamzin & Wilson, 1993, 1997; Perrakis et al., 1999).

  • (1) It challenged the previous separation of model building and refinement software by taking a more general view of the underlying phase optimization through coordinate manipulation and by incorporating the data-driven extension of the macromolecular model as part of an integrated approach. Addition and/or deletion of atoms (model update) and complete re-evaluation of the model with an aim of creating a modified one that better describes the electron density (model reconstruction) are essential components. Atom removal is primarily based on the interpolated likelihood-weighted electron density at an atomic centre and the agreement of the atomic density distribution with a target shape. Atom addition uses the difference likelihood-gradient synthesis. The selection is based on map grid points rather than peaks, as the latter are often poorly defined and may overlap with neighbouring peaks or existing atoms, especially if the resolution and phases are poor.

  • (2) ARP/wARP also challenged the view that throughout the structure-determination process a macromolecular model should always consist of a known set of atoms of an appropriate type, connected by bonds according to the known stereochemistry. The concept of a hybrid model, described in detail in Section 18.8.2, was introduced where a partial macromolecular model is deduced from the interpretable part of the electron density, while the remaining regions are modelled with free atoms that have undefined chemical identity and no bonds. This placed structure solution on a more unified foundation and enabled automation in linking the entire procedure.
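The model update in item (1) can be illustrated with a minimal Python sketch. It is not the ARP/wARP implementation: the function name, the thresholds and the minimum-distance check are assumptions made for illustration, and the shape-agreement criterion used for atom removal is left out. Atoms are removed when the density at their centre is weak, and new atoms are placed at the strongest difference-map grid points that do not overlap with existing atoms.

```python
from math import dist

def update_model(atoms, density_at, diff_grid, remove_below, add_above, min_dist=1.0):
    """One simplified model-update step (hypothetical helper).

    atoms        -- list of (x, y, z) positions in angstroms
    density_at   -- callable returning the map density at an atomic centre
    diff_grid    -- dict mapping grid-point coordinates to difference density
    remove_below -- assumed density threshold below which an atom is removed
    add_above    -- assumed difference-density threshold for adding an atom
    min_dist     -- assumed minimum distance to any existing atom
    """
    # Removal: keep only atoms sitting in sufficiently strong density.
    kept = [a for a in atoms if density_at(a) >= remove_below]

    # Addition: visit grid points (not peaks) from strongest to weakest.
    for point, value in sorted(diff_grid.items(), key=lambda kv: -kv[1]):
        if value < add_above:
            break
        # Reject grid points that would overlap with an atom already present.
        if all(dist(point, a) >= min_dist for a in kept):
            kept.append(point)
    return kept
```

In a real map the density at an atomic centre would be interpolated from neighbouring grid points; here it is simply supplied by the caller.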

The hybrid model undergoes optimization with the REFMAC engine (Murshudov et al., 1997), where the model parameters are adjusted to better fit the experimental data and a priori stereochemical expectations. If the quality of the hybrid model is sufficiently high, the phases improve overall and a more complete model can be constructed from the enhanced electron density. The hybrid model iteratively evolves (precisely what is needed to escape from a local minimum of the optimization landscape) and should converge on the final macromolecular model, with the remaining free atoms approximating the surrounding solvent structure. The fundamental limitations of ARP/wARP as well as related macromolecular structure determination approaches are the amount of the available X-ray data (resolution) and the quality of the initial phase estimates.

18.8.2. Free-atom and hybrid models

A crystallographic electron-density map is always sampled on a regular grid. The goal of macromolecular model building is to condense the electron-density information to a molecular model. ARP/wARP entirely follows the paradigm of syntactic model construction starting from free atoms, through candidate Cα atoms, peptide and dipeptide units and chain fragments to sequence docking and side-chain building. As a first step, ARP/wARP condenses the map information to a set of free atoms (Agarwal & Isaacs, 1977), as described by Lamzin & Wilson (1997), that have undefined chemical identity: these atoms are chosen to represent the electron density as accurately as possible, but resemble a protein-like model in their distribution. The free atoms can be complemented by an initial (partial) protein model if available. To build a free-atom model, only an estimate of the molecular weight of the protein is required, without any sequence information. For example, for protein-model building a map covering a crystallographic asymmetric unit on a fine grid of about 0.33 Å is constructed. The model is gradually expanded from a seed by the stepwise addition of free atoms in lower electron density and at bonding distances from existing atoms. The procedure continues until the number of atoms is about 3n, where n is the number expected. This number is then reduced to about n + 10% atoms by removing atoms in the lowest density. This procedure has the advantage that it places the free atoms at protein-like distances while covering the whole volume of the protein.
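The expand-then-prune procedure described above can be sketched in a few lines of Python. This is an illustrative simplification, not the ARP/wARP code: the function name is hypothetical, the bonding-distance window of 1.1 to 1.8 Å is an assumed value (the text only says 'bonding distances'), and the map is passed as a plain dictionary of grid points.

```python
from math import dist

def build_free_atoms(grid, n_expected, bond_min=1.1, bond_max=1.8):
    """Grow a free-atom model from a seed, then prune it (hypothetical helper).

    grid       -- dict mapping (x, y, z) grid points in angstroms to density
    n_expected -- expected number of atoms, n
    bond_min, bond_max -- assumed window of acceptable bonding distances
    """
    # Rank grid points by density; the strongest one seeds the model.
    pending = sorted(grid, key=grid.get, reverse=True)
    atoms = [pending.pop(0)]

    # Expansion: repeatedly add the strongest remaining grid point that lies
    # at a bonding distance from the model, until about 3n atoms are placed.
    while len(atoms) < 3 * n_expected:
        pick = next(
            (p for p in pending
             if bond_min <= min(dist(p, a) for a in atoms) <= bond_max),
            None,
        )
        if pick is None:  # nothing left within bonding distance
            break
        atoms.append(pick)
        pending.remove(pick)

    # Pruning: keep about n + 10% atoms, discarding those in the lowest density.
    atoms.sort(key=grid.get, reverse=True)
    return atoms[:round(1.1 * n_expected)]
```

The nested distance search is deliberately naive; a grid of realistic size would need a spatial index to run in reasonable time.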

Such a free-atom model can describe almost every feature of an electron-density map, but this interpretation rarely provides a conventional picture of a protein. Nevertheless, information from parts of the improved map and the free-atom model can be used to recognize automatically real elements of protein structure by applying the model-reconstruction algorithms, so that at least a partial atomic protein model can be built. During iterative model building and refinement some free atoms gain chemical identity, whereas others remain free. Combination of this partial protein model with a free-atom set (a hybrid model) allows a considerably better description of the current map. The protein model provides additional information (in the form of stereochemical restraints), while prominent features in the electron density (unaccounted for by the current model) are described by free atoms. Importantly, use of the atomic positions in the hybrid model, rather than map grid points, as guides for model building in the electron-density maps allows the implementation of computationally more efficient algorithms.

18.8.3. ARP/wARP applications

Although ARP/wARP started life as a tool to build proteins, it now provides a wider spectrum of functionalities. As of the software version 7.1 (2009) ARP/wARP tackles several tasks in macromolecular crystallography:

  • (i) iterative protein-model building including a decision-making control module;

  • (ii) fast construction of the secondary structure of a protein;

  • (iii) sequence docking and side-chain building and refinement;

  • (iv) building flexible loops in alternate conformations;

  • (v) building DNA or RNA polynucleotide fragments;

  • (vi) fully automated placement of ligands, including a choice of the best-fitting ligand from a `cocktail'; and

  • (vii) locating ordered water molecules.

ARP/wARP protocols are computationally efficient so that the time required is typically of the order of a few minutes on modern workstations, although iterative protein-model building may take a few hours. ARP/wARP applications are briefly described below, but for a detailed overview the reader is referred to Langer et al. (2008) or the literature cited hereafter.

18.8.3.1. Iterative protein-model building

If initial experimental phase information is available, the hybrid models are used as the main tool for obtaining as full a protein model as possible from the map calculated with the initial phases. A molecular-replacement solution can be directly used as an initial hybrid model. Given the information contained in the hybrid model in the traditional form of stereochemical restraints, refinement can work more efficiently, new improved phases can be obtained and a more accurate and complete protein model can be constructed. The new hybrid model is re-input to the refinement and these steps are iterated so that improved phases result in the construction of longer fragments of the protein chains. An almost complete protein model can often be obtained in a fully automated way (Fig. 18.8.3.1).

Figure 18.8.3.1. A flow chart of the iterative building of a protein structure with ARP/wARP.

Automatic density-map interpretation for proteins is based on the location of the atoms in the current model and consists of several steps. Firstly, each pair of atoms in the free-atom model located at a distance of 3.8 ± 1.0 Å is assigned a potential peptide connection. The method utilizes the fact that all residues that comprise a protein, with the exception of cis-peptides, have chemically and structurally identical main-chain fragments with a highly characteristic, close-to-planar shape: the Cα—C—O—N—Cα trans-peptide units.
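The first step, flagging pairs of free atoms that could be neighbouring Cα atoms, amounts to a distance filter. The Python sketch below shows the idea under simplifying assumptions: the function name is hypothetical, pairs are treated as undirected, and every pair is tested by brute force.

```python
from itertools import combinations
from math import dist

def candidate_peptide_pairs(atoms, ideal=3.8, tolerance=1.0):
    """Return index pairs of atoms separated by ideal +/- tolerance angstroms.

    atoms -- list of (x, y, z) free-atom positions in angstroms
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(atoms), 2):
        # A trans-peptide places consecutive C-alpha atoms about 3.8 A apart.
        if abs(dist(a, b) - ideal) <= tolerance:
            pairs.append((i, j))
    return pairs
```

Each pair returned here is only a candidate. Whether it is a real peptide, and in which direction the chain runs, is decided in the later density-matching and tracing steps.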

The problem of searching for possible peptide units and their connections thus becomes straightforward. Peptide recognition between potential Cα pairs takes place by matching the electron density that surrounds each potential pair (Lamzin & Wilson, 1997) to that pre-computed for true Cα pairs from known structures. The peptide search uses density shapes sampled at 915 points, which account not only for the resolution of the X-ray data, but also for the fall-off in the diffraction intensity with resolution (the Wilson B factor). The important constraint here is that proteins are composed of linear non-branching polypeptide chains, allowing sets of connected peptides to be obtained from an initial list of all possible tracings. A limited depth-first graph-search algorithm is used for the assembly of peptides into linear polypeptide chain fragments for protein chain tracing (Morris et al., 2002). Selection of the direction of a chain path is based on the electron density and observed backbone conformations. However, the set of peptide units and the list of how they are interconnected do not allow unambiguous tracing of a full-length chain for most structures.
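A depth-limited depth-first search over a directed graph of peptide candidates can be sketched as follows. This is a generic illustration of the technique rather than the algorithm of Morris et al. (2002): the function name and the depth limit are assumptions, and paths are ranked by length alone, whereas the text states that density and backbone conformation guide the choice.

```python
def longest_chain(edges, max_depth=50):
    """Find the longest linear path with a depth-limited depth-first search.

    edges     -- dict mapping each node to a list of successor nodes, where
                 an edge i -> j stands for a candidate peptide from i to j
    max_depth -- assumed limit on the number of nodes in a path
    """
    best = []

    def extend(path, visited):
        nonlocal best
        if len(path) > len(best):
            best = list(path)
        if len(path) >= max_depth:
            return  # depth limit reached: do not search any further
        for nxt in edges.get(path[-1], ()):
            # A polypeptide is linear and non-branching, so no node is reused.
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                extend(path, visited)
                path.pop()
                visited.remove(nxt)

    for start in edges:
        extend([start], {start})
    return best
```

For example, with `edges = {0: [1, 3], 1: [2], 2: [4], 3: [], 4: []}` the search returns the path `[0, 1, 2, 4]` and discards the shorter branch through node 3.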

The crucial bottleneck in automatically reconstructing a protein model from electron-density maps is achieving an initial trace of the polypeptide chain, typically approximated by many or hopefully only a few polypeptide fragments. Iteratively, these fragments should be made longer, their number should decrease and possible tracing errors, arising from poor starting phases, should be eliminated. Following that, the fragments have to be placed in sequence, side chains have to be built and gaps between the fragments that have been docked in sequence have to be filled.

The complexity of the chain tracing can be illustrated as follows. Let us assume an example with 10 000 trial peptide units, where each free atom has, on average, two incoming and two outgoing peptides; this is typical for a good electron-density map of a 100 kDa (1000 residue) protein. Assuming that both ends of the chain are known and that it is possible to connect all the points regardless of the chosen route, then one is faced with the problem of choosing the best chain out of 2⁹⁹⁹ candidates. Screening all possible pairs of chain ends increases this number even further. Clearly, an exhaustive search is not computationally feasible. In practice, the situation is even more complex, since not all trial peptides are necessarily correctly identified in the first iteration, and some may be missing – analogous to the correctness or incorrectness of the atomic positions mentioned above. ARP/wARP uses a number of algorithmic shortcuts and condenses a vast amount of information, e.g. for the above example from 10⁶ grid points of the map, through 10⁴ free atoms, 2 × 10⁴ peptide candidates, 5 × 10⁴ dipeptide connections, down to 2 × 10³ possible peptides and finally into a model with 10⁴ atoms. Protein residues are differentiated only as glycine, alanine, serine and valine, and complete side chains are not built at this stage.
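The size of this search space follows from a simple count. A chain of 1000 residues contains 999 peptide links, and with two outgoing candidates to choose from at each step the number of possible routes is 2 raised to the power 999. Written out as a worked estimate:

```latex
N_{\text{routes}} = 2^{999} = 10^{\,999 \log_{10} 2} \approx 10^{300.73} \approx 5.4 \times 10^{300}
```

This is a number with 301 decimal digits, which is why exhaustive enumeration is ruled out.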

For every residue in the traced polypeptide fragments, a side-chain type is assigned with an empirical probability, using either connectivity criteria between free atoms and the α-carbon position, or rotamer conformations and their correlation to the density map (Cohen et al., 2004). Given these guesses for the side chains and the known primary sequence, each polypeptide fragment is docked into sequence. Following this assignment, side chains are built in the best rotamer configuration.

After sequence docking, the missing parts of the model, typically extended and poorly ordered surface loops, can be identified. Using a distribution of five-Cα fragments derived from known structures, a number of structurally possible conformations are constructed. The procedure follows a pattern-recognition-based approach, where potential Cα candidates are looked for in non-negative density, followed by the construction of peptide planes and validation of dipeptide geometry using the Ramachandran plot. The loops are then scored based on the density and the best fitting ones are selected (Joosten et al., 2008).

In essence, ARP/wARP improves the convergence properties of map interpretation during the course of protein building and refinement in the form of partial stereochemical assignments of the model. The landscape of the objective function used in refinement thus undergoes topological changes, which minimize model bias and favour convergence to a global minimum. Taken together, the probabilistic identification of the peptide units, the naturally high conformational flexibility of the connections of the peptide units, and the limited quality of the X-ray data and/or phases introduce large enough errors to cause density breaks in the middle of the chains or result in density overlaps. The result of such a tracing is usually a set of several main-chain fragments. The less accurate the starting maps (i.e. initial phases) and the lower the resolution and quality of the X-ray data, the more breaks there will be in the tracing and the higher the number of peptide units that will be difficult to identify.

18.8.3.2. Recognition of secondary structural elements

At a resolution of 3.0 Å and worse, where electron-density maps lack atomic features, ARP/wARP uses a different algorithm to build an initial model. Instead of free atoms, sparse map grid points with ~1 Å spacing are selected as potential Cα atoms on the basis of their density. A statistical pattern-classification approach based on supervised learning is applied. Numerical features corresponding to the distances and angles within building units larger than single peptides that have well defined patterns in electron-density maps are exploited using a nonlinear discriminant as the classification engine. This parameterizes the distribution of the feature values by a suitable hyperbody (e.g. an ellipsoid) and takes the body isosurfaces as decision boundaries. The discriminant threshold is set so as to minimize the number of false negatives in the net classification result. The process scores and filters fragments, and then assembles longer fragments iteratively. These undergo the same filtering with an adapted discriminant function, in a hierarchical order to make this procedure computationally inexpensive. The result is a set of Cα fragments that conform to known α-helical or β-stranded conformations. Finally, peptide backbone and Cβ atoms are added, the chain fragments are refined in real space, and the most probable chain direction is selected.
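An ellipsoidal decision boundary of the kind mentioned above can be illustrated with a small Python sketch. It rests on several assumptions that are not taken from the source: the ellipsoid is axis-aligned (one variance per feature, no correlations), it is fitted to positive training examples only, and the threshold is set to the largest score among them so that no training positive is rejected. All function names are hypothetical.

```python
from statistics import fmean, pvariance

def score(features, centre, spread):
    """Squared, variance-scaled distance from the ellipsoid centre."""
    return sum((x - m) ** 2 / v for x, m, v in zip(features, centre, spread))

def fit_ellipsoid(positives):
    """Fit an axis-aligned ellipsoid to feature vectors of true fragments.

    positives -- list of equal-length feature tuples, e.g. (distance, angle);
                 every feature must vary, otherwise its variance is zero
    """
    columns = list(zip(*positives))
    centre = [fmean(c) for c in columns]
    spread = [pvariance(c) for c in columns]
    # Assumed rule: the isosurface through the worst-scoring training
    # positive becomes the boundary, giving no false negatives on this set.
    threshold = max(score(p, centre, spread) for p in positives)
    return centre, spread, threshold

def accept(features, model):
    """Classify a candidate as inside (True) or outside (False) the ellipsoid."""
    centre, spread, threshold = model
    return score(features, centre, spread) <= threshold
```

A candidate whose features lie close to the training distribution is accepted, while one far outside it is filtered out before the more expensive assembly of longer fragments.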

When the protein structure is (nearly) complete, smaller compounds (polynucleotides, ligands, cofactors and solvent) bound to the protein can be modelled in the difference electron-density map as outlined below.

18.8.3.3. Building polynucleotide fragments

Compared to crystals of proteins alone, crystals of nucleic acids in complex with proteins tend to have a less ordered crystalline arrangement, diffract less well and yield structures that exhibit higher atomic displacement parameters. As in the search for secondary structural elements, ARP/wARP uses sparse grid points of the electron density and identifies candidate planar bases and phosphate groups that are expected to scatter strongly and be easier to discriminate (Hattne & Lamzin, 2008). The numerical algorithm is based on the use of third-order moment invariants. The method does not rely on any particular spatial arrangement of nucleotides and allows for wide structural variance in single- or double-stranded nucleic acids, especially in RNA. The problem of polynucleotide chain tracing is formulated in a manner analogous to that for proteins, but instead of peptides the search is carried out for phosphate–sugar–phosphate groups using pre-computed 915-point density shapes. The nucleotide chains are assembled like those of proteins with the difference that tracing a protein main chain accounts for two conformational degrees of freedom per residue, while a nucleotide unit has five.

18.8.3.4. Building bound ligands

To facilitate ligand building, all regions of difference density that have approximately the same volume as the ligand are identified. Subsequently, several numeric features of the density region and its sparse representation are used to produce an ensemble of putative ligand structures to best fit the local density. All ligand model candidates are refined in real space to fit the density shape and the single best model is chosen (Evrard et al., 2007). A technique that compares the shapes of difference-electron-density blobs with the shape of the ligand to be built is used to distinguish compounds from a list (cocktail) of ligand candidates. The ligand that fits best is selected for further construction of the ensemble and subsequent restrained refinement.
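The first step, finding difference-density regions of roughly the ligand's volume, can be sketched as a connected-component search on the map grid. The Python code below is a minimal illustration under stated assumptions: the function names, the density threshold and the 30% volume tolerance are hypothetical, voxels are connected through faces only, and crystallographic symmetry and cell periodicity are ignored.

```python
from collections import deque

def density_blobs(diff_map, threshold):
    """Group grid points above a threshold into connected regions.

    diff_map  -- dict mapping integer grid indices (i, j, k) to difference density
    threshold -- assumed density cut-off defining a blob
    """
    strong = {v for v, rho in diff_map.items() if rho >= threshold}
    blobs = []
    while strong:
        seed = strong.pop()
        blob, queue = {seed}, deque([seed])
        while queue:  # breadth-first flood fill over face-sharing neighbours
            i, j, k = queue.popleft()
            for n in ((i + 1, j, k), (i - 1, j, k), (i, j + 1, k),
                      (i, j - 1, k), (i, j, k + 1), (i, j, k - 1)):
                if n in strong:
                    strong.remove(n)
                    blob.add(n)
                    queue.append(n)
        blobs.append(blob)
    return blobs

def ligand_sized(blobs, voxel_volume, ligand_volume, tolerance=0.3):
    """Keep blobs whose volume is within a fractional tolerance of the ligand's."""
    return [b for b in blobs
            if abs(len(b) * voxel_volume - ligand_volume) <= tolerance * ligand_volume]
```

Only the regions that pass this volume filter would go on to the more expensive shape comparison and real-space refinement described in the text.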

18.8.3.5. Solvent building

In this application, the protein model (together with bound metals, ligands or nucleic acids) is not rebuilt during refinement. Instead, only the solvent structure is continuously updated. Since ordered solvent in a typical crystal structure comprises about 10% of the model, improvement of solvent indirectly improves the density corresponding to the protein part.

18.8.4. Iterations

Building protein chains or solvent with ARP/wARP proceeds in an iterative fashion. When the quality of the (partially built) model is sufficiently high, the phases improve overall and result in an enhanced electron density where a more accurate and more complete model may be built. An important component within iterations is the model update, described above, where parts of the existing model located in weak density are removed and new atoms added where the density acquires pronounced features. The iterative approach can also be extended to the building of ligands and nucleic acids.

18.8.5. Applicability and requirements

Model building with ARP/wARP is dependent on the resolution and quality of the X-ray data. Highly complete and accurate protein models are obtained at a resolution of 2.7 Å or higher provided that reasonable starting phase estimates are available. Success has been reported at lower resolution, but it is strongly dependent on the initial phase quality. Recent advances in refinement using experimental phasing information (Skubák et al., 2009) proved essential for successful automated model building in a number of difficult test cases. The resolution of the X-ray data places similar limitations on ARP/wARP ligand building. Building secondary structural elements does not use free atoms and is applicable at a resolution of as low as 4.5 Å. We note that the quoted resolution limits relate to the current version of the software and are likely to change in the future.

References

Agarwal, R. C. & Isaacs, G. (1977). Method for obtaining a high resolution protein map starting from a low resolution map. Proc. Natl Acad. Sci. USA, 74, 2835–2839.
Cohen, S. X., Morris, R. J., Fernandez, F. J., Ben Jelloul, M., Kakaris, M., Parthasarathy, V., Lamzin, V. S., Kleywegt, G. J. & Perrakis, A. (2004). Towards complete validated models in the next generation of ARP/wARP. Acta Cryst. D60, 2222–2229.
Evrard, G. X., Langer, G. G., Perrakis, A. & Lamzin, V. S. (2007). Assessment of automatic ligand building in ARP/wARP. Acta Cryst. D63, 108–117.
Hattne, J. & Lamzin, V. S. (2008). Pattern-recognition-based detection of planar objects in three-dimensional electron-density maps. Acta Cryst. D64, 834–842.
Joosten, K., Cohen, S. X., Emsley, P., Mooij, W., Lamzin, V. S. & Perrakis, A. (2008). A knowledge-driven approach for crystallographic protein model completion. Acta Cryst. D64, 416–424.
Lamzin, V. S. & Wilson, K. S. (1993). Automated refinement of protein models. Acta Cryst. D49, 129–147.
Lamzin, V. S. & Wilson, K. S. (1997). Automated refinement for protein crystallography. Methods Enzymol. 277, 269–305.
Langer, G., Cohen, S. X., Lamzin, V. S. & Perrakis, A. (2008). Automated macromolecular model building for X-ray crystallography using ARP/wARP version 7. Nat. Protoc. 3, 1171–1179.
Morris, R. J., Perrakis, A. & Lamzin, V. S. (2002). ARP/wARP's model-building algorithms. I. The main chain. Acta Cryst. D58, 968–975.
Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Refinement of macromolecular structures by the maximum-likelihood method. Acta Cryst. D53, 240–255.
Perrakis, A., Morris, R. & Lamzin, V. S. (1999). Automated protein model building combined with iterative structure refinement. Nat. Struct. Biol. 6, 458–463.
Skubák, P., Murshudov, G. & Pannu, N. S. (2009). A multivariate likelihood SIRAS function for phasing and model refinement. Acta Cryst. D65, 1051–1061.







































