International
Tables for
Crystallography
Volume F
Crystallography of biological macromolecules
Edited by E. Arnold, D. M. Himmel and M. G. Rossmann

International Tables for Crystallography (2012). Vol. F, ch. 21.4, pp. 684-687
https://doi.org/10.1107/97809553602060000882

Chapter 21.4. PROCHECK: validation of protein-structure coordinates

R. A. Laskowski (a)*, M. W. MacArthur (b) and J. M. Thornton (c)

(a) Department of Crystallography, Birkbeck College, University of London, Malet Street, London WC1E 7HX, England
(b) Biochemistry and Molecular Biology Department, University College London, Gower Street, London WC1E 6BT, England
(c) Biochemistry and Molecular Biology Department, University College London, Gower Street, London WC1E 6BT, England, and Department of Crystallography, Birkbeck College, University of London, Malet Street, London WC1E 7HX, England
* Correspondence e-mail: roman@ebi.ac.uk

The structure-validation program PROCHECK is described.

21.4.1. Introduction

As in all scientific measurements, the parameters that result from a macromolecular structure determination by X-ray crystallography (e.g. atomic coordinates and B factors) will have associated uncertainties. These arise not only from systematic and random errors in the experimental data but also from the interpretation of those data. Currently, the uncertainties cannot easily be estimated for macromolecular structures because the calculations required are computer- and memory-intensive (Tickle et al., 1998). Thus, more indirect methods are necessary to assess the reliability of different parts of the model, as well as the reliability of the model as a whole. Among these methods are those that rely on checking only the stereochemical and geometrical properties of the model itself, without reference to the experimental data (MacArthur et al., 1994; Laskowski et al., 1998). Here we describe PROCHECK (Laskowski et al., 1993), which is one of these structure-validation methods.

The PROCHECK program computes a number of stereochemical parameters for the given protein model and compares them with 'ideal' values obtained from a database of well refined high-resolution protein structures in the Protein Data Bank (PDB; Bernstein et al., 1977). The results of these checks are output in easy-to-understand coloured plots in PostScript format (Adobe Systems Inc., 1985). Significant deviations from the derived standards of normality are highlighted as being 'unusual'.

The program's primary use is during the refinement of a protein structure; the highlighted regions can direct the crystallographer to parts of the structure that may have problems and which may need attention. It should be noted that outliers may just be outliers; they are not necessarily errors. Unusual features may have a reasonable explanation, such as distortions due to ligand binding in the protein's active site. However, if there are many oddities throughout the model, this could signify that there is something wrong with it as a whole. Conversely, if a model has good stereochemistry, this alone is not proof that it is a good model of the protein structure.

Because the program requires only the three-dimensional atomic coordinates of the structure, it can check the overall 'quality' of any model structure, whether derived experimentally by crystallography or NMR, or built by homology modelling. In the case of NMR-derived structures, it is useful to compare the protein geometry across the whole ensemble. An extended version of PROCHECK, called PROCHECK-NMR, is available for this purpose (Laskowski et al., 1996), but will not be described here.

Note that PROCHECK only examines the geometrical properties of protein molecules; it ignores DNA/RNA and other non-protein molecules in the structure, except in so far as checking that the non-bonded contacts these make with the protein do not violate a fixed distance criterion.

21.4.2. The program

PROCHECK is in fact a suite of separate Fortran and C programs which are run successively via a shell script. The programs first 'clean up' the input PDB file, relabelling certain side-chain atoms according to the IUPAC naming conventions (IUPAC–IUB Commission on Biochemical Nomenclature, 1970), then calculate all the protein's stereochemical parameters to compare them against the norms, and finally generate the PostScript output and a detailed residue-by-residue listing. Hydrogen atoms and atoms with zero occupancy are omitted from the analyses and, where atoms are found in alternate conformations, only the highest-occupancy conformation is retained.
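The atom-selection rules just described (drop hydrogens, drop zero-occupancy atoms, keep only the highest-occupancy alternate conformation) can be illustrated with a minimal Python sketch. This is not the PROCHECK source, which is written in Fortran and C; the function name `clean_atoms` is hypothetical, the fixed column positions are those of the standard PDB coordinate record, and the fallback used to spot hydrogens when the element column is blank is an assumption.

```python
def clean_atoms(pdb_lines):
    """Return ATOM/HETATM records after a simple clean-up pass.

    Hypothetical helper, not part of PROCHECK. It drops hydrogens and
    zero-occupancy atoms, and keeps one record per atom position: the
    alternate conformation with the highest occupancy.
    """
    kept = {}  # (chain, residue number + insertion code, atom name) -> (occ, line)
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        name = line[12:16].strip()
        element = line[76:78].strip().upper()
        if not element:
            # Assumption: with no element column, guess from the atom name.
            element = name.lstrip("0123456789")[:1].upper()
        if element == "H":
            continue
        try:
            occupancy = float(line[54:60])
        except ValueError:
            occupancy = 1.0  # assumption: treat a missing occupancy as full
        if occupancy == 0.0:
            continue
        key = (line[21], line[22:27], name)
        # Keep the first record seen unless a later one has higher occupancy.
        if key not in kept or occupancy > kept[key][0]:
            kept[key] = (occupancy, line)
    return [line for _, line in kept.values()]
```

Calling `clean_atoms(open("model.pdb"))` on a coordinate file (the file name is only an example) would return the surviving records in their original order.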

The source code for all the programs is available at http://www.biochem.ucl.ac.uk/~roman/procheck/procheck.html. It has also been incorporated into the CCP4 suite of programs (Collaborative Computational Project, Number 4, 1994) at http://www.dl.ac.uk/CCP/CCP4/main.html, and can be run directly via the web from the Biotech Validation Server at http://biotech.embl-ebi.ac.uk:8400/.

21.4.3. The parameters

Table 21.4.3.1 shows the principal stereochemical parameters used by PROCHECK, based on the analysis of Morris et al. (1992), who looked for measures that are good indicators of protein quality. The table shows the original parameters together with a more up-to-date set derived from a more recent data set including a number of atomic resolution structures (i.e. those solved to 1.4 Å resolution or better).

Table 21.4.3.1. Summary of expected values for stereochemical parameters in well resolved structures

Parameter                                Old                New
% ϕ, ψ in core                           >90.0%             >90.0%
χ1 gauche−                               +64.1 ± 15.7°      +63.2 ± 11.4°
χ1 trans                                 +183.6 ± 16.8°     +182.7 ± 13.1°
χ1 gauche+                               −66.7 ± 15.0°      −66.0 ± 11.2°
χ1 pooled standard deviation             ±15.7°             ±11.8°
χ2 trans                                 +177.4 ± 18.5°     +177.2 ± 15.1°
χ3 S–S bridge (left-handed)              −85.8 ± 10.7°      −84.8 ± 8.5°
χ3 S–S bridge (right-handed)             +96.8 ± 14.8°      +92.2 ± 10.8°
Proline ϕ                                −65.4 ± 11.2°      −64.6 ± 10.2°
α-Helix ϕ                                −65.3 ± 11.9°      −65.5 ± 11.1°
α-Helix ψ                                −39.4 ± 11.3°      −39.0 ± 9.8°
ω trans                                  +179.6 ± 4.7°      +179.5 ± 6.0°
Cα–N–C′–Cβ (ζ) virtual torsion angle     +33.9 ± 3.5°       +34.2 ± 2.6°

For the most part, the parameters given in Table 21.4.3.1 are not included in standard refinement procedures and so are less likely to be biased by them. They can thus provide a largely independent and unbiased validation check on the geometry of each residue and hence point to regions of the protein structure that are genuinely unusual.
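As an illustration of how a single observed value might be compared with the tabulated targets, the following Python sketch expresses an angle's deviation from a 'New' mean in Table 21.4.3.1 in units of the tabulated standard deviation. It is a sketch under stated assumptions rather than the program's own code: the dictionary keys and function names are hypothetical, only a few table rows are included, and the default threshold of 2.0 standard deviations is the one quoted in the caption to Fig. 21.4.6.1. Because torsion angles are periodic, the difference is wrapped into the range −180° to +180° before scaling.

```python
# 'New' means and standard deviations (degrees) copied from Table 21.4.3.1.
# The key names are hypothetical labels chosen for this sketch.
NEW_TARGETS = {
    "chi1_gauche_minus": (63.2, 11.4),
    "chi1_trans": (182.7, 13.1),
    "chi1_gauche_plus": (-66.0, 11.2),
    "proline_phi": (-64.6, 10.2),
    "helix_phi": (-65.5, 11.1),
    "helix_psi": (-39.0, 9.8),
    "omega_trans": (179.5, 6.0),
}


def angular_difference(observed, target):
    """Signed difference observed - target, wrapped into [-180, 180)."""
    return (observed - target + 180.0) % 360.0 - 180.0


def deviation_in_sd(parameter, observed):
    """Deviation from the tabulated mean, in standard deviations."""
    mean, sd = NEW_TARGETS[parameter]
    return angular_difference(observed, mean) / sd


def is_unusual(parameter, observed, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    return abs(deviation_in_sd(parameter, observed)) > threshold
```

For example, an ω angle of 160° lies 19.5° from the tabulated mean of 179.5°, which is 3.25 standard deviations, so `is_unusual("omega_trans", 160.0)` returns `True`.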

As more atomic resolution structures become available (Dauter et al., 1997), these parameters will be improved. Because of their high data-to-parameter ratio, such structures can be refined using less strict restraints, and hence contain a smaller degree of bias in their geometrical properties – at least for the well ordered parts of the model. Such information moves us a step closer to an understanding of the 'true' geometrical and conformational properties of proteins in general and, one day, the target parameters will be derived exclusively from such structures.

PROCHECK also checks main-chain bond lengths and bond angles against the 'ideal' values given by the Engh & Huber (1991) analysis of small-molecule structures in the Cambridge Structural Database (CSD) (Allen et al., 1979). Unlike the above parameters, these geometrical properties are usually restrained during refinement, and, furthermore, the Engh & Huber (1991) targets are the ones most commonly applied. Thus analyses of these values merely reflect the refinement protocol used and do not provide meaningful indicators of local or overall errors. However, the plots clearly show any wayward outliers, which can nevertheless indicate problem regions in the structure.

21.4.4. Which parameters are best?

Possibly the most telling and useful of the 'quality' indicators for a protein model is the Ramachandran plot of residue ϕ–ψ torsion angles. This can often detect gross errors in the structure (Kleywegt & Jones, 1996a,b). In the original Ramachandran plot (Ramachandran et al., 1963; Ramakrishnan & Ramachandran, 1965), the 'allowed' regions were defined on the basis of simulations of dipeptides. In the PROCHECK version, the different regions of the plot are defined on the basis of how densely they are populated with data points taken from a database of well refined protein structures. The regions are: core, allowed, generously allowed and disallowed.
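Each point on a Ramachandran plot is a pair of torsion angles computed from four consecutive backbone atoms: ϕ from C(i−1), N(i), Cα(i), C(i), and ψ from N(i), Cα(i), C(i), N(i+1). The following Python sketch shows one standard way of calculating such a torsion angle from Cartesian coordinates. The function name `dihedral` is hypothetical and the code is illustrative only; it is not taken from PROCHECK. The sign follows the usual convention in which a clockwise rotation of the front bond onto the rear bond, viewed along the central bond, is positive.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees, in (-180, 180], about the p1-p2 bond."""
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)  # central bond
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal to the plane p0, p1, p2
    n2 = _cross(b1, b2)  # normal to the plane p1, p2, p3
    x = _dot(n1, n2)
    y = _dot(b1, _cross(n1, n2)) / math.sqrt(_dot(b1, b1))
    return math.degrees(math.atan2(y, x))
```

With coordinates given as (x, y, z) tuples, ϕ for residue i would be `dihedral(C_prev, N, CA, C)` and ψ would be `dihedral(N, CA, C, N_next)`, where the argument names stand for the atoms listed above.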

The 'core' regions are particularly important; the points on the plot tend to converge towards these regions, and to cluster more tightly within them, as one goes from structures solved at low resolution to those solved at high resolution (Morris et al., 1992). This trend has recently been confirmed by Wilson et al. (1998), who looked at the case of atomic resolution structures. It has also been analysed in terms of 'attractors' at the most favourable regions of the plot; as the resolution improves, so the points are drawn towards these attractors (Walther & Cohen, 1999).

Fig. 21.4.4.1 shows the original PROCHECK Ramachandran plot and a more up-to-date version. The original was based on all 462 structures known at that time (1989/90), while the more recent one, generated in 1998, is based on 1128 non-identical (i.e. having a sequence identity <95%) structures. It can be seen that the second plot has core regions which are much tighter than the original, and this is primarily due to the increase in the number of very high resolution structures giving a more accurate representation of the tight clustering in the most favourable regions.

Figure 21.4.4.1. PROCHECK Ramachandran plots showing the different regions, shaded according to how 'favourable' the ϕ–ψ combinations are, for (a) the original version of the program (1992) and (b) an updated version based on a more recent data set (1998) including more high-resolution structures. The 'core' and other favourable regions of the plot are more tightly compressed in the new version, with the white, disfavoured regions occupying more of the space.

Another parameter that seems to be a particularly sensitive measure of quality is the standard uncertainty (s.u.) of the χ torsion angles. Morris et al. (1992) found that the average s.u. values of a protein's χ1 and χ2 torsion angles are well correlated with the resolution at which the protein structure was solved. Although the data set was a fairly small one, the conclusion was borne out when tested on a larger set of more recent structures, including some solved to atomic resolution (Wilson et al., 1998). This measure, however, cannot be relied on where side-chain conformations are either restrained or heavily influenced by the use of rotamer libraries.
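To make the idea of a χ1 spread concrete, the sketch below computes a root-mean-square deviation of a set of χ1 angles from the three preferred values listed as 'New' in Table 21.4.3.1. Two points are assumptions made for illustration and may differ from what PROCHECK or Morris et al. (1992) actually do: each angle is assigned to whichever tabulated centre is nearest, and the spread is taken about those tabulated centres rather than about means recomputed from the model. The function name is hypothetical.

```python
import math

# Preferred chi1 values (degrees) from the 'New' column of Table 21.4.3.1:
# gauche-, trans and gauche+ respectively.
CHI1_CENTRES = (63.2, 182.7, -66.0)


def _wrap(delta):
    """Wrap an angular difference into [-180, 180)."""
    return (delta + 180.0) % 360.0 - 180.0


def chi1_rms_deviation(chi1_values):
    """RMS deviation of chi1 angles from their nearest preferred value.

    Assumption for this sketch: nearest-centre assignment, with the
    spread measured about the tabulated centres.
    """
    squared = []
    for chi in chi1_values:
        nearest = min((_wrap(chi - centre) for centre in CHI1_CENTRES), key=abs)
        squared.append(nearest * nearest)
    if not squared:
        raise ValueError("no chi1 values supplied")
    return math.sqrt(sum(squared) / len(squared))
```

A model whose χ1 angles each sit 10° from the nearest preferred value would give a result of 10°, which can be set beside the pooled standard deviation of ±11.8° in the table.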

21.4.5. Input

The primary input to PROCHECK is the file containing the 3D coordinates of the protein structure to be processed. The file is required to be in PDB format. An additional input file is the parameter file that governs which plots are to be generated and deals with certain aspects of their appearance.

21.4.6. Output produced

The output of the program consists of a number of PostScript plots, together with a full listing of the individual parameter values for each residue, with any unusual geometrical properties highlighted. The listing also provides summaries for the protein as a whole. Figs. 21.4.6.1 and 21.4.6.2 show parts of one of the PostScript plots generated, showing the variation of various residue properties along the length of the protein chain. Unusual regions, which are highlighted on these plots, may require further investigation by the crystallographer.

Figure 21.4.6.1. Two of the residue-property plots generated by PROCHECK. The plots shown here are (a) the absolute deviation from the mean of the χ1 torsion angle (excluding prolines) and (b) the absolute deviation from the mean of the ω torsion angle. Usually, three such plots are shown per page and can be selected from a set of 14 possible plots. On each graph, unusual values (usually those more than 2.0 standard deviations away from the 'ideal' mean value) are highlighted.

Figure 21.4.6.2. Schematic plots of various residue-by-residue properties, showing (d) the protein secondary structure, with the shading behind it giving an approximation to each residue's accessibility, the darker the shading the more buried the residue; (e) the protein sequence plus markers identifying the region of the Ramachandran plot in which the residue is located; (f) a histogram of asterisks and plus signs showing each residue's maximum deviation from one of the ideal values, as shown on the residue-by-residue listing; and (g) the residue 'G factor' values for various properties, where the darker the square the more 'unusual' the property.

21.4.7. Other validation tools

PROCHECK is merely one of a number of validation tools that are freely available, some of which are mentioned elsewhere in this volume. The best known are WHATCHECK (Hooft et al., 1996), PROVE (Pontius et al., 1996), SQUID (Oldfield, 1992) and VERIFY3D (Eisenberg et al., 1997). Tools such as OOPS (Kleywegt & Jones, 1996b) or the X-build validation in QUANTA (MSI, 1997) provide standard tests on the geometry of a structure and provide lists of residues with unexpected features, which make it easy to check electron-density maps at suspect points.

Acknowledgements

Significant contributors to the programs in the PROCHECK suite include David K. Smith, E. Gail Hutchinson, David T. Jones, J. Antoon C. Rullmann, A. Louise Morris and Dorica Naylor. Part of the development work was funded by a grant from the EU Framework IV Biotechnology programme, contract CT96–0189.

References

Adobe Systems Inc. (1985). PostScript Language Reference Manual. Reading, MA: Addison-Wesley.
Allen, F. H., Bellard, S., Brice, M. D., Cartwright, B. A., Doubleday, A., Higgs, H., Hummelink, T., Hummelink-Peters, B. G., Kennard, O., Motherwell, W. D. S., Rodgers, J. R. & Watson, D. G. (1979). The Cambridge Crystallographic Data Centre: computer-based search, retrieval, analysis and display of information. Acta Cryst. B35, 2331–2339.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F. Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535–542.
Collaborative Computational Project, Number 4 (1994). The CCP4 suite: programs for protein crystallography. Acta Cryst. D50, 760–763.
Dauter, Z., Lamzin, V. S. & Wilson, K. S. (1997). The benefits of atomic resolution. Curr. Opin. Struct. Biol. 7, 681–688.
Eisenberg, D., Lüthy, R. & Bowie, J. U. (1997). VERIFY3D: assessment of protein models with three-dimensional profiles. Methods Enzymol. 277, 396–404.
Engh, R. A. & Huber, R. (1991). Accurate bond and angle parameters for X-ray protein structure refinement. Acta Cryst. A47, 392–400.
Hooft, R. W. W., Sander, C., Vriend, G. & Abola, E. E. (1996). Errors in protein structures. Nature (London), 381, 272.
IUPAC–IUB Commission on Biochemical Nomenclature (1970). Abbreviations and symbols for the description of the conformation of polypeptide chains. J. Mol. Biol. 52, 1–17.
Kleywegt, G. J. & Jones, T. A. (1996a). Phi/psi-chology: Ramachandran revisited. Structure, 4, 1395–1400.
Kleywegt, G. J. & Jones, T. A. (1996b). Efficient rebuilding of protein structures. Acta Cryst. D52, 829–832.
Laskowski, R. A., MacArthur, M. W., Moss, D. S. & Thornton, J. M. (1993). PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Cryst. 26, 283–291.
Laskowski, R. A., MacArthur, M. W. & Thornton, J. M. (1998). Validation of protein models derived from experiment. Curr. Opin. Struct. Biol. 8, 631–639.
Laskowski, R. A., Rullmann, J. A. C., MacArthur, M. W., Kaptein, R. & Thornton, J. M. (1996). AQUA and PROCHECK-NMR: programs for checking the quality of protein structures solved by NMR. J. Biomol. Nucl. Magn. Reson. 8, 477–486.
MacArthur, M. W., Laskowski, R. A. & Thornton, J. M. (1994). Knowledge-based validation of protein structure coordinates derived by X-ray crystallography and NMR spectroscopy. Curr. Opin. Struct. Biol. 4, 731–737.
Morris, A. L., MacArthur, M. W., Hutchinson, E. G. & Thornton, J. M. (1992). Stereochemical quality of protein structure coordinates. Proteins, 12, 345–364.
MSI (1997). QUANTA. MSI, 9685 Scranton Road, San Diego, CA 92121–3752, USA.
Oldfield, T. J. (1992). SQUID: a program for the analysis and display of data from crystallography and molecular dynamics. J. Mol. Graphics, 10, 247–252.
Pontius, J., Richelle, J. & Wodak, S. (1996). Deviations from standard atomic volumes as a quality measure for protein crystal structures. J. Mol. Biol. 264, 121–136.
Ramachandran, G. N., Ramakrishnan, C. & Sasisekharan, V. (1963). Stereochemistry of polypeptide chain configurations. J. Mol. Biol. 7, 95–99.
Ramakrishnan, C. & Ramachandran, G. N. (1965). Stereochemical criteria for polypeptide and protein chain conformations. II. Allowed conformations for a pair of peptide units. Biophys. J. 5, 909–933.
Tickle, I. J., Laskowski, R. A. & Moss, D. S. (1998). Error estimates of protein structure coordinates and deviations from standard geometry by full-matrix refinement of γB- and βB2-crystallin. Acta Cryst. D54, 243–252.
Walther, D. & Cohen, F. E. (1999). Conformational attractors on the Ramachandran map. Acta Cryst. D55, 506–517.
Wilson, K. S., Butterworth, S., Dauter, Z., Lamzin, V. S., Walsh, M., Wodak, S., Pontius, J., Richelle, J., Vaguine, A., Sander, C., Hooft, R. W. W., Vriend, G., Thornton, J. M., Laskowski, R. A., MacArthur, M. W., Dodson, E. J., Murshudov, G., Oldfield, T. J., Kaptein, R. & Rullmann, J. A. C. (1998). Who checks the checkers? Four validation tools applied to eight atomic resolution structures. J. Mol. Biol. 276, 417–436.