International Tables for Crystallography
Volume F
Crystallography of biological macromolecules
Edited by E. Arnold, D. M. Himmel and M. G. Rossmann

International Tables for Crystallography (2012). Vol. F, ch. 23.1, pp. 749-751

Chapter 23.1. Protein-fold classification

C. Orengo(a)* and J. Thornton(b)

(a) Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College, Gower Street, London WC1E 6BT, England, and (b) Biochemistry and Molecular Biology Department, University College London, Gower Street, London WC1E 6BT, England, and Department of Crystallography, Birkbeck College, University of London, Malet Street, London WC1E 7HX, England
Correspondence e-mail:

The classification of protein folds is discussed.

Since the first structure of myoglobin was solved in 1958, there has been an exponential growth in known protein structures, with about 73 000 structures currently deposited in the Protein Data Bank (PDB; Abola et al., 1987) and 500 or more solved each month. Structural genomics (SG) initiatives are now contributing significantly to this total: between 2005 and 2008 they accounted for some 18% of new structures and, now that high-throughput pipelines have been set up, some 600–800 of the structures deposited in the PDB each year (around two structures a day) come from SG centres around the world (Rajesh et al., 2009). When dealing with such large numbers it is necessary to organize the data in a manageable and biologically meaningful way. To this end, several structural classifications have been developed [SCOP (Murzin et al., 1995), CATH (Orengo et al., 1997), DALI (Holm & Sander, 1999), 3Dee (Barton, 1997), HOMSTRAD (Mizuguchi et al., 1998) and ENTREZ (Hogue et al., 1996)], differing in their methodology and in the degree of structural and functional annotation for the protein families identified.

Most public classification schemes have chosen to group proteins according to similarities in their domain structures, as the domain is generally considered to be the important evolutionary and folding unit. However, it can be difficult to identify domain boundaries either manually or using automatic algorithms and, although many methods are available, it has been shown that even the most reliable algorithms give the correct answer only about 80% of the time (Jones et al., 1998). Methods for recognizing domains are described in Chapter 23.2.

Most protocols used for clustering protein domain structures into families first identify similarities in their sequences. There are many well-established methods for doing this, most based on dynamic programming algorithms, and since proteins with sequence identities of 30% or more are known to adopt very similar folds (Sander & Schneider, 1991; Flores et al., 1993), it is relatively simple to cluster related proteins into evolutionary families on this basis. Very distant relatives (<20% sequence identity) are not easily identified by sequence alignment, but since structure is much more highly conserved during evolution, these relationships can be detected by comparing the 3D structures directly.
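The dynamic programming approach to sequence comparison mentioned above can be illustrated with a minimal sketch of global (Needleman-Wunsch-style) alignment followed by a percentage-identity calculation. The function names, the simple match/mismatch/gap scores and the choice to count identity over aligned, ungapped positions are all assumptions made for illustration; production methods use residue substitution matrices and more elaborate gap penalties.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns two gapped strings.

    The scoring values are illustrative assumptions, not taken from the text.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning residue a[i-1] with residue b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aln_a, aln_b):
    """Percentage of identical residues among aligned (ungapped) positions."""
    pairs = [(x, y) for x, y in zip(aln_a, aln_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)
```

A pair of sequences would then be compared with `percent_identity(*global_align(seq1, seq2))` and the result checked against a chosen cutoff such as 30%. Other definitions of the identity denominator (for example, the length of the shorter sequence) are also in common use and give different values.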

Various powerful algorithms have been developed for recognizing structurally related proteins (for reviews see Holm & Sander, 1994; Brown et al., 1996). These build on the rigid-body superposition methods of Rossmann & Argos (1975), which compare intermolecular distances after optimal translation and rotation of one protein structure onto the other. Other methods are based on the distance plots developed by Phillips (1970), which enable comparison of intramolecular distances between protein structures. In comparing very distantly related proteins, there are a number of problems which must be overcome. Insertions or deletions can obscure equivalent regions, though generally these appear in the loops between secondary structures. Residue substitutions can cause shifts in the orientations of the secondary structures in order to maintain optimal hydrophobic packing in the core.
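To make the idea of comparing intramolecular distances concrete, the following minimal sketch builds a distance matrix for each structure and averages the absolute differences between corresponding entries. It assumes that each structure is given as a list of (x, y, z) coordinates, one per residue (for example Cα positions), and that the residues have already been paired one-to-one; the function names are hypothetical and this is not the procedure of any published method.

```python
import math


def distance_matrix(coords):
    """All-against-all intramolecular distances for one structure."""
    return [[math.dist(p, q) for q in coords] for p in coords]


def distance_plot_difference(coords_a, coords_b):
    """Mean absolute difference between corresponding intramolecular distances.

    Assumes residue i of structure A is equivalent to residue i of structure B.
    """
    n = len(coords_a)
    if n != len(coords_b) or n < 2:
        raise ValueError("need two structures with the same number (>= 2) of residues")
    da = distance_matrix(coords_a)
    db = distance_matrix(coords_b)
    # Only the upper triangle is needed because the matrices are symmetric.
    total = sum(abs(da[i][j] - db[i][j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)
```

Because only internal distances are used, the measure is unchanged by rotating or translating either structure, so no prior superposition is required. This is the main practical difference from the rigid-body approach.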

A number of strategies have been developed for handling these problems. For example, some methods only consider secondary-structure elements, as these will contain fewer insertions. Artymiuk et al. (1989) represent secondary structures as linear vectors and use fast, efficient comparison algorithms based on graph theory. Others have adapted rigid-body methods to optimally superpose secondary structures, ignoring loops. Some methods chop the proteins being compared into fragments and then use various energy-minimization approaches (e.g. simulated annealing, Monte Carlo optimization) to link equivalent fragments in the two proteins. Such fragments can be identified by rigid-body superposition (Vriend & Sander, 1991) or, in the case of the DALI method (Holm & Sander, 1993), by comparing contact maps for hexapeptide fragments. Several groups have modified the dynamic programming algorithms designed to cope with insertions or deletions in sequence comparison in order to compare three-dimensional (3D) information (Taylor & Orengo, 1989; Sali & Blundell, 1990; Russell & Barton, 1993). For example, the SSAP method of Taylor & Orengo (1989) uses double dynamic programming to align residue structural environments defined by vectors between Cβ atoms, whilst in STAMP (Russell & Barton, 1993) dynamic programming is used in an iterative procedure, together with rigid-body superposition.

Once equivalent residues have been found, the degree of structural similarity between two proteins can be measured in a number of ways, though the most commonly used is the root-mean-square deviation (RMSD), which is effectively the average 'distance' between superposed residues. However, there is still no consensus about which thresholds might imply homologous proteins, fold similarity between analogous proteins, or common structural motifs. It is likely that this will become clearer as more structures are determined and the families become more highly populated, providing more information on tolerance to structural changes. These constraints will probably reflect functional requirements and/or kinetic or thermodynamic factors and will be specific to the family.
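As a minimal sketch of the RMSD calculation, the function below takes two lists of (x, y, z) coordinates for equivalent residues and returns the square root of the mean squared distance between them. It assumes that the structures have already been optimally superposed and that equivalences are one-to-one; the superposition step itself is not shown, and the function name is an illustrative choice.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired, already superposed atoms."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared distances between each pair of equivalent atoms.
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))
```

Because the distances are squared before averaging, the RMSD is never smaller than the plain mean distance and is dominated by the worst-fitting residues. It also tends to increase with the number of residues compared, which is one reason why a single threshold is hard to apply across families.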

Several groups (Holm & Sander, 1999; Hogue et al., 1996) attempt to determine the significance of a structural match by considering the distribution of scores for unrelated proteins and calculating a Z score. These approaches are very reliable for proteins possessing unusual structural characteristics but may not be as sensitive for those with highly recurring and common structural motifs. Other groups use empirical approaches (Orengo et al., 1997) to establish reasonable cutoffs for identifying homologues, though these approaches obviously suffer from the currently limited size of the structure data bank.
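The Z-score idea can be sketched generically as follows: a match score is expressed as the number of standard deviations by which it exceeds the mean score obtained for comparisons between unrelated proteins. The function name and the use of the sample standard deviation are assumptions; the statistics actually used by DALI and VAST are more elaborate and are not reproduced here.

```python
import statistics


def z_score(score, background_scores):
    """Standardize a match score against scores from unrelated protein pairs."""
    mean = statistics.mean(background_scores)
    spread = statistics.stdev(background_scores)  # sample standard deviation
    if spread == 0:
        raise ValueError("background scores show no variation")
    return (score - mean) / spread
```

A large positive value indicates a match that is unlikely to arise between unrelated proteins, provided the background distribution is appropriate for the structures being compared. For folds built from very common motifs the background scores are themselves high, which reduces the sensitivity of the test.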

Because of the individual strategies used to recognize relatives, the protein-structure classifications differ somewhat in their assignments. However, most classifications group proteins having highly similar sequences (≥30% sequence identity) into families. Subsequently, those families having highly similar structures and some other evidence of common ancestry [e.g. similar functions or some residual sequence identity (Orengo et al., 1999)] are merged into homologous superfamilies. Families adopting similar folds, but where there is no other evidence to suggest divergent evolution, are usually put into the same fold group but are described as analogous proteins, since their similarity may simply reflect the physical and/or chemical constraints on protein folding.
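The first step, grouping proteins into families at a sequence-identity cutoff, can be sketched with single-linkage clustering. The text does not state which clustering procedure each classification uses, so the linkage rule, the function name and the input format (a dictionary mapping protein-name pairs to percentage identities) are all assumptions made for illustration.

```python
def cluster_families(names, identities, threshold=30.0):
    """Single-linkage clustering of proteins by pairwise sequence identity.

    identities maps (name_a, name_b) tuples to percentage identity; pairs
    that are absent are treated as unrelated.
    """
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), identity in identities.items():
        if identity >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())
```

With single linkage, two proteins below the cutoff can still end up in the same family if each is sufficiently similar to a common intermediate. A stricter rule, such as requiring every pair in a family to exceed the cutoff, would produce smaller families.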

Recent analysis of CATH superfamilies has shown that some of the most highly populated superfamilies in CATH are structurally very diverse. Typically, domain relatives possess a common 'structural core', but secondary-structure embellishments to this core can be considerable, so that homologous structures can vary in size by threefold or more and can have rather different global folds (Reeves et al., 2006; Greene et al., 2007; Cuff et al., 2009). This poses some challenges to hierarchical domain structure classifications like SCOP and CATH. However, the evolutionarily conserved cores of relatives within a domain superfamily usually possess highly similar folds.

SCOP and CATH are currently the largest of the public classifications, each with over 1900 homologous superfamilies. In SCOP (Murzin et al., 1995), these families have been very carefully manually validated using biochemical information and by consideration of special structural features (e.g. rare β-bulges, left-handed helical connections) that may constitute evolutionary fingerprints; in CATH, homologues are validated both manually and automatically (Orengo et al., 1997). Other databases [HOMSTRAD (Mizuguchi et al., 1998); 3Dee (Barton, 1997)] contain similar groupings of protein structures and provide multiple structural alignments for each family, annotated according to residue properties.

Several studies have suggested a limited number of folds available to proteins, with estimates ranging from one thousand to several thousand (Chothia, 1993; Orengo et al., 1994), and this will mean an increasing number of analogous protein pairs being identified as the structural genomics initiatives continue. Recent analyses of the population of different fold families have revealed that some folds are more highly populated, perhaps because they fold more easily or are more stable. In the CATH database, ten favoured folds, described as superfolds, comprised very regular, layered architectures and were shown to contain a higher proportion of favoured motifs (e.g. Greek key, βα motif) than non-superfold structures. Similarly, analysis of SCOP (Brenner et al., 1996) revealed some 40 or so frequently occurring domains (FODS), which included the superfolds. About one-third of all non-homologous structures (<25% sequence identity to each other) adopt one of these folds.

Some groups avoid explicit definition of protein families. The DALI database of Holm & Sander (1999) is a neighbourhood scheme listing all related proteins for a given protein structure. Neighbours are identified using the DALI structure-comparison algorithm (Holm & Sander, 1993) and range from the most highly similar, homologous proteins to those sharing only motif similarities. The ENTREZ database (Hogue et al., 1996) provides a similar scheme, generated by the VAST structure-comparison method of Gibrat et al. (1997). Both allow the user to assess significance and draw their own inferences regarding evolutionary relationships. More recently, the DALI domain database (DDD) (Holm & Sander, 1998) has provided clusters of related proteins based on calculated Z scores.

Most available databases further classify the fold groups on the basis of class. These agree with the major classes recognized by Levitt & Chothia (1976) (mainly α, mainly β, α/β, α + β), although in the CATH database the α/β and α + β classes have been merged (Fig. 23.1.1). CATH also describes an intermediate architecture level between class and fold group (Orengo et al., 1997). This refers to the arrangement of secondary-structure elements in 3D, regardless of their connectivity, and so defines the shape (e.g. barrel, sandwich, propeller) (Fig. 23.1.2). There are currently 40 different architectures in CATH, with the simple barrel and sandwich shapes accounting for about 60% of the non-homologous structures.
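The nested class, architecture, fold (topology) and homologous-superfamily levels lend themselves to a simple hierarchical key. The sketch below assumes a dotted four-number code of the form C.A.T.H, for example `3.40.50.720`; this notation and the integer class labels are not described in the text above and are used here only as an illustrative assumption, as are the type and function names.

```python
from typing import NamedTuple, Optional

# Assumed integer labels for the classes named in the text.
CLASS_NAMES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "alpha-beta",
    4: "few secondary structures",
}

LEVELS = ("class", "architecture", "topology", "homologous superfamily")


class CathNode(NamedTuple):
    cls: int
    architecture: int
    topology: int
    superfamily: int


def parse_cath_code(code: str) -> CathNode:
    """Split a dotted C.A.T.H code into its four integer levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError("expected four dot-separated numbers")
    return CathNode(*(int(p) for p in parts))


def deepest_shared_level(a: CathNode, b: CathNode) -> Optional[str]:
    """Name of the deepest hierarchy level at which two domains still agree."""
    shared = None
    for level, x, y in zip(LEVELS, a, b):
        if x != y:
            break
        shared = level
    return shared
```

Two domains whose codes agree through the third number but differ in the fourth would, under this scheme, be reported as sharing a fold without evidence of homology, which corresponds to the analogous relationships discussed earlier.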

Figure 23.1.1. Schematic representation of the (C)lass, (A)rchitecture and (T)opology/fold levels in the CATH database.

Figure 23.1.2. 'CATHerine wheel' plot showing the distribution of non-homologous structures [i.e. a single representative from each homologous superfamily (H level) in CATH] amongst the different classes (C), architectures (A) and fold families (T) in the CATH database. Protein classes are shown coloured as red (mainly α), green (mainly β), yellow (α–β) and orange (few secondary structures). Within each class, the angle subtended for a given segment reflects the proportion of structures within the identified architectures (inner circle) or fold groups (outer circle). MOLSCRIPT (Kraulis, 1991) illustrations are shown for representative examples from the superfold families.


Abola, E. E., Bernstein, F. C., Bryant, S. H., Koetzle, T. F. & Weng, J. (1987). Protein Data Bank. In Crystallographic Databases – Information Content, Software Systems, Scientific Applications, edited by F. H. Allen, G. Bergerhoff & R. Sievers, pp. 107–132.
Artymiuk, P. J., Mitchell, E. M., Rice, D. W. & Willett, P. (1989). Searching techniques for databases of protein structures. J. Inf. Sci. 15, 287–298.
Barton, G. J. (1997). 3Dee: database of protein domain definitions.
Brenner, S. E., Chothia, C., Hubbard, T. J. & Murzin, A. G. (1996). Understanding protein structure. Using SCOP for fold interpretation. Methods Enzymol. 266, 635–643.
Brown, N. P., Orengo, C. A. & Taylor, W. R. (1996). A protein structure comparison methodology. Comput. Chem. 20, 359–380.
Chothia, C. (1993). One thousand families for the molecular biologist. Nature (London), 357, 543–544.
Cuff, A. L., Sillitoe, I., Lewis, T., Redfern, O. C., Garratt, R., Thornton, J. & Orengo, C. A. (2009). The CATH classification revisited – architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res. 37 (database issue), D310–D314.
Flores, T. P., Orengo, C. A. & Thornton, J. M. (1993). Conformational characteristics in structurally similar protein pairs. Protein Sci. 7, 31–37.
Gibrat, J. F., Madej, T., Spouge, J. L. & Bryant, S. H. (1997). The VAST protein structure comparison method. Biophys. J. 72, MP298.
Greene, L. H., Lewis, T. E., Addou, S., Cuff, A., Dallman, T., Dibley, M., Redfern, O., Peral, F., Nambudiry, R., Reid, A., Sillitoe, I., Yeats, C., Thornton, J. & Orengo, C. A. (2007). The CATH domain database: new protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Res. 35 (database issue), D291–D297.
Hogue, C. W., Ohkawa, H. & Bryant, S. H. (1996). A dynamic look at structures: WWW-Entrez and the molecular modelling database. Trends Biochem. Sci. 21, 226–229.
Holm, L. & Sander, C. (1993). Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123–138.
Holm, L. & Sander, C. (1994). Searching protein structure databases has come of age. Proteins, 19, 165–173.
Holm, L. & Sander, C. (1998). Dictionary of recurrent domains in protein structures. Proteins, 33, 88–96.
Holm, L. & Sander, C. (1999). Protein folds and families: sequence and structure alignments. Nucleic Acids Res. 27, 244–247.
Jones, S., Stewart, M., Michie, A. D., Swindells, M. B., Orengo, C. A. & Thornton, J. M. (1998). Domain assignment for protein structures using a consensus approach: characterisation and analysis. Protein Sci. 7, 233–242.
Kraulis, P. J. (1991). MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J. Appl. Cryst. 24, 946–950.
Levitt, M. & Chothia, C. (1976). Structural patterns in globular proteins. Nature (London), 261, 552–558.
Mizuguchi, K., Deane, C. M., Blundell, T. L. & Overington, J. P. (1998). HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7, 2469–2471.
Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
Orengo, C. A., Jones, D. T., Taylor, W. & Thornton, J. M. (1994). Protein superfamilies and domain superfolds. Nature (London), 372, 631–634.
Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997). CATH – a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
Orengo, C. A., Pearl, F. M. G., Bray, J. E., Todd, A. E., Martin, A. C., LoConte, L. & Thornton, J. M. (1999). The CATH database provides insights into protein structure/function relationships. Nucleic Acids Res. 27, 275–279.
Phillips, D. C. (1970). British Biochemistry, Past and Present, p. 11. London Biochemistry Society Symposium. Academic Press.
Rajesh, N., Liu, J., Soong, T., Acton, T. B., Everett, J. K., Kouranov, A., Fiser, A., Godzik, A., Jaroszewski, L., Orengo, C., Montelione, G. T. & Rost, B. (2009). Structural genomics is the largest contributor of novel structural leverage. J. Struct. Funct. Genomics, 10, 181–191.
Reeves, G., Dallman, T., Redfern, O., Akpor, A. & Orengo, C. A. (2006). Structural diversity of domain superfamilies in the CATH database. J. Mol. Biol. 360, 725–741.
Rossmann, M. G. & Argos, P. (1975). A comparison of the heme binding pocket in globins and cytochrome b5. J. Biol. Chem. 250, 7525–7532.
Russell, R. B. & Barton, G. J. (1993). Multiple protein sequence alignment from tertiary structure comparisons. Assignments of global and residue level confidences. Proteins, 14, 309–323.
Sali, A. & Blundell, T. L. (1990). The definition of general topological equivalences in proteins: a procedure involving comparison of properties and relationships through simulated annealing and dynamic programming. J. Mol. Biol. 212, 403–428.
Sander, C. & Schneider, R. (1991). Database of homology-derived protein structures and structural meaning of sequence alignments. Proteins, 9, 56–68.
Taylor, W. R. & Orengo, C. A. (1989). Protein structure alignment. J. Mol. Biol. 208, 1–22.
Vriend, G. & Sander, C. (1991). Detection of common three-dimensional substructures in proteins. Proteins, 11, 552–558.
