International Tables for Crystallography
Volume F
Crystallography of biological macromolecules
Edited by E. Arnold, D. M. Himmel and M. G. Rossmann

International Tables for Crystallography (2012). Vol. F, ch. 23.2, pp. 752-754

Chapter 23.2. Locating domains in three-dimensional structures

L. Holm (a)* and C. Sander (b)

(a) EMBL–EBI, Cambridge CB10 1SD, England, and (b) MIT Center for Genome Research, One Kendall Square, Cambridge, MA 02139, USA
Correspondence e-mail:

The assignment of protein domains from three-dimensional structure is critically important in understanding protein evolution and function. Domains are quasi-independent substructures that are thought to fold autonomously, to carry specific molecular functions, to move relative to each other as semi-rigid bodies and to speed the evolution of new functions by recombination. The concepts underlying computational methods for locating domains in three-dimensional structures are presented. Early algorithms focused on physical criteria to identify compact subunits. With the growth of the structural database, the focus has shifted to methods for identifying recurrent substructures that can form the basis for a consistent protein-structure classification.

23.2.1. Introduction

Modular design is beneficial in many areas of life, including computer programming, manufacturing and even protein folding.

Protein-structure analysis has long operated with the notion of domains, i.e., dividing large structures into quasi-independent substructures or modules (Wetlaufer, 1973; Bork, 1992). In various contexts, these substructures are thought to fold autonomously, to carry specific molecular functions such as binding or catalysis, to move relative to each other as semi-rigid bodies and to speed the evolution of new functions by recombination (see the figure).


Figure. The structure of diphtheria toxin (Bennett & Eisenberg, 1994) beautifully illustrates domains as structural, functional and evolutionary units. Structurally, note the compact globular shape of each domain and the flexible linkers between them. Functionally, note how each domain carries out a different stage of infection by the bacterium: receptor binding, membrane penetration and ADP-ribosylation of the target protein. Evolutionarily, note the occurrence of domains homologous to the catalytic domain of diphtheria toxin in exo-, entero- and pertussis toxins, and in poly-ADP-ribose polymerase (Holm & Sander, 1999). Arrows point to recurrent substructures in structural neighbours (Lionetti et al., 1991; Li et al., 1996; Tormo et al., 1996) of each domain of diphtheria toxin. Drawn using MOLSCRIPT version 2 (Kraulis, 1991).

The problem of subdividing protein molecules into structural and functional units has received the attention of numerous researchers over the last 25 years. Early algorithms focused on protein folding or unfolding pathways and aimed at identifying substructures that would be physically stable on their own. Nowadays, with bulging macromolecular databases, the focus has shifted to devising automatic methods for identifying domains that can form the basis for a consistent protein-structure classification (Murzin et al., 1995; Orengo et al., 1997; Holm & Sander, 1999).

This review presents the concepts underlying computational methods for locating domains in three-dimensional structures. Those interested in implementations are referred to the web services of the European Bioinformatics Institute and related sites.

23.2.2. Compactness

A variety of ingenious techniques have been invented for locating structural domains in 3D structures. These include inspection of distance maps, clustering, neighbourhood correlation, plane cutting, interface area minimization, specific volume minimization, searching for mechanical hinge points, maximization of compactness and maximization of buried surface area (Rossmann & Liljas, 1974; Rashin, 1976; Crippen, 1978; Nemethy & Scheraga, 1979; Rose, 1979; Schulz & Schirmer, 1979; Go, 1981; Lesk & Rose, 1981; Sander, 1981; Wodak & Janin, 1981; Zehfus & Rose, 1986; Kikuchi et al., 1988; Moult & Unger, 1991; Holm & Sander, 1994; Zehfus, 1994; Islam et al., 1995; Siddiqui & Barton, 1995; Swindells, 1995; Holm & Sander, 1996; Sowdhamini et al., 1996; Zehfus, 1997; Holm & Sander, 1998; Jones et al., 1998; Wernisch et al., 1999).

Common to most approaches are the assumptions that folding units are compact and that the interactions between them are weak. These notions can be made quantitative, for example, by counting interatomic contacts and by locating domain borders by identifying groups of residues such that the number of contacts between groups is minimized. The hierarchic organization of putative folding units can be inferred starting from the complete structure and recursively cutting it (in silico) into smaller and smaller substructures. Alternatively, one may start from the residue or secondary-structure-element level and successively associate the most strongly interacting groups. The procedure involves two optimization problems.
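As a rough illustration of the contact-counting idea, the following Python sketch places a single cut in a continuous chain so that the number of contacts across the cut is minimal. It is a toy under stated assumptions, not any of the published algorithms: the function names, the use of one coordinate per residue, the 8 Å contact cutoff and the minimum segment size are all hypothetical choices made for the example.

```python
import math

def contact_matrix(coords, cutoff=8.0):
    """Boolean matrix: True where two different residues lie within `cutoff` angstroms.

    `coords` holds one (x, y, z) point per residue (an assumption of this sketch).
    """
    n = len(coords)
    return [[i != j and math.dist(coords[i], coords[j]) <= cutoff
             for j in range(n)] for i in range(n)]

def best_single_cut(coords, cutoff=8.0, min_size=2):
    """Return (k, contacts_across) for the split [0, k) / [k, n) with fewest contacts.

    Only sequentially continuous segments are considered. Returns None when the
    chain is too short to give two segments of at least `min_size` residues.
    """
    contacts = contact_matrix(coords, cutoff)
    n = len(coords)
    best = None
    for k in range(min_size, n - min_size + 1):
        across = sum(contacts[i][j] for i in range(k) for j in range(k, n))
        if best is None or across < best[1]:
            best = (k, across)
    return best

# Two well separated clusters of four residues each.
cluster_a = [(0, 0, 0), (3, 0, 0), (0, 3, 0), (3, 3, 0)]
cluster_b = [(x + 100, y, z) for x, y, z in cluster_a]
print(best_single_cut(cluster_a + cluster_b))  # (4, 0)
```

Applying the same cut recursively to each resulting segment would give the top-down hierarchy described above; merging the most strongly interacting groups would give the bottom-up one.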

The first optimization problem is algorithmic and concerns finding the optimal subdivisions. This problem is complicated by the possibility of the chain passing several times between domains (discontinuous domains). Without the constraint of sequential continuity, there is a combinatorial number of possibilities for dividing a set of residues into subsets (Zehfus, 1994). This hurdle has been overcome by fast heuristics (Holm & Sander, 1994; Zehfus, 1997; Wernisch et al., 1999).
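To make the size of the search space concrete, the sketch below counts the ways of dividing n residues into two non-empty parts, with and without the continuity constraint. The function names are hypothetical; the counts themselves are elementary combinatorics (n - 1 cut positions along a chain, and 2^(n-1) - 1 unordered bipartitions of a set).

```python
def continuous_two_way_splits(n):
    """Number of ways to cut a chain of n residues into two continuous segments."""
    return max(n - 1, 0)

def unconstrained_two_way_splits(n):
    """Number of ways to divide n residues into two non-empty, unordered subsets."""
    return 2 ** (n - 1) - 1 if n >= 1 else 0

for n in (4, 10, 100):
    print(n, continuous_two_way_splits(n), unconstrained_two_way_splits(n))
```

For a chain of only 100 residues the unconstrained count already has 30 decimal digits, which is why exhaustive enumeration is not an option once discontinuous domains are allowed.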

The second optimization problem concerns formulating physical criteria that distinguish between autonomous and nonautonomous folding units, i.e., defining termination criteria for recursive algorithms. Since compactness-related criteria do not have a clear bimodal distribution, domain-assignment algorithms (Holm & Sander, 1994; Islam et al., 1995; Siddiqui & Barton, 1995; Swindells, 1995; Sowdhamini et al., 1996; Wernisch et al., 1999) use cutoff parameters that have been fine-tuned against an external reference set of domain definitions.
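The role of such a cutoff can be seen in a minimal recursive sketch. It assumes a precomputed boolean contact matrix and continuous segments only, and it stops cutting when the best available cut would sever more than `max_interface` contacts. The function name and both parameters are hypothetical; real methods use more elaborate criteria and tune their thresholds against reference sets.

```python
def split_recursively(contacts, start, end, max_interface=2, min_size=2):
    """Recursively cut the segment [start, end) and return the final segments.

    `contacts[i][j]` is True when residues i and j are in contact.
    A segment is left intact when no cut severs `max_interface` contacts or fewer.
    """
    best_k, best_across = None, None
    for k in range(start + min_size, end - min_size + 1):
        across = sum(contacts[i][j]
                     for i in range(start, k) for j in range(k, end))
        if best_across is None or across < best_across:
            best_k, best_across = k, across
    # Termination criterion: too short to cut, or the interface is too strong.
    if best_k is None or best_across > max_interface:
        return [(start, end)]
    return (split_recursively(contacts, start, best_k, max_interface, min_size)
            + split_recursively(contacts, best_k, end, max_interface, min_size))

# Two fully connected blocks of four residues joined by a single contact.
n = 8
contacts = [[i != j and (i < 4) == (j < 4) for j in range(n)] for i in range(n)]
contacts[3][4] = contacts[4][3] = True

print(split_recursively(contacts, 0, n, max_interface=1))  # [(0, 4), (4, 8)]
print(split_recursively(contacts, 0, n, max_interface=0))  # [(0, 8)]
```

The two calls differ only in the cutoff, yet one reports two domains and the other reports one, which shows how strongly the outcome depends on a parameter that has no natural value.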

23.2.3. Recurrence

Most fold classifications use a hierarchical model where evolutionary families are a subcategory of fold type, and it is natural to assume that domain boundaries should be conserved in evolution. Consistency concerns lead to a reformulation of the goals of the domain-assignment problem, away from (imprecise) physical models of stable folding units and towards recognizing such units phenomenologically in the database of known structures through recurrence. The concept of recurrence has long been the cornerstone of domain assignments by experts based on visual inspection (Richardson, 1981). Recurrence means recognizing architectural units in one protein that have already been defined (named) in another.

The practical importance of domain identification is illustrated by the discoveries made by a systematic structure comparison of recurrent domains between histidine triad (HIT) proteins and galactose-1-phosphate uridylyltransferase [homodimer and internally duplicated common catalytic core, respectively (Holm & Sander, 1997)], and between beta-glucosyltransferase and glycogen phosphorylase [bare and heavily decorated common catalytic core, respectively (Holm & Sander, 1995; Artymiuk et al., 1995)], even though the contours of the molecules look quite different.

Let us restate the goal of domain identification as an economic description of all known protein structures in terms of a small set of large substructures. This is an intuitive goal and conceptually related to the principle of minimal encoding in information theory. The key ingredients of the optimization problem are the gain associated with reusing a substructure and the cost associated with using many small substructures to describe a protein. An analogy in writing is that copying blocks of text is cheap, but for coherence some thought and effort is necessary for bridging the blocks.

With a suitably defined cost function, recurrence can be used to select an optimal set of substructures from the hierarchic folding or unfolding trees generated using compactness criteria. Thus, the unsatisfactorily solved problem of defining termination criteria for compactness algorithms can be turned into an optimization problem that does not rely on any external reference and leads to an internally consistent set of domain definitions.
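The selection step can be sketched as a simple recursion over an unfolding tree: at each node, either keep the node as one unit or replace it by the best selection from its children, whichever is cheaper. The cost function below (a fixed penalty per unit minus a gain proportional to size times recurrence count) is a hypothetical stand-in chosen only to show the mechanics; it is not the objective function of any published method, and the `Node` class, parameter values and example numbers are likewise assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    size: int          # number of residues in the substructure
    recurrence: int    # times the substructure recurs elsewhere (made-up input)
    children: list = field(default_factory=list)

def unit_cost(node, unit_penalty=10.0, gain_per_recurrence=0.05):
    """Hypothetical cost: each unit costs a penalty, reuse of large units pays back."""
    return unit_penalty - gain_per_recurrence * node.size * node.recurrence

def select_units(node):
    """Return (total_cost, unit_names) for the cheapest description of `node`."""
    own = unit_cost(node)
    if not node.children:
        return own, [node.name]
    results = [select_units(child) for child in node.children]
    split_cost = sum(cost for cost, _ in results)
    if own <= split_cost:
        return own, [node.name]
    return split_cost, [name for _, names in results for name in names]

# Toy tree: the whole chain never recurs, its two halves do, and one half
# contains a very common but small hairpin.
tree = Node("whole", 300, 0, [
    Node("domA", 150, 20, [Node("hairpin", 30, 50), Node("rest", 120, 0)]),
    Node("domB", 150, 5),
])
cost, units = select_units(tree)
print(units)  # ['domA', 'domB']
```

In this toy case the small hairpin recurs most often, but the larger unit containing it is still selected because the gain scales with size, and no externally tuned termination cutoff is involved.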

The key difficulty is in quantifying the notion of economy so that it leads to a selection of substructures of 'appropriate' size, i.e., globular folds and not, for example, supersecondary-structure motifs. One solution, which is physical nonsense but has the desired qualitative behaviour, is a heuristic objective function used in the DALI domain dictionary (Holm & Sander, 1998). Recurrence is quantified in terms of the statistical significance of structural similarity for many pairs of substructures. The statistical significance is highest for structural similarities that involve large units and that completely cover a substructure unit. Exploiting these effects, a sum-of-pairs objective function is defined that favours recurrences of large substructures with distinct topological arrangements and packing of secondary-structure elements, and disfavours small substructures consisting of one or two secondary-structure elements despite their higher frequency of recurrence. Though other formulations of the optimization problem are possible, this empirically chosen objective function combined with a heuristic algorithm for optimization yields a useful set of substructures (domains).

23.2.4. Conclusion

While we do not foresee that automatically delineated domains will be accepted as the gold standard of the trade, modern methods, based on a combination of recurrence and compactness criteria, yield domain definitions that are consistent within protein families and often coincide with biologically functional units, recover the well known folding topologies with many members, produce clusters with good coverage of common secondary-structure elements, and provide a useful basis for large-scale structure analysis and classification.


Artymiuk, P. J., Rice, D. W., Poirrette, A. R. & Willett, P. (1995). Beta-glucosyltransferase and phosphorylase reveal their common theme. Nat. Struct. Biol. 2, 117–120.
Bennett, M. J. & Eisenberg, D. (1994). Refined structure of monomeric diphtheria toxin at 2.3 Å resolution. Protein Sci. 3, 1464–1475.
Bork, P. (1992). Mobile modules and motifs. Curr. Opin. Struct. Biol. 2, 413–421.
Crippen, G. (1978). The tree structural organization of proteins. J. Mol. Biol. 126, 315–332.
Go, M. (1981). Correlation of DNA exonic regions with protein structural units in hemoglobin. Nature (London), 291, 90–92.
Holm, L. & Sander, C. (1994). Parser for protein folding units. Proteins, 19, 256–268.
Holm, L. & Sander, C. (1995). Evolutionary link between glycogen phosphorylase and a DNA modifying enzyme. EMBO J. 14, 1287–1293.
Holm, L. & Sander, C. (1996). Mapping the protein universe. Science, 273, 595–602.
Holm, L. & Sander, C. (1997). Enzyme HIT. Trends Biochem. Sci. 22, 116–117.
Holm, L. & Sander, C. (1998). Dictionary of recurrent domains in protein structures. Proteins, 33, 88–96.
Holm, L. & Sander, C. (1999). Protein folds and families: sequence and structure alignments. Nucleic Acids Res. 27, 244–247.
Islam, S. A., Luo, J. & Sternberg, M. J. (1995). Identification and analysis of domains in proteins. Protein Eng. 8, 513–525.
Jones, S., Stewart, M., Michie, A. D., Swindells, M. B., Orengo, C. A. & Thornton, J. M. (1998). Domain assignment for protein structures using a consensus approach: characterisation and analysis. Protein Sci. 7, 233–242.
Kikuchi, T., Nemethy, G. & Scheraga, H. A. (1988). Prediction of the location of structural domains in globular proteins. J. Protein Chem. 7, 427–471.
Kraulis, P. J. (1991). MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J. Appl. Cryst. 24, 946–950.
Lesk, A. M. & Rose, G. D. (1981). Folding units in globular proteins. Proc. Natl Acad. Sci. USA, 78, 4304–4308.
Li, M., Dyda, F., Benhar, I., Pastan, I. & Davies, D. R. (1996). Crystal structure of the catalytic domain of Pseudomonas exotoxin A complexed with a nicotinamide adenine dinucleotide analog: implications for the activation process and for ADP ribosylation. Proc. Natl Acad. Sci. USA, 93, 6902–6906.
Lionetti, C., Guanziroli, M. G., Frigerio, F., Ascenzi, P. & Bolognesi, M. (1991). X-ray crystal structure of the ferric sperm whale myoglobin: imidazole complex at 2.0 Å resolution. J. Mol. Biol. 217, 409–412.
Moult, J. & Unger, R. (1991). An analysis of protein folding pathways. Biochemistry, 30, 3816–3824.
Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of the protein database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
Nemethy, G. & Scheraga, H. A. (1979). A possible folding pathway of bovine pancreatic RNase. Proc. Natl Acad. Sci. USA, 76, 6050–6054.
Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997). CATH – a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
Rashin, A. A. (1976). Location of domains in globular proteins. Nature (London), 291, 85–87.
Richardson, J. S. (1981). The anatomy and taxonomy of protein structure. Adv. Protein Chem. 34, 167–339.
Rose, G. D. (1979). Hierarchic organization of domains in globular proteins. J. Mol. Biol. 134, 447–470.
Rossmann, M. & Liljas, A. (1974). Recognition of structural domains in globular proteins. J. Mol. Biol. 85, 177–181.
Sander, C. (1981). Physical criteria for folding units of globular proteins. In Structural Aspects of Recognition and Assembly in Biological Macromolecules, Vol. I. Proteins and Protein Complexes, Fibrous Proteins, edited by M. Balaban, pp. 183–195. Jerusalem: Alpha Press.
Schulz, G. E. & Schirmer, H. (1979). Principles of Protein Structure, ch. 5. New York: Springer Verlag.
Siddiqui, A. S. & Barton, G. J. (1995). Continuous and discontinuous domains: an algorithm for the automatic generation of reliable protein domain definitions. Protein Sci. 4, 872–884.
Sowdhamini, R., Rufino, S. D. & Blundell, T. L. (1996). A database of globular protein structural domains: clustering of representative family members into similar folds. Structure Fold. Des. 1, 209–220.
Swindells, M. B. (1995). A procedure for detecting structural domains in proteins. Protein Sci. 4, 103–112.
Tormo, J., Lamed, R., Chirino, A. J., Morag, E., Bayer, E. A., Shoham, Y. & Steitz, T. A. (1996). Crystal structure of a bacterial family-III cellulose-binding domain: a general mechanism for attachment to cellulose. EMBO J. 15, 5739–5751.
Wernisch, L., Hunting, M. & Wodak, J. (1999). Identification of structural domains in proteins by a graph heuristic. Proteins, 35, 338–352.
Wetlaufer, D. B. (1973). Nucleation, rapid folding, and globular intrachain regions in proteins. Proc. Natl Acad. Sci. USA, 70, 697–701.
Wodak, J. & Janin, J. (1981). Location of structural domains in proteins. Biochemistry, 20, 6544–6552.
Zehfus, M. H. (1994). Binary discontinuous compact protein domains. Protein Eng. 7, 335–340.
Zehfus, M. H. (1997). Identification of compact, hydrophobically stabilized domains and modules containing multiple peptide chains. Protein Sci. 6, 1210–1219.
Zehfus, M. H. & Rose, G. D. (1986). Compact units in proteins. Biochemistry, 25, 5759–5765.