InternationalReciprocal spaceTables for Crystallography Volume B Edited by U. Shmueli © International Union of Crystallography 2006 |
International Tables for Crystallography (2006). Vol. B, ch. 3.3, pp. 377-380
## Section 3.3.2. Molecular modelling, problems and approaches |

This section is concerned with software techniques which permit a set of atomic coordinates for a molecule to be generated *ab initio*, or to be modified, by reference to some chosen criterion, usually the electron density. Software that can change the shape of a molecule must be cognizant of the connectivity of the molecule and the bonding characteristics of atom types. It must also have means of regaining good stereochemistry if current coordinates are poor in this respect, or of performing its manipulations in ways which conserve essential stereochemical features. Approaches to some of these problems are outlined below. Many of the issues involved, including the topics outlined in Section 3.3.1.4 above, have been excellently reviewed by Hermans (1985), though with little reference to graphical aspects, and a comprehensive treatment of modelling methods based on energies is given by Burkert & Allinger (1982).

It is necessary to distinguish three different kinds of connectivity, namely structural, logical and drawing connectivities. Structural connectivity consists of the specification of the chemical bonding of the molecule and, as such, is an absolute property of the molecule. Logical connectivity consists of the specification of what part or parts of a molecule are moved, and in what way, if some stereochemical feature is altered. Logical and structural connectivity are closely related and in simple cases coincide, but the distinction is apparent, for example, if the puckering of a five-membered ring is being modelled by permitting folding of the pentagon about a line connecting non-adjacent corners. This line is then a logical connection between two moving parts, but it is not a feature of the structural connectivity.

Drawing connectivity consists of a specification of the lines to be drawn to represent the molecule and often coincides with the structural connectivity. However, stylized drawings, such as those showing the α-carbon atoms of a protein, require to be drawn with lines which are features neither of the logical nor of the structural connectivity.

The simplest means of storing connectivity information is by means of tables in which, for each atom, a list of indices of other atoms to which it is connected is stored. This approach is quite general; it may serve any type of molecular structure and permits structures to be traversed in a variety of ways. In this form, however, it is extravagant on storage because every connection is stored twice, once at each of the nodes it connects. It may, however, provide the starting material for the algorithm of Section 3.3.1.5.2 and its generality may justify its expense.

From such a list, lists of bonds, bond angles and dihedral angles may readily be derived in which each entry points to two, three or four atoms in the atom list. Lists of these three types form the basis of procedures which adjust the shape of a molecule to reduce its estimated potential energy (Levitt & Lifson, 1969; Levitt, 1974), and of search-and-retrieval techniques (Allen *et al.,* 1979).

Katz & Levinthal (1972) discuss the explicit specification of structural connectivity in terms of a tree structure in which, for each atom, is stored a single pointer to the connected atom nearer to the root, virtual atoms being used to allow ring structures to be treated as trees. An algorithm is also presented which allows such a tree specification to be redetermined if an atom in the tree is newly chosen as the root atom or if the tree itself is modified.

Cohen *et al.* (1981) have developed methods of handling connectivity in complicated fused- and bridged-ring systems.

In cases where software is required to deal only with a certain class of molecule, it may be possible to exploit the characteristics of that class to define an ordering for lists of atoms such that connectivity is implied by the ordering of items in the list. Such an ordering may successfully define one of the three types of connectivity defined in Section 3.3.2.1 but it is unlikely to be able to meet the needs of all three simultaneously. It may also be at a disadvantage when required to deal with structures not part of the class for which it is designed. Within these limitations, however, it may be exceedingly efficient. Both proteins and nucleic acids are of a class which permits their logical connectivity to be specified entirely by list ordering, and the software described in Section 3.3.3.2.6 uses no connectivity tables for this purpose. The ordering rules concerned are given by Diamond (1976*b*).

Drawing connectivity needs explicit specification in such a case; this may be done using only one 16-bit integer per atom, which may be stored as part of the atom list without the need of a separate table. This integer consists of two signed bytes which act as relative pointers in the list, positive pointers implying draw-to, negative pointers implying move-to. As each atom is encountered during drawing the right byte is read and utilized, and the two bytes are swapped before proceeding. This allows up to two bonds drawn to an atom and two bonds drawn from it, four in all, with a minimum of storage (Diamond, 1984*a*).

Brandenburg *et al.* (1981) handle drawing connectivity by enlarging the molecular list with duplicate atoms such that each is connected to the next in the list, but moves and draws still need to be distinguished.

Levitt (1971) has developed a syntax for specifying structural connectivity implicitly from a list structure which is very general, though designed with biopolymers in mind, and the work of Katz & Levinthal (1972) includes something similar.

Fundamental to the design of any software for molecular modelling are the choices of modelling criteria, and of parameterization. Criteria which may be adopted might include the fitting of electron density, the minimization of an energy estimate or the matching of complementary surfaces between a pair of molecules. Parameterizations which may be adopted include the use of Cartesian coordinates of atoms as independent variables, or of internal coordinates, such as dihedral angles, as independent variables with atomic positions being dependent on these. Systems designed to suit energetic criteria usually use Cartesian coordinates since all aspects of the structure, including bond lengths, must be treated as variables and be allowed to contribute to the energy estimate. Systems designed to fit a model to observed electron density, however, may adequately meet the stereochemical requirements of modelling on either parameterization, and examples of both types appear below.

Inputs to modelling systems vary widely. Systems intended for use mainly with proteins or other polymeric structures usually work with a library of monomers which the software may develop into a polymer. Systems intended for smaller molecules usually develop the molecular structure atom-by-atom rather than a residue at a time, and systems of this kind require a very general form of input. They may accept a list of atom types and coordinates if measurement and display of a known molecule is the objective, or they may accept `sketch-pad' input in the form of a hand-drawn two-dimensional sketch of the type conventional in chemistry, if the objective is the design of a molecule. Sketch-pad input is a feature of some systems with quantum-mechanical capabilities.

Suppose that **t** represents a vector from the current position of an atom in the model to a target position then (see Section 3.3.1.1.3), to first order, the observational equations are in which represents changes to conformational variables which may include dihedral angles, bond angles, bond lengths, and parameters determining overall position and orientation of the molecule as a whole. If every such parameter is included the model acquires 3*n* degrees of freedom for *n* atoms, in which case the methods of the next section are more appropriate, but if bond lengths and some or all bond angles are being treated as constants then the above equation becomes the basis of the treatment. in which is a unit vector defining the axis of rotation for an angular variable , and are position vectors of the atom *A* and the site of the parameter *P*, and represents a residual vector.

is minimized by in which More generally, if σ represents any scalar quantity which is to be minimized, *e.g.* an energy, then

It is beyond the scope of this chapter to review the methods available for evaluating from these equations. Difficulties may arise from two sources:

Difficulties of the first kind may be overcome by gradient methods, for example the conjugate gradient method without searches if ** M** is available or with searches if it is not available, or they may be overcome by methods based on eigenvalue decompositions. If non-linearity is serious less dependence should be placed on

**and gradient methods using searches are more valuable. In this connection Diamond (1966) introduced a sliding filter technique which produced rapid convergence in extreme conditions of non-linearity. These topics have been reviewed elsewhere (Diamond, 1981, 1984**

*M**b*) and are the subject of many textbooks (Walsh, 1975; Gill

*et al.*, 1981; Luenberger, 1984).

Warme *et al.* (1972) have developed a similar system using dihedral angles as variables and a variety of alternative optimization algorithms.

The modelling of flexible rings or lengths of chain with two or more fixed parts is sometimes held to be a difficulty in methods using conformational variables, although a simple two-stage solution does exist. The principle involved is the sectioning of the space of the variables into two orthogonal subspaces of which the first is used to satisfy the constraints and the second is used to perform the optimization subject to those constraints.

The algebra of the method may be outlined as follows, and is given in more detail by Diamond (1971, 1980*a*,*b*). Parametric shifts which satisfy the constraints are solutions of in which and depend only on the target vectors, , of the atoms with constrained positions and on the corresponding derivatives. We then find a partitioned orthogonal matrix satisfying in which are the eigenvectors of having positive eigenvalues, are those having zero eigenvalues, and and are arbitrary orthogonal matrices. Then in which the matrix to be inverted is positive definite. , however, is rectangular so that multiplying on the left by does not necessarily serve to determine , but we may write giving and is indeterminate and free to adopt any value. We therefore adopt which is the smallest vector of parametric shifts which will satisfy the constraints, and allow to be determined by the remaining observational equations in which the target vectors, **t**, are now modified to according to and being the derivatives and effective target vectors for the unconstrained atoms. We then solve in which is required to be of the form giving and apply the total shifts to obtain a structure which is optimized within the restrictions imposed by the constraints.

It may happen that is itself singular because there are insufficient data in the vector to control the structure and the parametric shifts contained in fully. In this event the same process may be applied again, basing the solution for on so that the vectors in represent the degrees of freedom which remain uncommitted. This method of application of constraints by subspace sectioning may be nested to any depth and is completely general.

A valid matrix may be found from by using the fact that the columns of are all linear combinations of the columns of and are void of any contribution from . It follows that may be found by using the columns of as priming vectors in the Gram–Schmidt process [Section 3.3.1.2.3 (i)] until the normalizing step involves division by zero. is then complete if all the columns of have been tried. may then be completed by using arbitrary vectors as primers.

Manipulation of a ring of *n* atoms may be achieved by treating it as a chain of atoms [having bond lengths, *n* bond angles and dihedral angles] in which atom 1 is required to coincide with atom and atom 2 with . then contains two vectors, namely the lack-of-closure vectors at these points, and is typically zero. is then found to have five columns corresponding to the five degrees of freedom of two points of fixed separation; contains only zeros if the ring is initially closed, and contains ring-closure corrections if, through non-linearity or otherwise, the ring has opened. contains columns if the chain of points has *p* variable parameters. It follows, if bond lengths and bond angles are treated as constants, that the seven-membered ring is the smallest ring which is flexible, that the six-membered ring (if it can be closed with the given bond angles) has no flexibility (though it may have discrete alternatives) and that it may be impossible to close a five-membered ring. Therefore some variation of bond angles and/or bond lengths is essential for the modelling of flexible five- and six-membered rings. Treating the ring as a chain of atoms is less satisfactory as there is then no control over the bond angle at the point of ring closure.

A useful concept for the modelling of flexible five-membered rings with near-constant bond angles is the concept of the pseudorotation angle *P*, and amplitude , for which the *j*th dihedral angle is given by This formulation has the property , which is not exactly true; nevertheless, values measured from observed conformations comply with this formulation within a degree or so (Altona & Sundaralingam, 1972).

Software specialized to the handling of condensed ring systems has been developed by van der Lieth *et al.* (1984) (Section 3.3.3.3.1) and by Cohen *et al.* (1981) (Section 3.3.3.3.2).

Modelling methods in which atomic coordinates are the independent variables are mathematically simpler than those using angular variables especially if the function to be minimized is a quadratic function of interatomic distances or of distances between atoms and fixed points. The method of Dodson *et al.* (1976) is representative of this class and it may be outlined as follows. If **d** is a column vector containing ideal values of the scalar distances from atoms to fixed target points or to other atoms, and if **l** is a column vector containing the prevailing values of these quantities obtained from the model, then in which the column matrix contains alterations to the atomic coordinates, contains residual discrepancies and ** D** is a large sparse rectangular matrix containing values of , of which there are not more than six non-zero values on any row, consisting of direction cosines of the line of which

**l**is the length. is then minimized by setting which they solve by the method of conjugate gradients without searches. This places reliance on the linearity of the observational equations (Diamond, 1984

*b*). It also works entirely with the sparse matrix , the dense matrix , and its inverse, being never calculated.

The method is extremely efficient in annealing a model structure for which an initial position for every atom is available, especially if the required shifts are within the quasi-linear region, but is less effective when large dihedral-angle changes are involved or when many atoms are to be placed purely by interpolation between a few others for which target positions are available. Interbond angles are controlled by assigning *d* values to second-nearest-neighbour distances; this is effective except for bond angles near 180° so that, in particular, planar groups require an out-of-plane dummy atom to be included which has no target position of its own but does have target values of distances between itself and atoms in the planar group. The method requires a *d* value to be supplied for every type of nearest- and next-nearest-neighbour distance in the structure, of which there are many, together with *W* values which are the inverse variances of the distances concerned as assessed by surveys of the corresponding distances in small-molecule structures, or from estimates of their accuracy, or from estimates of accuracy of the target positions.

Hermans & McQueen (1974) published a similar method which differs in that it moves only one atom at a time, in the environment of its neighbours, these being considered fixed while the central atom is under consideration. This is inefficient in the sense that in any one cycle one atom moves only a small fraction (∼3%) of the distance it will ultimately be required to move, but individual atom cycles are so cheap and simple that many cycles can be afforded. The method was selected for inclusion in *Frodo* by Jones (1978) (Section 3.3.3.2.7) and is an integral part of the *GRIP* system (Tsernoglou *et al.*, 1977; Girling *et al.*, 1980) (Section 3.3.3.2.2) for which it was designed.

Modelling methods which operate by minimizing an objective function of the coordinates (whether conformational or positional) suffer from the fact that any realistic objective function representing the potential energy of the structure is likely to have many minima in the space of the variables for any but the simplest problems. No general system has yet been devised that can ensure that the global minimum is always found in such cases, but we indicate here two approaches to this problem.

The first approach uses dynamics to escape from potential-energy minima. Molecular-mechanics simulations allow each atom to possess momentum as well as position and integrate the equations of motion, conserving the total energy. By progressively removing energy from the simulation by scaling down the momentum vectors some potential-energy minimum may be found. Conversely, a minimization of potential energy which has led to a minimum thought not to be the global minimum may be continued by introducing atomic momenta sufficient to overcome potential-energy barriers between minima, and subsequently attenuate the momenta again. In this way a number of minima may be found (Levitt & Warshel, 1975). It is equivalent to initializing a potential-energy minimization from a number of different conformations but it has the property that the minima so found are separated by energy barriers for which an upper limit is known so that the possibility exists of exploring transition pathways.

A second approach (Purisima & Scheraga, 1986) is relatively new. If the objective function to be minimized can be expressed in terms of interatomic distances, and if each atom is given coordinates in a space of dimensions for *n* atoms, then a starting structure may be postulated for which the interatomic distances *all* take their ideal values and the objective function is then necessarily at an absolute minimum. This multidimensional structure is then projected into a space of fewer dimensions, within which it is again optimized with respect to the objective function. The dimensionality of the model is thus progressively reduced until a three-dimensional model is attained at a low energy. This means that the minimum so attained in three dimensions is approached *from beneath*, having previously possessed a lower value in a higher-dimensional space. This, in itself, does not guarantee that the three-dimensional minimum-energy structure so found is at the global minimum, but it is not affected by energy barriers between minima in the same way, and it does appear to reach very low minima, and frequently the global one. Because it is formulated entirely in terms of interatomic distances it offers great promise for modelling molecules on the basis of data from nuclear magnetic resonance.

### References

Allen, F. H., Bellard, S., Brice, M. D., Cartwright, B. A., Doubleday, A., Higgs, H., Hummelink, T., Hummelink-Peters, B. G., Kennard, O., Motherwell, W. D. S., Rodgers, J. R. & Watson, D. G. (1979).*The Cambridge Crystallographic Data Centre: computer-based search, retrieval, analysis and display of information. Acta Cryst.*B

**35**, 2331–2339.

Altona, C. & Sundaralingam, M. (1972).

*Conformational analysis of the sugar ring in nucleosides and nucleotides. A new description using the concept of pseudorotation. J. Am. Chem. Soc.*

**94**(23), 8205–8212.

Brandenburg, N. P., Dempsey, S., Dijkstra, B. W., Lijk, L. J. & Hol, W. G. J. (1981).

*An interactive graphics system for comparing and model building of macromolecules. J. Appl. Cryst.*

**14**, 274–279.

Burkert, U. & Allinger, N. L. (1982).

*Molecular mechanics. ACS Monogr.*No. 177.

Cohen, N. C., Colin, P. & Lemoine, G. (1981).

*Script: interactive molecular geometrical treatments on the basis of computer-drawn chemical formula. Tetrahedron*,

**37**, 1711–1721.

Diamond, R. (1966).

*A mathematical model-building procedure for proteins. Acta Cryst.*

**21**, 253–266.

Diamond, R. (1971).

*A real-space refinement procedure for proteins. Acta Cryst.*A

**27**, 436–452.

Diamond, R. (1976

*b*).

*Model building techniques for macromolecules.*In

*Crystallographic computing techniques*, edited by F. R. Ahmed, K. Huml & B. Sedlacek, pp. 336–343. Copenhagen: Munksgaard.

Diamond, R. (1980

*a*).

*BILDER: a computer graphics program for biopolymers and its application to the interpretation of the structure of tobacco mosaic virus protein discs at 2.8 Å resolution.*In

*Biomolecular structure, conformation, function and evolution*, Vol. 1, edited by R. Srinivasan, pp. 567–588. Oxford: Pergamon Press.

Diamond, R. (1980

*b*).

*Some problems in macromolecular map interpretation.*In

*Computing in crystallography*, edited by R. Diamond, S. Ramaseshan & K. Venkatesan, pp. 21.01–21.19. Bangalore: Indian Academy of Sciences for the International Union of Crystallography.

Diamond, R. (1981).

*A review of the principles and properties of the method of least squares.*In

*Structural aspects of biomolecules*, edited by R. Srinivasan & V. Pattabhi, pp. 81–122. Delhi: Macmillan India Ltd.

Diamond, R. (1984

*a*).

*Applications of computer graphics in molecular biology. Comput. Graphics Forum*,

**3**, 3–11.

Diamond, R. (1984

*b*).

*Least squares and related optimisation techniques.*In

*Methods and applications in crystallographic computing*, edited by S. R. Hall & T. Ashida, pp. 174–192. Oxford University Press.

Dodson, E. J., Isaacs, N. W. & Rollett, J. S. (1976).

*A method for fitting satisfactory models to sets of atomic positions in protein structure refinements. Acta Cryst.*A

**32**, 311–315.

Gill, P. E., Murray, W. & Wright, M. H. (1981).

*Practical optimization.*Orlando, Florida: Academic Press.

Girling, R. L., Houston, T. E., Schmidt, W. C. Jr & Amma, E. L. (1980).

*Macromolecular structure refinement by restrained least-squares and interactive graphics as applied to sickling deer type III hemoglobin. Acta Cryst.*A

**36**, 43–50.

Hermans, J. (1985).

*Rationalization of molecular models.*In

*Methods in enzymology*, Vol. 115.

*Diffraction methods for biological molecules*, Part B, edited by H. W. Wyckoff, C. H. W. Hirs & S. N. Timasheff, pp. 171–189. Orlando, Florida: Academic Press.

Hermans, J. & McQueen, J. E. (1974).

*Computer manipulation of (macro) molecules with the method of local change. Acta Cryst.*A

**30**, 730–739.

Jones, T. A. (1978).

*A graphics model building and refinement system for macromolecules. J. Appl. Cryst.*

**11**, 268–272.

Katz, L. & Levinthal, C. (1972).

*Interactive computer graphics and representation of complex biological structures. Annu. Rev. Biophys. Bioeng.*

**1**, 465–504.

Levitt, M. (1971). PhD Dissertation, ch. 2. University of Cambridge, England.

Levitt, M. (1974).

*Energy refinement of hen egg-white lysozyme. J. Mol. Biol.*

**82**, 393–420.

Levitt, M. & Lifson, S. (1969).

*Refinement of protein conformations using a macromolecular energy minimization procedure. J. Mol. Biol.*

**46**, 269–279.

Levitt, M. & Warshel, A. (1975).

*Computer simulation of protein folding. Nature (London)*,

**253**, 694–698.

Lieth, C. W. van der, Carter, R. E., Dolata, D. P. & Liljefors, T. (1984).

*RINGS – a general program to build ring systems. J. Mol. Graphics*,

**2**, 117–123.

Luenberger, D. G. (1984).

*Linear and nonlinear programming.*Reading, Massachusetts: Addison Wesley.

Purisima, E. O. & Scheraga, H. A. (1986).

*An approach to the multiple-minima problem by relaxing dimensionality. Proc. Natl Acad. Sci. USA*,

**83**, 2782–2786.

Tsernoglou, D., Petsko, G. A., McQueen, J. E. & Hermans, J. (1977).

*Molecular graphics: application to the structure determination of a snake venom neurotoxin. Science*,

**197**, 1378–1381.

Walsh, G. R. (1975).

*Methods of optimization.*London: John Wiley.

Warme, P. K., Go, N. & Scheraga, H. A. (1972).

*Refinement of X-ray data of proteins. 1. Adjustment of atomic coordinates to conform to a specified geometry. J. Comput. Phys.*

**9**, 303–317.