International
Tables for
Crystallography
Volume G
Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

International Tables for Crystallography (2006). Vol. G, ch. 3.6, pp. 152-164

Section 3.6.6. Analysis

P. M. D. Fitzgerald [a]*, J. D. Westbrook [b], P. E. Bourne [c], B. McMahon [d], K. D. Watenpaugh [e] and H. M. Berman [f]

[a] Merck Research Laboratories, Rahway, New Jersey, USA
[b] Protein Data Bank, Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Department of Chemistry and Chemical Biology, 610 Taylor Road, Piscataway, New Jersey, USA
[c] Research Collaboratory for Structural Bioinformatics, San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0537, USA
[d] International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, England
[e] retired; formerly Structural, Analytical and Medicinal Chemistry, Pharmacia Corporation, Kalamazoo, Michigan, USA
[f] Protein Data Bank, Research Collaboratory for Structural Bioinformatics, Rutgers, The State University of New Jersey, Department of Chemistry and Chemical Biology, 610 Taylor Road, Piscataway, New Jersey, USA
Correspondence e-mail:  paula_fitzgerald@merck.com

3.6.6. Analysis


The mmCIF dictionary contributes several new categories and data items to the REFINE and REFLN category groups. These reflect common practices in macromolecular crystallography in refinement and in the handling of experimental observations.

A new category group, the PHASING group, has been introduced to provide a structured description of phasing strategies, as macromolecular crystallography differs strongly from small-molecule crystallography in how phases are determined. The data model for phasing in the current version of the mmCIF dictionary cannot yet describe all approaches to phasing. Additions and revisions to the data items in the PHASING group of categories are anticipated in future versions of the dictionary.

3.6.6.1. Phasing


The categories describing phasing are as follows:

PHASING group
Overall description of phasing (§3.6.6.1.1)
 PHASING
Phasing via molecular averaging (§3.6.6.1.2)
 PHASING_AVERAGING
Phasing via isomorphous replacement (§3.6.6.1.3)
 PHASING_ISOMORPHOUS
Phasing via multiple-wavelength anomalous dispersion (§3.6.6.1.4)
 PHASING_MAD
 PHASING_MAD_CLUST
 PHASING_MAD_EXPT
 PHASING_MAD_RATIO
 PHASING_MAD_SET
Phasing via multiple isomorphous replacement (§3.6.6.1.5)
 PHASING_MIR
 PHASING_MIR_DER
 PHASING_MIR_DER_REFLN
 PHASING_MIR_DER_SHELL
 PHASING_MIR_DER_SITE
 PHASING_MIR_SHELL
Phasing data sets (§3.6.6.1.6)
 PHASING_SET
 PHASING_SET_REFLN

The data items in the PHASING category group can be used to record details about the phasing of the structure and cover the various methods used in the phasing process. Many data items are provided for multiple isomorphous replacement (MIR) and multiple-wavelength anomalous dispersion (MAD). More limited sets of data items are provided for phasing using molecular averaging and for phasing using a structure that is isomorphous to the present structure. The current version of the mmCIF dictionary does not provide specific data items for recording the details of phasing via molecular replacement.

3.6.6.1.1. Overall description of phasing


The single data item in this category is as follows:

PHASING [Scheme scheme38]

The bullet (•) indicates a category key.

Phasing of macromolecular structures often involves the application of more than one of the methods described in the PHASING section of the mmCIF dictionary, such as when phases generated from a multiple isomorphous replacement experiment are improved by molecular averaging. The PHASING category is used to list the methods that were used.

At present, the category contains a single data item, the purpose of which is to specify the method employed in the structure determination. It may have one or more of the values listed in the dictionary (Example 3.6.6.1).

Example 3.6.6.1. The methods used to generate the phases for a hypothetical structure described with the data item in the PHASING category.

[Scheme scheme40]
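As a rough illustration of how a list of phasing methods might be written out as a CIF loop, the following Python sketch formats one value of _phasing.method per row. The helper name format_phasing_loop is hypothetical, and the codes 'mir' and 'averaging' are assumed values chosen to match the MIR-plus-averaging case described above; the permitted values should be taken from the dictionary itself.

```python
def format_phasing_loop(methods):
    """Return a CIF loop_ listing one _phasing.method value per row."""
    if not methods:
        raise ValueError("at least one phasing method is required")
    # _phasing.method is the category key, so each value may appear only once.
    if len(set(methods)) != len(methods):
        raise ValueError("duplicate phasing method")
    lines = ["loop_", "_phasing.method"]
    lines.extend(f"  '{m}'" for m in methods)
    return "\n".join(lines)


# Assumed codes for phases from MIR that were improved by molecular averaging.
print(format_phasing_loop(["mir", "averaging"]))
```

The uniqueness check reflects the role of the single data item as the category key; a real application would also validate each value against the enumeration in the dictionary.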

3.6.6.1.2. Phasing via molecular averaging


The data items in this category are as follows:

PHASING_AVERAGING [Scheme scheme39]

The bullet (•) indicates a category key. The arrow (→) is a reference to a parent data item.

When more than one copy of a molecule is present in the asymmetric unit, phases can be improved by averaging an electron-density map over the multiple images of the molecule. In some special cases with very high noncrystallographic symmetry, de novo phases have been derived by iterative application of molecular averaging, but more often averaging is used to improve phases determined by another method.

The protocols used for phasing with averaging are many and varied, so it was not thought appropriate to specify data items for any one approach in the current version of the mmCIF dictionary. The data items that are provided allow a text-based description of the protocol to be given; a formalism for recording a fully parsable description of molecular averaging needs to be developed for future revisions of the dictionary.

Data items in the PHASING_AVERAGING category allow free-text descriptions to be given of the method used for structure determination or phase improvement by averaging over multiple observations of the molecule in the asymmetric unit, and of any specific details of the application of the method to the current structure determination (Example 3.6.6.2). Note that the data item for the method is to be used to describe the method itself, and not to reference a software package; references to software packages would be made using data items in the SOFTWARE category.

Example 3.6.6.2. Phase improvement with molecular averaging for a hypothetical structure described with data items in the PHASING_AVERAGING category.

[Scheme scheme41]

3.6.6.1.3. Phasing via isomorphous replacement


The data items in this category are as follows:

PHASING_ISOMORPHOUS [Scheme scheme42]

The bullet (•) indicates a category key. The arrow (→) is a reference to a parent data item.

Phases for many macromolecular structures are obtained from a previous determination of the same structure in the same crystal lattice. Examples of this are the determination of the structure of a point mutant or the determination of a structure in which a ligand is bound to an active site that was empty in the previous structure determination. In these cases, the new structure is essentially isomorphous with the parent structure, hence this method of phasing is termed 'isomorphous phasing' in the mmCIF dictionary. It is not to be confused with multiple isomorphous replacement (MIR), a phasing technique that involves the use of heavy-atom derivatives. MIR phasing is discussed in Section 3.6.6.1.5.

Little information is needed to characterize isomorphous phasing. The 'parent' structure (the structure used to generate the initial phases for the present structure) is described in a free-text field, and a second free-text field can be used to give details of the application of the method to the determination of the present structure (for instance, the removal of solvent or a bound ligand). In Example 3.6.6.3, the parent structure is the PDB entry 5HVP and the structure that is the subject of the present data block is identified as 'HVP+CmpdA'. _phasing_isomorphous.method allows any formal techniques that were used in the application of the method to the present structure determination to be described, for example rigid-body refinement. Note that this data item is not to be used to reference a software package; this would be done using data items in the SOFTWARE category.

Example 3.6.6.3. Isomorphous replacement phasing of an HIV-1 protease structure described using data items in the PHASING_ISOMORPHOUS category.

[Scheme scheme43]

3.6.6.1.4. Phasing via multiple-wavelength anomalous dispersion


The data items in these categories are as follows:

(a) PHASING_MAD [Scheme scheme44]

(b) PHASING_MAD_CLUST [Scheme scheme45]

(c) PHASING_MAD_EXPT [Scheme scheme46]

(d) PHASING_MAD_RATIO [Scheme scheme47]

(e) PHASING_MAD_SET [Scheme scheme48]

The bullet (•) indicates a category key. Where multiple items within a category are marked with a bullet, they must be taken together to form a compound key. The arrow (→) is a reference to a parent data item.

PHASING_MAD and related categories are used to provide information about phasing using the multiple-wavelength anomalous dispersion (MAD) technique. The data model used for MAD phasing in the current version of the mmCIF dictionary is that of Hendrickson, as exemplified in the structure determination of N-cadherin (Shapiro et al., 1995; Example 3.6.6.4). In current practice, MAD phasing is often treated as a special case of MIR phasing and the PHASING_MIR categories would be more appropriate to describe the results.

Example 3.6.6.4. MAD phasing of the structure of N-cadherin (Shapiro et al., 1995) described using data items in the PHASING_MAD and related categories.

[Scheme scheme49]

In contrast to the PHASING_MIR categories, the current mmCIF model of MAD phasing makes no provision for analysis of the overall phasing statistics or of the contribution of each data set to the phasing by bins of resolution, and no provision for giving a list of the phased reflections. This will need to be addressed in future versions of the mmCIF dictionary.

The relationships between categories describing MAD phasing are shown in Fig. 3.6.6.1.

[Figure 3.6.6.1]

Figure 3.6.6.1. The family of categories used to describe MAD phasing. Boxes surround categories of related data items. Data items that serve as category keys are preceded by a bullet (•). Lines show relationships between linked data items in different categories with arrows pointing at the parent data items.

Data items in the PHASING_MAD category can be used to give a brief overview of the method that was used and to note special aspects of the phasing strategy; they are analogous to the data items in the other overview categories describing phasing techniques.

In the data model for MAD phasing used in the present version of the mmCIF dictionary, a collection of data sets measured at different wavelengths can be used to construct more than one set of phases. These phase sets will produce electron-density maps with different local properties. The model of the structure is often constructed using information from a collection of these maps. The collections of multiple phase sets are referred to as 'experiments' and the groups of data sets that contribute to each experiment are referred to as 'clusters'. Data items in PHASING_MAD_EXPT identify each experiment and give the number of contributing clusters. Additional data items record the phase difference between the structure factors due to normal scattering from all atoms and from only the anomalous scatterers, the standard uncertainty of this quantity, the mean figure of merit, and a number of other indicators of the quality of the phasing.

Data items in the PHASING_MAD_CLUST category can be used to label the clusters of data sets and give the number of data sets allocated to each cluster. In Example 3.6.6.4 two experiments are described. The first experiment contains two clusters, the first of which contains four data sets and the second of which contains five data sets. The second experiment contains a single cluster of five data sets. Note that the author has chosen informative labels to identify the clusters ('four wavelength', 'five wavelength'). Carefully chosen labels can help someone reading the mmCIF to trace the complex relationships between the categories.
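The experiment and cluster layout just described can be sketched as a small data structure. This is only an illustration in Python: the class and attribute names echo, but are not guaranteed to match, the dictionary's data names, and the experiment identifiers '1' and '2' and the label of the single cluster in the second experiment are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    label: str       # informative label identifying the cluster
    number_set: int  # number of data sets allocated to the cluster


@dataclass
class Experiment:
    expt_id: str
    clusters: list = field(default_factory=list)

    @property
    def number_clust(self):
        # Number of clusters contributing to this experiment.
        return len(self.clusters)


# Layout described in the text; identifiers and the second label are assumed.
experiments = [
    Experiment("1", [Cluster("four wavelength", 4), Cluster("five wavelength", 5)]),
    Experiment("2", [Cluster("five wavelength", 5)]),
]

for expt in experiments:
    total_sets = sum(c.number_set for c in expt.clusters)
    print(expt.expt_id, expt.number_clust, total_sets)
```

Deriving the cluster count from the list of clusters, rather than storing it separately, is one way of keeping the count recorded for an experiment consistent with the clusters that point to it.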

Data items in the PHASING_MAD_RATIO category can be used to record the ratios of phasing statistics (Bijvoet differences) between pairs of data sets in a MAD phasing experiment, within shells of resolution characterized by _phasing_MAD_ratio.d_res_high and *.d_res_low.

The data sets used in the MAD phasing experiments are described using data items in the PHASING_MAD_SET category. Each data set is characterized by resolution shell and wavelength, and by the f′ and f″ components of the anomalous scattering factor at that wavelength. The actual observations in each data set and the experimental conditions under which they were made are recorded using data items in the PHASING_SET and PHASING_SET_REFLN categories.

3.6.6.1.5. Phasing via multiple isomorphous replacement


The data items in these categories are as follows:

(a) PHASING_MIR [Scheme scheme50]

(b) PHASING_MIR_SHELL [Scheme scheme51]

(c) PHASING_MIR_DER [Scheme scheme52]

(d) PHASING_MIR_DER_REFLN [Scheme scheme53]

(e) PHASING_MIR_DER_SHELL [Scheme scheme54]

(f) PHASING_MIR_DER_SITE [Scheme scheme55]

The bullet (•) indicates a category key. Where multiple items within a category are marked with a bullet, they must be taken together to form a compound key. The arrow (→) is a reference to a parent data item. Data items marked with a plus (+) have companion data names for the standard uncertainty in the reported value, formed by appending the string _esd to the data name listed.

PHASING_MIR and related categories provide information about phasing by methods involving multiple isomorphous replacement (MIR). These same categories may also be used to describe phasing by related techniques, such as single isomorphous replacement (SIR) and single or multiple isomorphous replacement plus anomalous scattering (SIRAS, MIRAS). The relationships between the categories describing MIR phasing are shown in Fig. 3.6.6.2.

[Figure 3.6.6.2]

Figure 3.6.6.2. The family of categories used to describe MIR phasing. Boxes surround categories of related data items. Data items that serve as category keys are preceded by a bullet (•). Lines show relationships between linked data items in different categories with arrows pointing at the parent data items.

As with the other overview categories described in this section, the PHASING_MIR category contains data items that can be used for text-based descriptions of the method used and any special aspects of its application. There are also items for describing the resolution limit of the reflections that were phased, the figures of merit for all reflections and for the acentric reflections phased in the native data set, and the total numbers of reflections and their inclusion threshold in the native data set. Statistics for the phasing can be given by shells of resolution using data items in the PHASING_MIR_SHELL category.

An MIR phasing experiment involves one or more derivatives. The remaining categories in this group are used to describe aspects of each derivative (Example 3.6.6.5). A derivative in this context does not necessarily correspond to a data set; for instance, the same data set could be used to one resolution limit as an isomorphous scatterer and to a different resolution (and with a different sigma cutoff) as an anomalous scatterer. These would be treated as two distinct derivatives, although both derivatives would point to the same data sets via _phasing_MIR_der.der_set_id and _phasing_MIR_der.native_set_id (see Fig. 3.6.6.2).

Example 3.6.6.5. Phasing of the structure of bovine plasma retinol-binding protein (Zanotti et al., 1993) described using data items in the PHASING_MIR and related categories.

[Scheme scheme56]

Data items in the PHASING_MIR_DER category can be used to identify and describe each derivative. The resolution limits for the individual derivatives need not match those of the overall phasing experiment, as the phasing power of each derivative as a function of resolution will vary. Many of the statistical descriptors of phasing given in the PHASING_MIR category are repeated in this category, as derivatives vary in quality and their contribution to the phasing must be assessed individually. These same statistical measures can be given for shells of resolution in the PHASING_MIR_DER_SHELL category.

Data items in the PHASING_MIR_DER_REFLN category can be used to provide details of each reflection used in an MIR phasing experiment. The pointer _phasing_MIR_der_refln.set_id links the reflection to a particular set of experimental data and _phasing_MIR_der_refln.der_id points to a particular derivative used in the phasing (as mentioned above, derivatives in this context do not equate to data sets). The phase assigned to each reflection and the measured and calculated values of its structure factor can be given. (It is not necessary to include the measured values of the structure factors in this list, since they are accessible in the PHASING_SET_REFLN category, but it may be convenient to present them here.) Data items are also provided for the A, B, C and D phasing coefficients of Hendrickson & Lattman (1970).

The heavy atoms identified in each derivative can be listed using data items in the PHASING_MIR_DER_SITE category. Most of the data names are clear analogues of similar items in the ATOM_SITE category; an exception is _phasing_MIR_der_site.occupancy_anom, which specifies the relative anomalous occupancy of the atom type present at a heavy-atom site in a particular derivative.

3.6.6.1.6. Phasing data sets


The data items in these categories are as follows:

(a) PHASING_SET [Scheme scheme57]

(b) PHASING_SET_REFLN [Scheme scheme58]

The bullet (•) indicates a category key. Where multiple items within a category are marked with a bullet, they must be taken together to form a compound key. The arrow (→) is a reference to a parent data item.

Data items in the PHASING_SET family of categories are homologous to items with related names in the CELL and DIFFRN families of categories. The PHASING_SET categories were added to the mmCIF data model so that intensity and phase information for the data sets used in phasing could be stored in the same data block as the information for the refined structure. It is not necessary to store all the experimental information for each data set (e.g. the raw data sets or crystal growth conditions); it is assumed that the full experimental description of each phasing set would be recorded in a separate data block (see Example 3.6.6.6).

Example 3.6.6.6. The phasing sets used in the structure determination of bovine plasma retinol-binding protein (Zanotti et al., 1993) described with data items in the PHASING_SET and PHASING_SET_REFLN categories.

[Scheme scheme59]

Data items in the PHASING_SET category identify each set of diffraction data used in a phasing experiment and can be used to summarize relevant experimental conditions. Because a given data set may be used in a number of different ways (for example, as an isomorphous derivative and as a component of a multiple-wavelength calculation), it is appropriate to store the reflections in a category distinct from either the PHASING_MAD or PHASING_MIR family of categories, but accessible to both these families (and any similar categories that might be introduced later to describe new phasing methods). Figs. 3.6.6.1 and 3.6.6.2 show how reference is made to the relevant sets from within the PHASING_MAD and PHASING_MIR categories.

Each phasing set is given a unique value of _phasing_set.id. The other PHASING_SET data items record the cell dimensions and angles associated with each phasing set, the wavelength of the radiation used in the experiment, the source of the radiation, the detector type, and the ambient temperature.

Data items in the PHASING_SET_REFLN category are used to record the values of the measured structure factors and their uncertainties. Several distinct data sets may be present in this list, with reflections in each set identified by the appropriate value of _phasing_set_refln.set_id.
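A reader of such a list typically needs to separate the reflections by phasing set. The Python sketch below groups rows on the set identifier; the row layout (set identifier, indices, measured structure factor and its uncertainty), the set labels 'native' and 'deriv1', and all numerical values are invented purely for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (set_id, h, k, l, F_meas, F_meas_sigma). Values are invented.
rows = [
    ("native", 1, 0, 0, 120.5, 2.1),
    ("native", 1, 1, 0, 85.2, 1.8),
    ("deriv1", 1, 0, 0, 131.0, 2.4),
]


def group_by_set(rows):
    """Collect reflections under the phasing set that each row points to."""
    groups = defaultdict(list)
    for set_id, *rest in rows:
        groups[set_id].append(tuple(rest))
    return dict(groups)


by_set = group_by_set(rows)
print({set_id: len(refl) for set_id, refl in by_set.items()})
```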

3.6.6.2. Refinement


The categories describing refinement are as follows:

REFINE group
Overall description of the refinement (§3.6.6.2.1)
 REFINE
 REFINE_FUNCT_MINIMIZED
Analysis of the refined structure (§3.6.6.2.2)
 REFINE_ANALYZE
Restraints and refinement by shells of resolution (§3.6.6.2.3)
 REFINE_LS_RESTR
 REFINE_LS_RESTR_NCS
 REFINE_LS_RESTR_TYPE
 REFINE_LS_SHELL
 REFINE_LS_CLASS
Equivalent atoms in the refinement (§3.6.6.2.4)
 REFINE_B_ISO
 REFINE_OCCUPANCY
History of the refinement (§3.6.6.2.5)
 REFINE_HIST

The macromolecular CIF dictionary contains many more data items for describing the refinement process than the core CIF dictionary does. In addition to new items in the REFINE category itself, additional categories have been introduced to describe in great detail the function minimized and the restraints applied, and the history of the refinement process, which often has many cycles. The REFINE_ANALYZE category can be used to give details of many of the quantities that may be used to assess the quality of the refinement. The REFINE_LS_SHELL category allows results to be reported by shells of resolution, and in effect replaces the more general core CIF category REFINE_LS_CLASS.

3.6.6.2.1. Overall description of the refinement


The data items in these categories are as follows:

(a) REFINE [Scheme scheme60]

(b) REFINE_FUNCT_MINIMIZED [Scheme scheme61]

The bullet (•) indicates a category key. The arrow (→) is a reference to a parent data item. Items in italics have aliases in the core CIF dictionary formed by changing the full stop (.) to an underscore (_) except where indicated by the ~ symbol. Data items marked with a plus (+) have companion data names for the standard uncertainty in the reported value, formed by appending the string _esd to the data name listed.
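The two naming rules in this legend are simple string transformations, sketched below in Python. The function names are hypothetical, the alias rule covers only the regular case (items flagged as exceptions need a lookup in the dictionary), and the data names are used purely as illustrative inputs.

```python
def core_alias(mmcif_name):
    """Core CIF alias for the regular case: the full stop becomes an underscore."""
    category, sep, item = mmcif_name.partition(".")
    if not sep:
        raise ValueError(f"not an mmCIF-style data name: {mmcif_name!r}")
    return f"{category}_{item}"


def esd_name(data_name):
    """Companion name for the standard uncertainty: append '_esd'."""
    return data_name + "_esd"


print(core_alias("_refine.ls_R_factor_obs"))
print(esd_name("_refine.ls_extinction_coef"))
```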

There is already an extensive set of data names in the REFINE category of the core dictionary, and Section 3.2.3.1 should be read with the present section. The only data items discussed in this section are entries in the mmCIF dictionary that do not have a counterpart in the core CIF dictionary. Analogues of a number of R factors in the core CIF dictionary have been added to the mmCIF dictionary to express these same R factors independently for the free and working sets of reflections. The remaining new data items have more specialized roles, which are discussed below.

The data item _refine.entry_id has been added to the REFINE category to provide the formal category key required by the DDL2 data model.

Many macromolecular structure refinements now use the statistical cross-validation technique of monitoring a 'free' R factor (Brünger, 1997). Rfree is calculated the same way as the conventional least-squares R factor, but using a small subset of reflections that are not used in the refinement of the structural model. Thus Rfree tests how well the model predicts experimental observations that are not themselves used to fit the model.
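The calculation can be sketched in Python as follows, assuming the conventional unweighted definition R = Σ||Fo| − |Fc|| / Σ|Fo| and assuming that observed and calculated amplitudes are already on a common scale. The function names and the input layout are hypothetical, and the numbers in the usage lines are invented.

```python
def r_factor(pairs):
    """Unweighted R = sum(| |Fo| - |Fc| |) / sum(|Fo|) over (Fo, Fc) pairs."""
    pairs = list(pairs)
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in pairs)
    denominator = sum(abs(fo) for fo, _ in pairs)
    if denominator == 0:
        raise ValueError("no reflections, or sum of |Fo| is zero")
    return numerator / denominator


def r_work_and_free(reflections):
    """reflections: iterable of (f_obs, f_calc, is_free) tuples."""
    reflections = list(reflections)
    work = [(fo, fc) for fo, fc, is_free in reflections if not is_free]
    free = [(fo, fc) for fo, fc, is_free in reflections if is_free]
    return r_factor(work), r_factor(free)


# Invented amplitudes; the last flag marks membership of the free set.
data = [(100.0, 90.0, False), (50.0, 55.0, False),
        (80.0, 60.0, True), (20.0, 20.0, True)]
r_work, r_free = r_work_and_free(data)
print(round(r_work, 3), round(r_free, 3))
```

The same formula is applied to both subsets; the only difference is which reflections contribute, which is what makes the free value an independent check on the model.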

The mmCIF dictionary provides data names for Rfree and for the complementary Rwork values for the 'working' set of reflections, which are the reflections that are used in the refinement. Separate data items are provided for unweighted and weighted versions of each R factor. A fixed percentage of the total number of reflections is usually assigned to the free group, and this percentage can be specified. Further details about the method used for selecting the free reflections can be given using _reflns.R_free_details. The estimated error in the Rfree value may also be given, along with the method used for determining its value.

The purposes of having a set of reflections that are not used in the refinement are to monitor the progress of the refinement and to ensure that the R factor is not being artificially reduced by the introduction of too many parameters. However, as the refinement converges, the working and free R factors both approach stable values. It is common practice, particularly in structures at high resolution, to stop monitoring Rfree at this point and to include all the reflections in the final rounds of refinement. It is thus worth noting a distinction between _refine.ls_R_factor_obs and _refine.ls_R_factor_R_work: _refine.ls_R_factor_obs relates to a refinement in which all reflections more intense than a specified threshold were used, while _refine.ls_R_factor_R_work relates to a refinement in which a subset of the observed reflections was excluded from the refinement and used to calculate the free R factor. The dictionary allows the use of both values if a free R factor was calculated for most of the refinement, but all of the observed reflections were used in the final rounds of refinement; the protocol for this may be explained in _refine.details. When a full history of the refinement is provided using data items in the REFINE_HIST category, it is preferable to specify a change in protocol using data items in that category.

Other data items help to provide an assessment of the quality of the refinement. The scale-independent correlation coefficient between the observed and calculated structure factors may be recorded for the reflections included in the refinement using the data item _refine.correlation_coeff_Fo_to_Fc. There is a similar data item for the reflections that were not included in the refinement.
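The scale independence of such a coefficient can be illustrated with a short Python sketch. It assumes that the quantity is the ordinary Pearson correlation between observed and calculated amplitudes; the exact expression should be taken from the dictionary definition. The function name and the amplitudes are hypothetical.

```python
import math


def correlation_coeff(f_obs, f_calc):
    """Pearson correlation between two equal-length lists of amplitudes."""
    n = len(f_obs)
    if n != len(f_calc) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_o = sum(f_obs) / n
    mean_c = sum(f_calc) / n
    cov = sum((o - mean_o) * (c - mean_c) for o, c in zip(f_obs, f_calc))
    var_o = sum((o - mean_o) ** 2 for o in f_obs)
    var_c = sum((c - mean_c) ** 2 for c in f_calc)
    if var_o == 0 or var_c == 0:
        raise ValueError("amplitudes must not all be identical")
    return cov / math.sqrt(var_o * var_c)


# Invented amplitudes; rescaling Fc leaves the coefficient unchanged.
fo = [10.0, 20.0, 30.0, 40.0]
fc = [12.0, 18.0, 33.0, 41.0]
cc = correlation_coeff(fo, fc)
cc_rescaled = correlation_coeff(fo, [2.5 * c for c in fc])
print(abs(cc - cc_rescaled) < 1e-12)
```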

Overall standard uncertainties for positional and displacement parameters can be recorded according to a number of conventions. A maximum-likelihood residual for the positional parameters can be given using _refine.overall_SU_ML and the corresponding value for the displacement parameters can be given using _refine.overall_SU_B. Diffraction-component precision indexes for the displacement parameters based on the crystallographic R factor (the Cruickshank DPI; Cruickshank, 1999) can be given using _refine.overall_SU_R_Cruickshank_DPI. The corresponding value for Rfree can be given using _refine.overall_SU_R_free.

The quality of a data set used for the refinement of a macromolecular structure is often given not only in terms of the scaling residuals, but also in terms of the data redundancy (the ratio of the number of reflections measured to the number of crystallographically unique reflections). Data items are provided to express the redundancy of all reflections, as well as those that have been marked as 'observed' (i.e. exceeding the threshold for inclusion in the refinement). The percentage of the total number of reflections that are considered observed is another metric of the quality of the data set, and a data item is provided for this (_refine.ls_percent_reflns_obs).
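Both quantities are simple ratios, as the following Python sketch shows. The function names are hypothetical and the reflection counts are invented; the definitions follow the wording of the paragraph above rather than any particular program.

```python
def redundancy(n_measured, n_unique):
    """Ratio of measured reflections to crystallographically unique reflections."""
    if n_unique <= 0:
        raise ValueError("number of unique reflections must be positive")
    return n_measured / n_unique


def percent_observed(n_observed, n_total):
    """Percentage of the total number of reflections considered observed."""
    if n_total <= 0:
        raise ValueError("total number of reflections must be positive")
    return 100.0 * n_observed / n_total


# Invented counts for illustration only.
print(redundancy(45000, 15000))
print(percent_observed(13500, 15000))
```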

The limited resolution of many macromolecular data sets makes it inappropriate to refine anisotropic displacement factors for each atom. For these low- to medium-resolution studies, an overall anisotropic displacement model may be refined. The data items _refine.aniso_B* are provided for recording the unique elements of the matrix that describes the refined anisotropy.
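Because the matrix is symmetric, only six of its nine elements are unique. The Python sketch below generates the corresponding data names, assuming the bracketed index form _refine.aniso_B[i][j] with i ≤ j, and expands a set of unique values into the full matrix. The helper names and the numerical values are hypothetical.

```python
def aniso_b_names(prefix="_refine.aniso_B"):
    """Names of the six unique elements (upper triangle) of the symmetric 3x3 matrix."""
    return [f"{prefix}[{i}][{j}]" for i in range(1, 4) for j in range(i, 4)]


def full_matrix(unique):
    """Expand {(i, j): value} given for i <= j into a symmetric 3x3 list of lists."""
    return [[unique[(min(i, j), max(i, j))] for j in range(1, 4)]
            for i in range(1, 4)]


names = aniso_b_names()
print(names)

# Invented values, used only to show the symmetric expansion.
matrix = full_matrix({(1, 1): 1.0, (1, 2): 0.1, (1, 3): 0.2,
                      (2, 2): 2.0, (2, 3): 0.3, (3, 3): 3.0})
```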

The two-parameter method for modelling the contribution of the bulk solvent to the scattering proposed by Tronrud is used in several refinement programs. The data items _refine.solvent_model_* can be used to record the scale and displacement factors of this model, and any special aspects of its application to the refinement.

The average phasing figure of merit can be given for the working and free reflections. Unusually high or low values of displacement factors or occupancies can be a sign of problems with the refinement, so data items are provided to record the high, low and mean values of each. Further indicators of the quality of the refinement are found in the REFINE_ANALYZE category (Section 3.6.6.2.2).

The data items in the REFINE_FUNCT_MINIMIZED category allow a brief description of the function minimized during refinement to be given (Example 3.6.6.7). It is not possible to reconstruct the function minimized during the refinement by automatic parsing of the values of these data items, but the details given in them may still be helpful to someone reading the mmCIF.

Example 3.6.6.7. Results of the overall refinement of an HIV-1 protease structure (PDB 5HVP) described using data items in the REFINE and REFINE_FUNCT_MINIMIZED categories.

[Scheme scheme62]

3.6.6.2.2. Analysis of the refined structure


The data items in this category are as follows:

REFINE_ANALYZE [Scheme scheme63]

The bullet (•) indicates a category key. The arrow (→) is a reference to a parent data item.

In small-molecule crystallography, there is general agreement on the metrics that should be used to assess the quality of a structure determination, and data items in the REFINE category of the core CIF dictionary can be used to record them. For macromolecular structure determinations, no such agreement has been achieved yet and new metrics are frequently suggested as the field evolves. The REFINE_ANALYZE category can be used to record the metrics that were in common use at the time that the mmCIF dictionary was constructed; it is anticipated that new metrics will be added in future versions of the dictionary, and that some of the current metrics may fall into disuse.

Luzzati (1952) devised a method for estimating the average positional shift that would be needed in an idealized refinement to reach an R factor of zero by using a plot of R factors against resolution. For some time, macromolecular crystallographers have used a modification of this approach to assess the average positional error. Recent practice has used Luzzati plots based on the free R values to yield a cross-validated error estimate. Data items are provided for recording these coordinate-error estimates and the range of resolution included in the plot (Example 3.6.6.8). Related data names allow the specification of the value of σ_a used in constructing the Luzzati plot.

Example 3.6.6.8. Aspects of the refinement of an HIV-1 protease structure (PDB 5HVP) described with data items in the REFINE_ANALYZE category.

[Scheme scheme65]

A general feature of introducing more parameters in the model of the structure is a reduction in the R factor, but the statistical significance of this is often obscured by the simultaneous reduction in the ratio of observations to parameters. Attempts to extend Hamilton's (1965) test to macromolecular structures are usually confounded by the use of restraints. Tickle et al. (1998) proposed the use of a Hamilton generalized R factor analyzed separately for reflections in the working set (those used in the refinement) and for reflections in the free set (those set aside for cross validation), and these metrics are often reported in the literature. Data items are provided for recording the Hamilton generalized R factor for the working and free set of reflections, and for the ratio of the two.

Other indicators of a successful refinement involve the relative order of the model. Data items are provided for recording the sum of the occupancies of the hydrogen and non-hydrogen atoms in the model. The number of disordered residues may also be recorded.

3.6.6.2.3. Restraints and refinement by shells of resolution


The data items in these categories are as follows:

(a) REFINE_LS_RESTR [Scheme scheme64]

(b) REFINE_LS_RESTR_NCS [Scheme scheme66]

(c) REFINE_LS_RESTR_TYPE [Scheme scheme67]

(d) REFINE_LS_SHELL [Scheme scheme68]

(e) REFINE_LS_CLASS [Scheme scheme69]

The bullet (•) indicates a category key. Where multiple items within a category are marked with a bullet, they must be taken together to form a compound key. The arrow (→) is a reference to a parent data item.

These categories were introduced in the mmCIF dictionary to allow a detailed description of several aspects of structure refinement to be given. Data items in the REFINE_LS_RESTR category allow geometric restraints to be specified and the deviations of restrained parameters from ideal values in the final model to be given. The type of the geometric restraints can be described in more detail using data items in the REFINE_LS_RESTR_TYPE category. Data items in the REFINE_LS_RESTR_NCS category can be used to give information about any restraints on noncrystallographic symmetry used in the refinement. The category REFINE_LS_SHELL contains data items that allow the results of refinement to be given by shells of resolution.

Data items in the REFINE_LS_RESTR category can be used to record details about the restraints applied to various classes of parameters during least-squares refinement (Example 3.6.6.9). It is useful to tabulate the various classes of restraint, their deviation from ideal target values and the criteria used to reject parameters that lie too far from a target, as these data are often published as part of a description of the refinement and are often deposited with the coordinates in an archive. However, the types of restraints applied depend strongly on the software package used, and as new refinement packages regularly become available, it was not advisable to provide program-specific data items in the mmCIF dictionary. The approach taken in the mmCIF dictionary has been to allow the value of _refine_ls_restr.type to be a free-text field, so that arbitrary labels can be given to restraints that are particular to a software package, but to recommend the use of specific labels for restraints applied by particular programs. The dictionary provides examples for labels specific to the programs PROTIN/PROLSQ (Hendrickson & Konnert, 1979) and RESTRAIN (Driessen et al., 1989). These program-specific representations have particular prefixes; thus the value p_bond_d is a bond-distance restraint as applied by PROTIN/PROLSQ. Values for _refine_ls_restr.type appropriate for other refinement programs may be suggested in future versions of the mmCIF dictionary.

Example 3.6.6.9. Results of the refinement of an HIV-1 protease structure (PDB 5HVP) described with data items in the REFINE_LS_RESTR and REFINE_LS_SHELL categories.

[Scheme scheme70]
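The prefix convention for _refine_ls_restr.type can be sketched in Python as follows. Only the p_ prefix for PROTIN/PROLSQ is taken from the text; the RESTRAIN_ prefix, the function names and the comparison of a deviation against its target are assumptions made for illustration.

```python
# Map label prefixes to the program that applied the restraint.
# "p_" is stated in the text; "RESTRAIN_" is an assumed prefix.
PROGRAM_PREFIXES = {
    "p_": "PROTIN/PROLSQ",
    "RESTRAIN_": "RESTRAIN",
}


def classify_restraint(type_label):
    """Return (program, label without prefix).

    The program is None when the free-text label carries no known prefix.
    """
    for prefix, program in PROGRAM_PREFIXES.items():
        if type_label.startswith(prefix):
            return program, type_label[len(prefix):]
    return None, type_label


def exceeds_target(dev_ideal, dev_ideal_target):
    """True if the observed deviation from ideal is larger than its target."""
    return dev_ideal > dev_ideal_target
```

Because the field is free text, a label with no recognized prefix is returned unchanged rather than rejected.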

Data items in the REFINE_LS_RESTR_TYPE category can be used to specify the ranges within which quantities are allowed to vary for each type of restraint. The special value indicated by a full stop (.) represents a restraint unbounded on the high or low side.
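The treatment of the full stop as an open bound can be sketched in Python as follows. The function names are hypothetical, and the inputs are assumed to be the raw text values of a pair of low and high cutoff items as read from a file.

```python
def parse_bound(text):
    """Convert a CIF value to a float; '.' means no bound on that side."""
    return None if text.strip() == "." else float(text)


def within_range(value, low_text, high_text):
    """Check a quantity against a range whose ends may be unbounded."""
    low = parse_bound(low_text)
    high = parse_bound(high_text)
    if low is not None and value < low:
        return False
    if high is not None and value > high:
        return False
    return True
```

Treating the bounds as inclusive is a choice made for the sketch, not something the dictionary is known to specify.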

Data items in the REFINE_LS_RESTR_NCS category can be used to record details about the restraints applied to atom positions in domains related by noncrystallographic symmetry during least-squares refinement, and also to record the deviation of the restrained atomic parameters at the end of the refinement. The domains related by noncrystallographic symmetry are defined in the STRUCT_NCS_DOM and related categories (see Section 3.6.7.5.5). The quantities that can be recorded for each restrained domain are the root-mean-square deviations of the displacement and positional parameters, and the weighting coefficients used in the noncrystallographic restraint of each type of parameter. Any special aspects of the way the restraints were applied may be described using _refine_ls_restr_ncs.ncs_model_details.

Data items in the REFINE_LS_SHELL category are used to summarize details of the results of the least-squares refinement by shells of resolution (Example 3.6.6.9). The resolution range, in ångströms, forms the category key; for each shell the quantities reported, such as the number of reflections above the threshold for counting as significantly intense, are all defined in the same way as the corresponding data items used to describe the results of the overall refinement in the REFINE category.
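Because the resolution range identifies each shell, a reflection can be assigned to a shell from its d spacing alone. The following Python sketch assumes that each shell is held as a (d_res_low, d_res_high) pair in ångströms, with d_res_high the smaller number; the function name and the inclusive treatment of shell boundaries are assumptions.

```python
def find_shell(d_spacing, shells):
    """Return the first (d_res_low, d_res_high) shell containing d_spacing.

    Returns None if the spacing lies outside every shell. A spacing on a
    shared boundary is assigned to the first shell listed.
    """
    for d_low, d_high in shells:
        if d_high <= d_spacing <= d_low:
            return (d_low, d_high)
    return None
```

The shell limits used when calling this function would be taken from the file being read; none are implied here.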

The core dictionary category REFINE_LS_CLASS was introduced after the release of the first version of the mmCIF dictionary. It provides a more general way of describing the treatment of particular subsets of the observations, but it is not expected to be used in macromolecular structural studies, where partition by shells of resolution is traditional.

3.6.6.2.4. Equivalent atoms in the refinement

The data items in these categories are as follows:

(a) REFINE_B_ISO [Scheme scheme71]

(b) REFINE_OCCUPANCY [Scheme scheme72]

The bullet (•) indicates a category key.

In macromolecular structure refinement, displacement factors or occupancies are often treated as equivalent for groups of atoms. An example would be the case where most of the atoms in the structure are refined with isotropic displacement factors, but a bound metal atom is allowed to refine anisotropically. Another example would be where the occupancies for all of the atoms in the protein part of a macromolecular complex are fixed at 1.0, but the occupancies of atoms in a bound inhibitor are refined. The REFINE_B_ISO and REFINE_OCCUPANCY categories can be used to record this information (Example 3.6.6.10).

Example 3.6.6.10. The handling of displacement factors and occupancies during the refinement of an HIV-1 protease structure (PDB 5HVP) described with data items in the REFINE_B_ISO and REFINE_OCCUPANCY categories.

[Scheme scheme73]

Data items in the REFINE_B_ISO category can be used to record details of the treatment of isotropic B (displacement) factors during refinement. There is no formal link between the classes identified by _refine_B_iso.class and individual atom sites, although relationships may be inferred if the class names are carefully chosen. The category allows the treatment of the atoms in each class (isotropic, anisotropic or fixed) and the value assigned for fixed isotropic B factors to be recorded. Any special details can be given in a free-text field.

Data items in the REFINE_OCCUPANCY category can be used to record details of the treatment of occupancies of groups of atom sites during refinement. As with the treatment of displacement factors in the REFINE_B_ISO category, the classes itemized by _refine_occupancy.class are not formally linked to the individual atom sites, but the relationships may be deduced if the class names are chosen carefully.

3.6.6.2.5. History of the refinement

The data items in this category are as follows:

REFINE_HIST [Scheme scheme75]

The bullet (•) indicates a category key.

Data items in the REFINE_HIST category can be used to record details about the various steps in the refinement of the structure. They do not provide as thorough a description of the refinement as can be given in other categories for the final model, but instead allow a summary of the progress of the refinement to be given and supported by a small set of representative statistics.

The category is sufficiently compact that a large number of cycles could be summarized, but it is not expected that every cycle of refinement would be routinely reported. Example 3.6.6.11 shows an entry for a single cycle of refinement. It is likely that an author would present a representative sequence of entries in a looped list.

Example 3.6.6.11. An example of one cycle of refinement described with data items in the REFINE_HIST category.

[Scheme scheme74]
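A looped list of several cycles could be generated along the following lines. This Python sketch is a hypothetical helper, not part of any CIF library; it handles only simple values without embedded white space, since values containing white space would need CIF quoting. Any item names and values passed to it are the caller's responsibility and should be checked against the dictionary.

```python
def format_loop(category, items, rows):
    """Render rows as a CIF looped list.

    The output is 'loop_', then one data name per line, then one line of
    white-space-separated values per row.
    """
    lines = ["loop_"]
    lines += [f"_{category}.{item}" for item in items]
    for row in rows:
        if len(row) != len(items):
            raise ValueError("row length does not match number of items")
        lines.append(" ".join(str(value) for value in row))
    return "\n".join(lines)
```

For example, format_loop("refine_hist", ["cycle_id", "R_factor_obs"], rows) would write one line per refinement cycle, where rows holds placeholder values supplied by the author.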

3.6.6.3. Reflection measurements

The categories describing the reflections used in the refinement are as follows:

REFLN group
Individual reflections (§3.6.6.3.1)
 REFLN
 REFLN_SYS_ABS
Groups of reflections (§3.6.6.3.2)
 REFLNS
 REFLNS_SCALE
 REFLNS_SHELL
 REFLNS_CLASS

Data items in the REFLN category can be used to give information about the individual reflections that were used to derive the final model. The related category REFLN_SYS_ABS allows the reflections that should be systematically absent for the space group in which the structure was refined to be tabulated. Data items in the REFLNS category can be used to record information that applies to all of the reflections. Scale factors can be listed in the REFLNS_SCALE category, while the data items in REFLNS_SHELL can be used to record information about the reflection set by shells of resolution. The core CIF dictionary category REFLNS_CLASS, which can be used for a general classification of reflection groups according to criteria other than resolution shell, is not expected to be used in mmCIF applications.

3.6.6.3.1. Individual reflections

The data items in these categories are as follows:

(a) REFLN [Scheme scheme76]

(b) REFLN_SYS_ABS [Scheme scheme79]

The bullet (•) indicates a category key. Where multiple items within a category are marked with a bullet, they must be taken together to form a compound key. The arrow (→) is a reference to a parent data item. Items in italics have aliases in the core CIF dictionary formed by changing the full stop (.) to an underscore (_) except where indicated by the ~ symbol.

Data items in the REFLN category are used in the same way in the mmCIF and core CIF dictionaries, and Section 3.2.3.2.1 can be consulted for details. However, in macromolecular crystallography it is not usual for structure factors to be given in units of electrons (the units specified by the core CIF dictionary). Thus it was necessary to introduce in the mmCIF dictionary data items for the magnitudes of structure factors and their A and B components in arbitrary units (Example 3.6.6.12). A figure of merit (_refln.fom) can also be included for reflections that were phased using experimental methods.

Example 3.6.6.12. Part of the reflection list for an HIV-1 protease structure (PDB 5HVP) described with data items in the REFLN category.

[Scheme scheme77]
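The relationship between the A and B components and the structure-factor magnitude can be sketched in Python as follows, assuming the usual convention F = A + iB with the phase expressed in degrees. The function name is hypothetical, and mapping the phase into the range 0 to 360° is a choice made for the sketch.

```python
import math


def amplitude_and_phase(a, b):
    """Return (|F|, phase in degrees) for components A and B of F = A + iB.

    The magnitude is in the same arbitrary units as A and B.
    """
    magnitude = math.hypot(a, b)
    phase = math.degrees(math.atan2(b, a)) % 360.0
    return magnitude, phase
```

Because the scale is arbitrary, magnitudes computed in this way are meaningful only relative to other values on the same scale.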

The REFLN_SYS_ABS category allows the intensities of the reflections that should be systematically absent to be tabulated. The ratio of the intensity to its standard uncertainty, given in the data item _refln_sys_abs.I_over_sigmaI, can be used to assess whether the reflection is indeed absent. The decision as to whether it is absent is left to the user of the mmCIF and is not recorded in the mmCIF.
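Since the decision is left to the user, a reader of the file might screen the tabulated ratios against a threshold of their own choosing. The following Python sketch assumes each row supplies the indices h, k, l and the ratio of the intensity to its standard uncertainty; the function name and the threshold are hypothetical and not prescribed by the dictionary.

```python
def flag_possible_violations(rows, threshold):
    """Return (h, k, l) for rows whose I/sigma(I) exceeds the threshold.

    `rows` is an iterable of (h, k, l, i_over_sigma_i) tuples; the
    threshold is supplied by the user.
    """
    return [(h, k, l) for h, k, l, ratio in rows if ratio > threshold]
```

A reflection returned by this function is only a candidate for closer inspection; whether it is truly present remains a matter of judgement.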

3.6.6.3.2. Groups of reflections

The data items in these categories are as follows:

(a) REFLNS [Scheme scheme80]

(b) REFLNS_SCALE [Scheme scheme81]

(c) REFLNS_SHELL [Scheme scheme82]

(d) REFLNS_CLASS [Scheme scheme83]

The bullet (•) indicates a category key. Where multiple items within a category are marked with a bullet, they must be taken together to form a compound key. The arrow (→) is a reference to a parent data item. Items in italics have aliases in the core CIF dictionary formed by changing the full stop (.) to an underscore (_) except where indicated by the ~ symbol.

Data items in the REFLNS category of the core CIF dictionary can be used to summarize the properties or attributes of the complete set of reflections used in refinement (Section 3.2.3.2.2). The mmCIF dictionary adds a number of data items to this category, including the formal category key required by the DDL2 data model. There are also data items for describing the data-reduction method and recording any relevant details about data reduction, and for giving an estimate of the overall Wilson B factor for the data set.

A number of the new data items relate to the issue of how reflections are flagged as being observed and are thus used in the refinement. In the core CIF dictionary, the criteria used to consider a reflection as being observed are given using the data item _reflns.observed_criterion. This is a free-text field and so cannot be parsed automatically. Therefore it is supplemented in the mmCIF dictionary by data items that can be used to stipulate the criterion in terms of the values of F, I or the uncertainties in these quantities (Example 3.6.6.13). The percentage of the total number of reflections that meet the criterion can be recorded.

Example 3.6.6.13. The data set used in the refinement of an HIV-1 protease structure (PDB 5HVP) described using data items in the REFLNS and REFLNS_SHELL categories.

[Scheme scheme84]
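A numerical criterion of this kind lends itself to automatic evaluation. The following Python sketch assumes a criterion expressed as a multiple of σ(F), so that a reflection counts as observed when F exceeds the cutoff times its standard uncertainty. The function name, the input layout and the use of a strict inequality are assumptions.

```python
def percent_observed(reflections, sigma_f_cutoff):
    """Percentage of reflections with F > sigma_f_cutoff * sigma(F).

    `reflections` is an iterable of (f, sigma_f) pairs.
    """
    reflections = list(reflections)
    if not reflections:
        raise ValueError("no reflections supplied")
    n_obs = sum(1 for f, sigma in reflections if f > sigma_f_cutoff * sigma)
    return 100.0 * n_obs / len(reflections)
```

An analogous function could be written for a criterion expressed in terms of I and σ(I).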

Data items are also provided for describing the selection of the reflections used to calculate the free R factor, and for giving the Rmerge values for all reflections and for the subset of 'observed' reflections. Data items in the REFLNS_SCALE and REFLNS_SHELL categories are used in the same way in the mmCIF and core CIF dictionaries, and Section 3.2.3.2.2 can be consulted for details.

As with the related categories DIFFRN_REFLNS_CLASS and REFINE_LS_CLASS, the core dictionary category REFLNS_CLASS was introduced after the release of the first version of the mmCIF dictionary. It provides a more general way of describing the treatment of particular subsets of the observations, but it is not expected to be used in macromolecular structural studies, where partition by shells of resolution is traditional.

References

Brünger, A. T. (1997). Free R value: cross-validation in crystallography. Methods Enzymol. 277, 366–396.
Cruickshank, D. W. J. (1999). Remarks about protein structure precision. Acta Cryst. D55, 583–601.
Driessen, H., Haneef, M. I. J., Harris, G. W., Howlin, B., Khan, G. & Moss, D. S. (1989). RESTRAIN: restrained structure-factor least-squares refinement program for macromolecular structures. J. Appl. Cryst. 22, 510–516.
Hamilton, W. C. (1965). Significance tests on the crystallographic R factor. Acta Cryst. 18, 502–510.
Hendrickson, W. A. & Konnert, J. H. (1979). Stereochemically restrained crystallographic least-squares refinement of macromolecule structures. In Biomolecular structure, conformation, function and evolution, edited by R. Srinavisan, Vol. I, pp. 43–57. New York: Pergamon Press.
Hendrickson, W. A. & Lattman, E. E. (1970). Representation of phase probability distributions for simplified combination of independent phase information. Acta Cryst. B26, 136–143.
Luzzati, V. (1952). Traitement statistique des erreurs dans la determination des structures cristallines. Acta Cryst. 5, 802–810.
Shapiro, L., Fannon, A. M., Kwong, P. D., Thompson, A., Lehmann, M. S., Grubel, G., Legrand, J. F., Als-Nielsen, J., Colman, D. R. & Hendrickson, W. A. (1995). Structural basis of cell–cell adhesion by cadherins. Nature (London), 374, 327–337.
Tickle, I. J., Laskowski, R. A. & Moss, D. S. (1998). Rfree and the Rfree ratio. I. Derivation of expected values of cross-validation residuals used in macromolecular least-squares refinement. Acta Cryst. D54, 547–557.
Zanotti, G., Berni, R. & Monaco, H. L. (1993). Crystal structure of liganded and unliganded forms of bovine plasma retinol-binding protein. J. Biol. Chem. 268, 10728–10738.