International
Tables for
Crystallography
Volume G
Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

International Tables for Crystallography (2006). Vol. G, ch. 3.1, pp. 83-85

Section 3.1.7. Composing new data definitions

B. McMahona*

aInternational Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, England
Correspondence e-mail: bm@iucr.org

3.1.7. Composing new data definitions

| top | pdf |

Preceding sections have described the framework within which CIF dictionaries exist and are used, and their individual formal structures. While this is important for presenting the definition of new data items, it does not address what is often the most difficult question: what quantities, concepts or relationships merit separate data items? On the one hand, the extensibility of CIF provides great freedom of choice: anything that can be characterized as a separate idea may be assigned a new data name and set of attributes. On the other hand, there are practical constraints on designing software to write and read a format that is boundless in principle, and some care must be taken to organize new definitions economically and in an ordered way.

3.1.7.1. Granularity

| top | pdf |

Perhaps the most obvious decision that needs to be made is the level of detail or granularity chosen to describe the topic of interest. CIF data items may be very specific (the deadtime in microseconds of the detector used to measure diffraction intensities in an experiment) or very general (the text of a scientific paper). In general, a data name should correspond to a single well defined quantity or concept within the area of interest of a particular application. It can be seen that the level of granularity is determined by the requirements of the end application.

A practical example of determining an appropriate level of granularity is given by the core dictionary definitions for bibliographic references cited in a CIF. The dictionary originally contained a single character field, _publ_section_references, which was intended to contain the complete reference list for an article as undifferentiated text. Notes for Authors in journals accepting articles in CIF format advised authors to separate the references within the field with blank lines, but otherwise no structure was imposed upon the field. In a subsequent revision to the core dictionary, the much richer CITATION category was introduced to allow the structured presentation of references to journal articles and chapters of books. This was intended to aid queries to bibliographic databases. However, a full structured markup of references with multiple authors or editors in CIF requires additional categories, so that the details of the reference may be spread across three tables corresponding to the CITATION, CITATION_AUTHOR and CITATION_EDITOR categories. Populating several disjoint tables greatly complicates the author's task of writing a reference list. Moreover, the CITATION category does not yet cover all the many different types of bibliographic reference that it is possible to specify, and is therefore suitable only for references to journal articles and chapters of books. However, it is possible to write a program that can deduce the structure of a standard reference within an undifferentiated reference list (provided the journal guidelines have been followed by the author) to the extent that enough information can be extracted to add hyperlinks to references using a cross-publisher reference linking service such as CrossRef (CrossRef, 2004[link]). Therefore, in practice, IUCr journals still ask the author of an article to supply their reference list in the _publ_section_references field, rather than using the apparently more useful _citation_ fields. It remains to be seen whether this is the best strategy in the long term.

In more technical topic areas, the details of an experimental instrument could be described by a huge number of possible data names, ranging from the manufacturer's serial number to the colour of the instrument casing. However, many of these details are irrelevant to the analysis of the data generated by the instrument, so the characteristics of an instrument that are assigned individual data names are typically just those parameters that need to be entered in equations describing the calibration or interpretation of the data it generates.

3.1.7.2. Category `special details' fields

| top | pdf |

When the specific items in a particular topic area that need to be recorded under their own data names have been decided, there is likely to be other information that could be recorded, but is felt to be irrelevant to the immediate purposes of the data collection and analysis. It is good practice to provide a place in the CIF for such additional information; it encourages an author to record the infomation and permits data mining at a later stage. Each category typically contains a data name with the suffix _details (or _special_details) which identifies a text field in which additional information relating to the category may be stored. This field often contains explanatory text qualifying the information recorded elsewhere in the same category, but it might contain additional specific items of information for which no data name is given and for which no obvious application is envisaged. This helps to guard against the loss of information that might be put to good use in the future. Of course, if a *_details field is regularly used to store some specific item of information and this information is seen to be valuable in the analysis or interpretation of data elsewhere in the file, there is a case for defining a new, separate tag for this information.

3.1.7.3. Construction of data names

| top | pdf |

Since a dictionary definition contains all the machine-readable attributes necessary for validating the contents of a data field, the data name itself may be an arbitrary tag, devoid of semantic content. However, while dictionary-driven access to a CIF is useful in many cases, there are circumstances where it is useful to browse the file. It is therefore helpful to construct a data name in a way that gives a good indication of the quantity described. From the beginning, CIF data names have been constructed from self-descriptive components in an order that reflects the hierarchical relationship of the component ideas, from highest (most general) level to lowest (most specific) level when read from left to right.

In a typical example from the core CIF dictionary, the data name _atom_site_type_symbol defines a code (symbol) indicating the chemical nature (type) of the occupant of a location in the crystal lattice (atom_site). The equivalent data name from the mmCIF dictionary, _atom_site.type_symbol, explicitly separates the category to which the data name belongs from its more specific qualifiers by using a full stop (.) instead of an underscore (_). While this use of a full stop is mandated in DDL2 dictionaries, it should nevertheless be considered a convenience, since the category membership is explicitly listed in the dictionary definition frame for every data name.

However, it may not always be easy to establish the best order of components when constructing a new data name. In the JOURNAL category, there was initially some uncertainty about whether to associate the telephone numbers of different contact persons by appending codes such as _coeditor and _techeditor to a common base name. In the end, the order of components was reversed to give names like _journal_coeditor_phone and _journal_techeditor_phone. Examining the JOURNAL category in the core CIF dictionary will show why this was done. Similarly, the extension of geometry categories to include details of hydrogen bonding went through a stage of discussing adding new data names to the existing categories, but with suffixes indicating that the components were participating in hydrogen bonding, before it was decided that a completely new category for describing all elements of a hydrogen bond was justified. These examples show that the correct ordering of components within a data name is closely related to the perceived classification of data names by category and subcategory.

Sometimes it is useful to differentiate alternative data items by appending a suffix to a root data name. For example, the core dictionary defines several data names for recording the reference codes associated with a data block by different databases: _database_code_CAS, _database_code_CSD etc. This is convenient where there are two or three alternatives, but becomes unwieldy when the number of possibilities increases, because new data names need to be defined for each new alternative case. A better solution is to have a single base name and a companion data item that defines which of the available alternatives the base item refers to. The mmCIF dictionary follows this principle: the category DATABASE_2 contains two data names, _database_2.database_code (the value of which is an assigned database code) and _database_2.database_id (the value of which identifies which of the possible databases assigned the code) (Fig. 3.1.7.1[link]).

[Figure 3.1.7.1]

Figure 3.1.7.1 | top | pdf |

Alternative quantities described (a) by data-name extension (core dictionary) or (b) by paired data names (mmCIF dictionary).

Note the distinction between a data name constructed with a suffix indicating a particular database, and a data name which incorporates a prefix registered for the private use of a database. The data name _database_code_PDB is a public data name specifying an entry in the Protein Data Bank, while _pdb_database_code is a private data name used for some internal purpose by the Protein Data Bank (see Section 3.1.8.2[link]).

3.1.7.4. Parsable data values versus separate data names

| top | pdf |

An advantage of defining multiple data names for the individual components of a complicated quantity is that there is no ambiguity in resolving the separate components. Hence the Miller indices of a reflection in the list of diffraction measurements are specified in the core dictionary by the group of three data names _diffrn_refln_index_h, _diffrn_refln_index_k and _diffrn_refln_index_l. In principle, a single data name associated with the group of three values in some well defined format (e.g. comma separated, as h, k, l) could have been defined instead. However, this would require a parser to understand the internal structure of the value so that it could parse out the separate values for h, k and l.

On the other hand, there are many examples of data values that are stored as string values parsable into distinct components. An extreme example is the reference list mentioned in Section 3.1.7.1[link]. More common are dates ( _audit_creation_date), chemical formulae (e.g. _chemical_formula_moiety), symmetry operations ( _symmetry_equiv_pos_as_xyz) or symmetry transformation codes ( _geom_bond_site_symmetry_1). There is no definitive answer as to which approach is preferred in a specific case. In general, the separation of the components of a compound value is preferred when a known application will make use of the separate components individually. For instance, applications may list structure factors according to a number of ordering conventions on individual Miller indices. As an extreme example of separating the components of a compound value, the mmCIF dictionary defines data names for the standard uncertainty values of most of the measurable quantities it describes, while the core dictionary just uses the convention that a standard uncertainty is specified by appending an integer in parentheses to a numeric value.

When compound values are left as parsable strings, the parsing rules for individual data items need to be made known to applications. The DDL1 attribute _type_construct was envisaged as a mechanism for representing the components of a data value with a combination of regular expressions and reference to primitive data items, but this has not been implemented in existing CIF dictionaries (or in dictionary utility software). An alternative approach used in DDL2-based dictionaries defines within the dictionaries a number of extended data types (expressed in regular-expression notation through the attribute _item_type_list.code).

A related problem is how to handle data names that describe an indeterminate number of parameters. For example, in the modulated structures dictionary an extra eight Miller indices are defined to span a reciprocal space of dimension up to 11. In principle, the dimensionality could be extended without limit. According to the practice of defining a unique data name for each modulation dimension, new data names would need to be defined as required to describe higher-dimensional systems. Beyond a certain point this will become unwieldy, as will the set of data names required to describe the n2 components of the W matrix for a modulated structure of dimensionality n ( _cell_subsystem_matrix_W_1_1 etc.).

The modulated structures dictionary was constrained to define extended Miller indices in this way for compatibility with the core dictionary. Data names describing new quantities that are subject to similar unbounded extensibility should perhaps refer to values that are parsable into vector or matrix components of arbitrary dimension.

3.1.7.5. Consistency of abbreviations

| top | pdf |

One further consideration when constructing a data name is the use of consistent abbreviations within the components of the data name. This is of course a matter of style, since if a data name is fully defined in a dictionary with a machine-readable attribute set, the data name itself can be anything. Nonetheless, to help to find and group similar data names it is best to avoid too many different abbreviations.

Table 3.1.7.1[link] lists the abbreviations used in the current public dictionaries. Note that there are already cases where different abbreviations are used for the same term.

Table 3.1.7.1| top | pdf |
Abbreviations in CIF data names

Terms for which abbreviations are defined are sometimes found unabbreviated.

AbbreviationTermAbbreviationTermAbbreviationTerm
abbrev abbreviation eqn equation oper operation
abs absolute (configuration, not structure) esd standard uncertainty (estimated standard deviation) (see su) org organism
absorpt absorption orient orientation
alt alternative expt experiment origx orthogonal coordinate matrix (PDB files)
amp amplitude exptl experimental os operating system
AN accession number fom figure of merit param parameter
anal analyser fract fractional pd powder diffraction
aniso anisotropic Fsqd F squared PDB Protein Data Bank
anisotrop anisotropic gen generation PDF Powder Diffraction File
anom anomalous gen generator perp perpendicular
ASTM American Society for Testing and Materials gen genetic phos phosphate
asym asymmetric geom geometric pk peak
atten attenuation H-M Hermann–Mauguin polarisn polarization
au arbitrary units ha heavy atom poly polymer
auth author hbond hydrogen bond pos position
av average hist history prep preparation
ax axial horiz horizontal proc processed
B B form of atomic displacement parameter (a.d.p.) I intensity prof profile
ICSD Inorganic Crystal Structure Database prot protein
backgd background id identifier ptnr partner
beg begin illum illumination publ publication
bg background imag imaginary R agreement index
biol biology inc increment rad radius
bkg background incl include recd received
bond bonding info information recip reciprocal
Bsol B form of a.d.p. for solvent instr instrument ref reference
calc calculated Int international refine refinement
calib calibration (pd) ISBN International Standard Book Number refln reflection
cartn Cartesian iso isotropic reflns reflections
CAS Chemical Abstracts Service iso isomorphous res resolution
char characterization (pd) ISSN International Standard Serial Number restr restraints
chem chemical IUCr International Union of Crystallography rev revision
chir chirality IUPAC International Union of Pure and Applied Chemistry Rmerge agreement index of merging
clust cluster rms root mean square
coef coefficient len length rot rotation
com common lim limit S goodness of fit
comp component loc lack of closure samp sample
conc concentration ls least squares scat scattering factor
conf conformation max maximum seq sequence
config configuration MDF Metals Data File sigI σ(I)
conform conformant meanI mean intensity sigmaI σ(I)
conn connectivity meas measured sint [\sin\theta]
cons constant mid middle (between max and min) sint/lambda [\sin(\theta)/\lambda]
CSD Cambridge Structural Database min minimum sol solvent
db database mod modification spec specimen
defn definition mods modifications src source
detc detector mon monomer std standard
der derivative monochr monochromator (pd) stol [\sin(\theta)/\lambda]
dev standard deviation mono monochromator (pd) struct structure
dict dictionary nat natural su standard uncertainty
dif difference NBS National Bureau of Standards (now National Institute of Standards and Technology) suppl supplementary
diff difference sys systematic
diffr diffractometer tbar mean path length
diffrn diffraction NCA number of connected atoms temp temperature
displace displacement ncs noncrystallographic symmetry tor torsion angle
dist distance netI net intensity tran transformation
divg divergence NH number of connected hydrogen atoms transf transformation
dom domain nha non-hydrogen atoms transform transformation
dtime deadtime norm normal tvect translation vector (PDB files)
ens ensemble nst nonstandard vert vertical
eq equatorial nucl nucleic acid wR weighted agreement index
equat equatorial num number wt weight
equiv equivalent obs observed    
Terms with multiple definitions.








































to end of page
to top of page