Tables for
Volume G
Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

International Tables for Crystallography (2006). Vol. G, ch. 3.1, pp. 88-89

Section 3.1.9. Composite dictionaries

B. McMahona*

aInternational Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, England
Correspondence e-mail:

3.1.9. Composite dictionaries

| top | pdf |

The dictionaries referenced by a data file are those that contain the definitions of the data names used in the data file. Typically these include or consist entirely of public dictionaries that are necessarily permissive in the range of values allowed for data items. However, the power and flexibility of validating against machine-readable dictionaries could be harnessed by applications that need to impose stricter validation criteria. For example, the core dictionary permits an enumeration range of 0 to 8 for _atom_site_attached_hydrogens, but one might wish to validate a data set describing well behaved organic molecules where anything above 4 is almost certainly an error. It would be helpful to have a validation dictionary identical to the core dictionary except for this enumeration range; however it would be inefficient to create an alternative dictionary of the same size simply for this one change. In Section[link], we consider how to build a dictionary file that includes the bulk of the content of the public dictionaries cited in the CIF, together with modifications in local dictionary files to allow alternative specifications of what constitutes a `valid' data item.

Proper applications of this approach include restricting the enumeration range specified for an item in a public dictionary; enforcing a more strict data typing than allowed by the parent dictionary; storing a list of all data names (including local ones) permitted in a CIF; or adding to existing dictionary entries references to local data items in an extension dictionary. An example of the latter application would be the addition of a _list_link_child entry to a public definition to accompany the introduction of a new child category in a local dictionary. The protocol to be described does not prohibit other applications, but care must be taken to generate dictionaries that retain internal consistency and are properly parsable by standard validation tools. A dictionary merging protocol

| top | pdf |

The following protocol describes the construction of a composite, or virtual, dictionary by merging and overlaying fragments of a local validation dictionary and the public dictionaries referenced from within a data file. The term `dictionary fragment' refers here to a physical disk file which contains one or more data blocks or save frames (according to whether the relevant data model is DDL1 or DDL2) containing complete or partial sets of attributes associated with data names identified in the relevant dictionary data block or save frame through the item _name (DDL1) or (DDL2).

(i) Assemble and load all dictionary fragments against which the current data block will be validated. The order of presentation is important. Complete dictionaries referenced by a data file should be assembled in the order cited. A dictionary validation application may then accept a list of additional dictionary fragments to PREPEND to, REPLACE or APPEND to each file in the list of cited dictionaries. In most applications, it will be appropriate to append to or replace attributes defined in a public dictionary, and the PREPEND operation is presented only for completeness.

(ii) Define three modes in which conflicting data names in the aggregate dictionary file may be resolved, called STRICT, REPLACE and OVERLAY.

(iii) Scan the aggregate dictionary fragments in the order of loading. Assemble for each defined data name a composite definition frame (data block or save frame as appropriate) as follows, depending on the mode in which the validation application is operating:

STRICT: If a data name appears to be multiply defined, generate a fatal error. This mode permits the interpolation of local dictionaries that do not attempt to modify the attributes of public data items.

REPLACE: All attributes previously stored for the conflicting data name are deleted, and only the attributes in the later data block (or save frame) containing the definition are preserved. This mode permits the complete redefinition of public data names and is not appropriate for validation of CIFs to be archived. Its main use would be in testing modifications of individual definition frames outside the parent dictionary.

OVERLAY: New attributes are added to those already stored for the data name; conflicting attributes replace those already stored. This is the standard mechanism for modifying attributes for application-specific validation purposes.

This protocol allows the creation of a coherent virtual dictionary from several different dictionary files or fragments. Although it must be used with care, it permits different levels of validation based on dictionary-driven methods without modifying the original dictionary files themselves.

As an example, consider the core CIF dictionary definition mentioned above of the number of hydrogen atoms that might be attached to an atom site (Example[link]).

Example A standard CIF dictionary definition block.

[Scheme scheme21]

For a particular application, any structures reporting more than four attached hydrogen atoms might be considered as invalid. A validation program to satisfy this requirement might therefore build a composite dictionary from the public cif_core.dic, which contains the definition in Example[link], and the fragment of Example[link], processed in APPEND/OVERLAY modes.

Example A modified data attribute for overlaying a public definition.

[Scheme scheme22] Protocol implementation

| top | pdf |

At the time of publication (2005), there is no reference implementation for this protocol, and so the proper treatment of the fine details of merging and overlay operations is not available. The following guidelines outline the first steps in an implementation under DDL1.4.

The description assumes that a composite dictionary is to be assembled from two public dictionaries, a.dic and b.dic, and a local dictionary mod.dic that includes some modifications to the definitions in one or both of the public dictionaries (and is therefore processed in OVERLAY mode). It is assumed that the composite dictionary will be written to disk as a separate file, virtual.dic, although in practice applications may simply construct the image of the composite dictionary in memory.

(1) Each contributing dictionary fragment should have at most one data block containing the data names _dictionary_name and _dictionary_version (with, optionally, _dictionary_update and _dictionary_history). The *_name and *_version together identify the dictionary file uniquely and should match the corresponding entries in the IUCr register if this is a public dictionary. This information is conventionally stored in a data block named data_on_this_dictionary.

In DDL1.4, all four of the items _dictionary_name, *_version, *_update and *_history are scalars, i.e. may not be looped. Hence a new dictionary identifier section in virtual.dic may be constructed as follows.

(i) Create a data block data_on_this_dictionary at the beginning of virtual.dic.

(ii) If a name for the composite dictionary is supplied (via a command-line switch, for example), write this as the value of _dictionary_name; otherwise generate a pseudo-unique string (e.g. concatenate the computer identifier string, process number and current date string).

(iii) If a dictionary version number is supplied (via a command-line switch, for example), write this as the value of _dictionary_version; otherwise supply the value `1.0'.

(iv) Supply the current date in the format yyyy-mm-dd as the value of _dictionary_update.

(v) Create a composite _dictionary_history by concatenation of the individual _dictionary_history fragments. The application may add details of the current merge operation to the history field.

(2) There is no significance to the ordering of data blocks containing definitions in dictionaries, although they are conventionally sorted alphabetically. For convenience, data blocks should be written out in the order in which they are encountered in the input primitive dictionary files, except that definitions modified by subsequent entries remain in their initial location.

(3) In STRICT mode, if the same value of _name is present in two or more data blocks, the composite dictionary is invalid and the application should raise a fatal error. Otherwise the composite dictionary simply contains the aggregate definitions from multiple input dictionaries.

(4) In REPLACE mode, a stored definition block is discarded and replaced by a new definition of the item referenced by _name.

(5) For the OVERLAY mode (assumed in the present discussion), the following procedure is proposed. Load a data block from the first dictionary file. Locate the _name tag. (Because _name may be looped, a data block may contain definitions for more than one data name. For convenience, we consider only the case of a data block containing a single value of _name. In any event, it is possible to separate a set of looped definitions into individual data blocks, each defining only one of the data names in the initial _name loop.) Search the next dictionary file for a data block containing the same value of _name. Load the contents of that data block.

(i) If the new data block contains only data items that do not appear in the first data block, they are simply concatenated with those already present.

(ii) If the new data block contains a scalar data item already present in the first data block (i.e. with _list no), discard the stored attributes.

(iii) If the new data block contains data items that may be looped and that occur in the first data block, build a new composite table of values in the following way: (a) construct a valid loop header if necessary; (b) do not repeat identical sets of values (i.e. collapse identical table rows); (c) if it is possible to identify the category key, then raise a fatal error if there are identical instances of a key value [after the normalization of step (b) has occurred]; (d) else append new rows to the table.

When the new composite data block has been built according to these principles, search the next dictionary file specified and repeat.

to end of page
to top of page