International
Tables for
Crystallography
Volume G
Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

International Tables for Crystallography (2006). Vol. G, ch. 5.1, pp. 482-483

Section 5.1.2.2. Data-representation frameworks

H. J. Bernsteina*

aDepartment of Mathematics and Computer Science, Kramer Science Center, Dowling College, Idle Hour Blvd, Oakdale, NY 11769, USA
Correspondence e-mail: yaya@bernstein-plus-sons.com

5.1.2.2. Data-representation frameworks

| top | pdf |

A data-representation framework provides the concepts for managing data and data about the management of data (`metadata'). Such frameworks may be based on programming languages or markup languages or built from scratch. They provide a mechanism for representing data (e.g. as data sets, graphs or trees) and a mechanism for representing metadata (e.g. as dictionaries or schemas). Four are of particular importance in crystallography: CIF, ASN.1, HDF and XML.

As noted in Chapter 1.1[link] , CIF was created to rationalize the publication process for small molecules. It combines a very simple tag–value data representation with a dictionary definition language (DDL) and well populated dictionaries. CIF is table-oriented, naturally row-based, has case-insensitive tags and allows two levels of nesting. CIF is order-independent and uses its dictionaries both to define the meanings of its tags and to parameterize its tags. It is interesting to note that, even though CIF is defined as order-independent, it effectively fills the role of an order-dependent markup language in the publication process. We will discuss this issue later in this chapter.

Abstract Syntax Notation One (ASN.1) (Dubuisson, 2000[link]; ISO, 2002[link]) was developed to provide a data framework for data communications, where great precision in the bit-by-bit layout of data to be seen by very different systems is needed. Although targeted for communications software, ASN.1 is suitable for any application requiring precise control of data structures and, as such, primarily supports the metadata of an application, rather than the data. ASN.1 can be compiled directly to C code. The resulting C code then supports the data of the application. ASN.1 notation found application in NCBI's macromolecular modelling database (Ohkawa et al., 1995[link]). ASN.1 has case-sensitive tags and allows case-insensitive variants. It manages order-dependent data structures in a mixed order-dependent/order-independent environment.

HDF (NCSA, 1993[link]) is `a machine-independent, self-describing, extendible file format for sharing scientific data in a heterogeneous computing environment, accompanied by a convenient, standardized, public domain I/O library and a comprehensive collection of high quality data manipulation and analysis interfaces and tools' (http://ssdoo.gsfc.nasa.gov/nost/formats/hdf.html ). HDF was adopted by the Neutron and X-ray Data Format (NeXus) effort (Klosowski et al., 1997[link]). HDF allows the building of a complete data framework, representing both data and metadata. Two parallel threads of software development, focused on the management and exchange of raw data from area detectors, began in the mid-1990s: the Crystallographic Binary File (CBF) (Hammersley, 1997[link]) and NeXus. The volumes of data involved were daunting and efficiency of storage was important. Therefore both proposed formats assumed a binary format. CBF was based on a combination of CIF-like ASCII headers with compressed binary images. NeXus was based on HDF. The first API for CBF was produced by Paul Ellis in 1998. CBF rapidly evolved into CBF/imgCIF with a complete DDL2 dictionary and a fully CIF-compliant API (Chapter 5.6[link] ). As of mid-2004, NeXus was still evolving (see http://www.nexusformat.org/ ).

XML is a simplified form of SGML, drawing on years of development of tools for SGML and HTML. XML is tree-oriented with case-sensitive entity names. It allows unlimited nesting and is order-dependent. Metadata are managed as a `document type definition' (DTD), which provides minimal syntactic information, or as schemas, which allow for more detail and are more consistent with database conventions. In fields close to crystallography, the first effort at adopting XML was the chemical markup language (CML) (Murray-Rust & Rzepa, 1999[link]). CML is intentionally imprecise in its ontology to allow for flexibility in development. The CSD and PDB have released their own XML representations (http://www.ccdc.cam.ac.uk/support/documentation/relibase/3_0/relibase_DPG/toc.html ; http://pdbml.rcsb.org ).

It may seem from this discussion that the application designer faces an unmanageable variety of data frameworks in an unstable, evolving environment. To some extent this is true. Fortunately, however, there are signs of convergence on CIF dictionary-based ontologies and the use of transliterated CIFs. This means that an application adapted to CIF should be relatively easy to adapt to other data frameworks.

References

ISO (2002). ISO/IEC 8824–1. Abstract Syntax Notation One (ASN.1). Specification of basic notation. Geneva: International Organization for Standardization.
NCSA (1993). NCSA HDF: specification and developer's guide. Version 3.2. University of Illinois at Urbana-Champaign, USA.
Dubuisson, O. (2000). ASN.1 – communication between heterogeneous systems. San Francisco, CA: Morgan Kaufmann. (Translated from the French by P. Fouquart.)
Hammersley, A. P. (1997). FIT2D: an introduction and overview. ESRF Internal Report ESRF97HA02T. Grenoble: ESRF.
Klosowski, P., Koennecke, M., Tischler, J. Z. & Osborn, R. (1997). NeXus: a common format for the exchange of neutron and synchrotron data. Physica B Condens. Matter, B241–243, 151–153.
Murray-Rust, P. & Rzepa, H. (1999). Chemical markup, XML and the WWW, Part I: Basic principles. J. Chem. Inf. Comput. Sci. 39, 928–942.
Ohkawa, H., Ostell, J. & Bryant, S. (1995). MMDB: an ASN.1 specification for macromolecular structure. In Proceedings of the Third International Conference on Intelligent Systems for Molecular Biology, Cambridge, England, 16–19 July 1995, pp. 259–267. Menlo Park, CA: American Association for Artificial Intelligence.








































to end of page
to top of page