International
Tables for
Crystallography
Volume G
Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

International Tables for Crystallography (2006). Vol. G, ch. 2.3, pp. 37-39

Section 2.3.2. CBF and imgCIF

H. J. Bernsteina* and A. P. Hammersleyb

aDepartment of Mathematics and Computer Science, Kramer Science Center, Dowling College, Idle Hour Blvd, Oakdale, NY 11769, USA, and bESRF/EMBL Grenoble, 6 rue Jules Horowitz, France
Correspondence e-mail:  yaya@bernstein-plus-sons.com

2.3.2. CBF and imgCIF

| top | pdf |

CBF and imgCIF are two aspects of the same format. Since CIFs are pure ASCII text files, it was necessary to define a separate binary format to allow the combination of pseudo-ASCII sections and binary data sections. In the binary-file CBF format, the ASCII sections conform closely to the CIF standard but must use operating-system-independent `line separators'. In order to facilitate interchange of files, an API that writes CBF files should use \r\n (carriage return, line feed) for the line separator. Use of this line separator allows the ASCII sections to be viewed with standard system utilities (e.g. `more', `pg') on a very wide range of operating systems (e.g. Unix, MacOS and Windows). However, an API that reads CBF format must accept any of the following three alternative line terminators as the end of an ASCII line: \r, \n or \r\n. As for all CIF data sets, an imgCIF file conforms to the normal text file-writing conventions of the system on which it is written. imgCIF is also one of the two names of the CIF dictionary (see Chapter 4.6[link] ) that contains the terms specific to describing image data in both CBF and imgCIF data sets. Thus a CBF or imgCIF data set uses data names from the CBF/imgCIF dictionary and other CIF dictionaries.

The general structure of a CBF or imgCIF data set is shown in Example 2.3.2.1[link]. After a special comment to identify the file type (a so-called `magic number') and any other initial comments, the data set begins with a `data_blockname' which gives the name of the data block. Tags and values that describe the image data and how they were collected come next. For efficiency in processing, it is recommended that all the descriptive tags come before the actual image data. This recommendation is a requirement for the binary CBF format. It is optional for the ASCII imgCIF format. The image data are given as the value of the tag _array_data.data. The image data are given in a text field, using MIME conventions to describe the encoding.

Example 2.3.2.1. General structure of a CBF or imgCIF data set. The critical values that define an image are marked with [\vrule width2pt height10pt].

[Scheme scheme1]

2.3.2.1. A simple example

| top | pdf |

Before describing the format in full, we start by showing a simple but important and complete use of the format: that of storing a single detector image in a file together with a small amount of auxiliary information. This is intended to be a useful example that can be understood without reference to the full definitions. It also serves as an introduction or overview of the format definition. This example uses CIF DDL2-based dictionary items (see Chapter 2.6[link] ).

Example 2.3.2.2[link] relates to an image of 768 × 512 pixels stored as 16-bit unsigned integers, in little-endian byte order (this is the native byte ordering on a PC). The pixel sizes are 100.5 × 99.5 µm.

Example 2.3.2.2. A single image.

[Scheme scheme2]

The example will be presented and discussed in three sections. The circled numerals are included to allow us to comment on portions of the example. They are not part of the CBF/imgCIF format.

The line marked by the circled number 1, starting with a hash character (#), is a CIF and CBF comment line. As a first line, the pattern of three hashes followed by `CBF' helps to identify the data set as a CBF. It is a so-called `magic number'. The text ###CBF: VERSION must be present as the very first line of every CBF file. Following `VERSION' is the number of the corresponding version of the CBF/imgCIF extension dictionary and supporting documentation. Comment lines and white space (blanks and newlines) may appear anywhere outside the binary sections. In an imgCIF data set, the descriptive tags and values may be presented in any convenient order, e.g. the data could come first and the parameters necessary to interpret the data could come later. This order-independent convention holds for an imgCIF file, but for a CBF all the tags and values describing binary data (i.e. all the tags other than those in the ARRAY_DATA category) should be presented before the binary data, in the form of a header. This does not mean that there cannot be more useful information after the binary data. There could be another full header and more blocks of binary data. In the interest of efficiency in processing a CBF, the parameters that relate to a particular block of binary data must appear earlier in the CBF than the block itself.

The header begins at the line marked with the circled number 2. The data_ token is the CIF token for identifying a data block. The name of the data block, image_1, follows immediately without any intervening white space. The name of the data block is arbitrary. Within a data block any given tag may be presented only once, either directly with a value following immediately, or as one of the column headings for the rows of a table. To reuse the same tag one must start a new data block.

Information about the image begins at the line marked with the circled number 3. In the following lines, the apostrophes enclose strings that contain a space. Values that contain white space, or that could be confused with a CIF token, must always be quoted. A double quote ( ") could have been used. There is a third way to quote a string, with the string \r\n;, i.e. with a semicolon at the beginning of a line, which allows multi-line strings to be presented. We shall use this form of text quotation for the binary data.

The experimental details begin at the line marked with the circled number 4. Many more data items could be defined, but here we are giving an example of one useful minimal (but not mandatory) set. See the imgCIF dictionary in Chapter 4.6[link] and the classification of image data in Chapter 3.7[link] for a discussion of which items are mandatory.

After describing the parameters of the experiment, we describe the organization of the image data (Example 2.3.2.3[link]).

Example 2.3.2.3. Organization of image data in a CBF/imgCIF.

[Scheme scheme3]

Note that we have changed from listing a value directly with each tag to a tabular format, using the CIF loop_ token.

The *.array_id tags identify data items belonging to the same array. Here we have chosen the name image_1, but another name could have been used, as long as it is used consistently. The *.index tags refer to the dimension being defined, and the *.dimension column defines the number of elements in that dimension. The *.precedence tag defines the precedence of rastering of the data. In this case, the first dimension is the faster changing dimension. The *.direction column tells us the direction in which the data raster runs within a dimension. Here the data raster runs from the minimum element towards the maximum element (`increasing') in the first dimension, and from the maximum element towards the minimum element in the second dimension. This is the default rastering order.

We have given the abstract ordering of the data. The physical view of data is described in detail in the CBF/imgCIF dictionary (Chapters 3.7[link] and 4.6[link] ).

In general, the physical sense of the image is from the sample to the detector.

The storage of the binary data is now fully defined. Further data items could be defined, but we are ready to present the image data (Example 2.3.2.4[link]). This is done with the ARRAY_DATA category. The actual binary data will come just a little further down, as the essential part of the value of _array_data.data, which begins as semicolon-quoted text.

Example 2.3.2.4. Representation of the binary data.

[Scheme scheme4]

The line immediately after the line with the semicolon is a MIME boundary marker. As with all MIME boundary markers, it begins with `--' (two hyphens). The next few lines are MIME headers, describing some useful information we will need in order to process the binary section. MIME headers can appear in different orders and can be very confusing (look at the raw contents of an e-mail message with attachments), but there are only a few headers that have to be understood to process a CBF.

The `Content-Type' header serves to describe the nature of the following data sufficiently to allow an association with an appropriate agent or mechanism for presenting the data to a user. It may be any of the discrete types permitted in RFC 2045 (Freed & Borenstein, 1996b[link]), but, unless the binary data conform to an existing standard format (e.g. TIFF or JPEG), the description `application/octet-stream' is recommended. If the octet stream has been compressed, the compression should be specified by the parameter conversions="x-CBF_PACKED" or by specifying one of the other compression types allowed as described in Chapter 5.6[link] .

The `Content-Transfer-Encoding' header describes any encoding scheme applied to the data, most commonly to transform it to an ASCII-only representation. For a CBF the value should be `BINARY'. We consider the other values used for imgCIF below.

The `X-Binary-Size' header specifies the size of the binary data in octets. Calculation of the size where compression is used is described in Section 2.3.3.3[link]. The `X-Binary-Element-Type' header specifies the type of binary data in the octets, using the same descriptive phrases as in _array_structure.encoding_type (the default value is `unsigned 32-bit integer').

The other MIME headers in the example provide an identifier and a content checksum. The MIME header items are followed by an empty line and then by a special sequence (marked here as START_OF_BIN), consisting of the single characters Ctrl-L, Ctrl-Z, Ctrl-D and a single binary flag character of hexadecimal value D5 (213 decimal). The binary data follow immediately after this flag character. The reasons for choosing this sequence are discussed in Section 2.3.3.3[link].

After the last octet (i.e. byte) of the binary data, there is a special trailer \r\n--CIF-BINARY-FORMAT-SECTION----\r\n;. This repeats the initial boundary marker with an extra -- at the end (a MIME convention for the last boundary marker), followed by a closing semicolon quote for a text section. This is essential in an imgCIF, and we include it in a CBF for consistency.

References

Freed, N. & Borenstein, N. (1996b). Multipurpose Internet Mail Extensions (MIME) part one: Format of Internet message bodies. RFC 2045. Network Working Group. http://www.ietf.org/rfc/rfc2045.txt .








































to end of page
to top of page