International
Tables for
Crystallography
Volume G
Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

International Tables for Crystallography (2006). Vol. G, ch. 2.2, pp. 25-36

## Section 2.2.7. Formal specification of the Crystallographic Information File. Version 1.1 specification

S. R. Hall,a* N. Spadaccini,c I. D. Brown,d H. J. Bernstein,e J. D. Westbrookb and B. McMahonf

### 2.2.7. Formal specification of the Crystallographic Information File. Version 1.1 specification

| top | pdf |

This section presents the documents File syntax (Sections 2.2.7.1–3) and Common semantic features (Section 2.2.7.4) that together comprise the formal CIF specification as approved by COMCIFS.

| top | pdf |

#### 2.2.7.1.1. Introduction

| top | pdf |

(1) This document describes the full syntax of the Crystallographic Information File (CIF).

#### 2.2.7.1.2. Definition of terms

| top | pdf |

(2) The following terms are used in the CIF specification documents with the specific meanings indicated here.

(2.1) A CIF is a file conforming to the specification herein stated, containing either information on a crystallographic experiment or its results (or similar scientific content), or descriptions of the data identifiers in such a file.

(2.2) A data file is understood to convey information relating to a crystallographic experiment.

(2.3) A dictionary file is understood to contain information about the data items in one or more data files as identified by their data names.

(2.4) A data name is a case-insensitive identifier (a string of characters beginning with an underscore character) of the content of an associated data value.

(2.5) A data value is a string of characters representing a particular item of information. It may represent a single numerical value; a letter, word or phrase; extended discursive text; or in principle any coherent unit of data such as an image, audio clip or virtual-reality object.

(2.6) A data item is a specific piece of information defined by a data name and an associated data value.

(2.7) A tag is understood in this document to be a synonym for data name.

(2.8) A data block is the highest-level component of a CIF, containing data items or save frames. A data block is identified by a data-block header, which is an isolated character string (that is, bounded by white space and not forming part of a data value) beginning with the case-insensitive reserved characters data_.

(2.9) A block code is the variable part of a data-block header, e.g. the string foo in the header data_foo.

(2.10) A save frame is a partitioned collection of data items within a data block, started by a save-frame header, which is an isolated character string beginning with the case-insensitive reserved characters save_, and terminated with an isolated character string containing only the case-insensitive reserved characters save_.

(2.11) A frame code is the variable part of a save-frame header, e.g. the string foo in the header save_foo.

#### 2.2.7.1.3. File syntax

| top | pdf |

(3) The syntax of CIF is a proper subset of the syntax of STAR Files as described by Hall (1991) and Hall & Spadaccini (1994). The general structure is described below in Section 2.2.7.1.4 and a number of subsections list specific restrictions to the STAR syntax that are in force within CIF. A formal language grammar using computer-science notation is included as Section 2.2.7.2.

#### 2.2.7.1.4. General features

| top | pdf |

(4) A CIF consists of data names (tags) and associated values organized into data blocks. A data block may contain data items (associated data names and data values) and/or it may contain save frames.

(5) Save frames may only be used in dictionary files.

Implementation note: At a purely syntactic level there is no way to distinguish between dictionary and data files. (It is also to be noted that not all dictionary files contain save frames.) A fully validating parser must therefore be able to detect the start and termination of save frames, the uniqueness of the frame code within a data block and the uniqueness of data names within a frame code. It is, however, legitimate for an application-based parser designed to handle only the contents of data files to consider the presence of a save frame as an error.

(6) A data block begins with the reserved case-insensitive string data_ followed immediately by the name of the data block, forming a data-block header. A save frame has a similar structure to a data block, but may not itself contain further save frames. A save frame begins with the reserved case-insensitive string save_ followed immediately by the name of the save frame, forming a save-frame header. Unlike a data block, a save frame also has a marker for the end of the frame in the form of a repetition of the reserved case-insensitive word save_, this time without the name of the frame. Save frames may not nest. Within a single CIF, no two data blocks may have the same name; within a single data block no two save frames may have the same name, although a save frame may have the same name as a data block in the same CIF.

(7) A given data name (tag) [see (2.4) and (2.7)] may appear no more than once in a given data block or save frame. A tag may be followed by a single value, or a list of one or more tags may be marked by the preceding reserved case-insensitive word loop_ as the headings of the columns of a table of values. White space is used to separate a data-block or save-frame header from the contents of the data block or save frame, and to separate tags, values and the reserved word loop_. Data items (tags along with their associated values) that are not presented in a table of values may be relocated along with their values within the same data block or save frame without changing the meaning of the data block or save frame.  Complete tables of values (the table column headings along with all columns of data) may be relocated within the same data block or save frame without changing the meaning of the data block or save frame.  Within a table of values, each tag may be relocated along with its associated column of values within the same table of values without changing the meaning of the table of values.  In general, each row of a table of values may also be relocated within the same table of values without changing the meaning of the table of values.  Combining tables of values or breaking up tables of values would change the meanings, and is likely to violate the rules for constructing such tables of values.

(8) The case-insensitive word global_, used in STAR Files to introduce a group of data values with a scope extending to the end of the file, is an additional reserved word in CIF (that is, it may not be used as the unquoted value of any data item).

(9) If a data value (2.5) contains white space or begins with a character string reserved for a special purpose, it must be delimited by one of several sets of special character strings (the choice of which is constrained if the data value contains characters interpretable as marking a new line of text according to the discussion in the following paragraphs). Such a data value will be indicated by the term non-simple data value.

(10) A simple data value (i.e. one which does not contain white space or begin with a special character string) may optionally be delimited by any of the same set of delimiting character strings, except for data values that are to be interpreted as numbers.

(11) The special character strings in this context are listed in the following table. The term non-simple data values' in this table refers to data values beginning with these special character strings.

Character or stringRole
_ identifies data name
# identifies comment

#### 2.2.7.1.9.5. Exclusion of global_ blocks

| top | pdf |

(33) In the full STAR specification, blocks of data headed by the special case-insensitive word global_ are permitted before normal data blocks. They contain data names and associated values which are inherited in subsequent data blocks; the scope of a value extends from its point of declaration in a global block to the end of the file. Because rearrangements of the order of data blocks and concatenation of data blocks from different files are commonplace operations in many CIF applications, and because of the difficulty in properly tracking and implementing values implied by global blocks, use of the global_ feature of STAR is expressly forbidden at this revision. To guard against its future introduction, the special case-insensitive word global_ remains reserved in CIF.

#### 2.2.7.1.10. Version identification

| top | pdf |

(34) As an archival file format, the CIF specification is expected to change infrequently. Revised specifications will be issued to accompany each substantial modification. A CIF may be considered compliant against the most recent version for which in practice it satisfies all syntactic and content rules as detailed in the formal specification document. However, to signal the version against which compliance was claimed at the time of creation, or to signal the file type and version to applications (such as operating-system utilities), it is recommended that a CIF begin with a structured comment that identifies the version of CIF used. For CIFs compliant with the current specification, the first 11 bytes of the file should be the string immediately followed by one of the white-space characters permitted in paragraph (22).

| top | pdf |

#### 2.2.7.2.1. Summary

| top | pdf |

(35) The rows of Table 2.2.7.1 are called productions'. Productions are rules for constructing sentences in a language. They are written in terms of terminal symbols' and non-terminal symbols'. Terminal symbols' are what actually appear in a language. For example, 'poodle' might be given as a string of terminal symbols in some language discussing dogs. Non-terminal symbols are the higher-level constructs of the language, e.g. sentences, clauses, etc. For example <DOG> might be given as a non-terminal symbol in some language discussing dogs. Productions may be used to infer rules for parsing the language. For example, might be given as a rule telling us what names of types of dogs we are allowed to write in this language. In this table, terminal symbols (i.e. terminal character strings) are enclosed in single quotes. To avoid confusion, the terminal symbol consisting of a single quote (i.e. an apostrophe) is indicated by <single_quote> and the terminal symbol consisting of a double quote is indicated by <double_quote>. The printable space character is indicated by <SP>, the horizontal tab character by <HT> and the end of a line by <eol>. To allow for the occurrence of a semicolon as the initial character of an unquoted character string, provided it is not the first character in a line of text, the special symbol <noteol> is used below to indicate any character that is not interpretable as a line terminator. The cases of context sensitivity involving the beginning of text fields and the ends of quoted strings are discussed below, but they are most commonly resolved in a lexical scan.

 Table 2.2.7.1| top | pdf | A formal grammar for CIF
 (a) Basic structure of a CIF.
Syntactic unitSyntaxCase sensitive?
<CIF> <Comments>? <WhiteSpace>? { <DataBlock> { <WhiteSpace> <DataBlock> }* { <WhiteSpace> }? }? yes
<DataBlock> <DataBlockHeading> {<WhiteSpace> { <DataItems> | <SaveFrame>} }* yes
<DataBlockHeading> <DATA_> { <NonBlankChar> }+ no
<SaveFrame> <SaveFrameHeading> { <WhiteSpace> <DataItems> }+ <WhiteSpace> <SAVE_> yes
<SaveFrameHeading> <SAVE_> { <NonBlankChar> }+ no
<DataItems> <Tag> <WhiteSpace> <Value> | <LoopHeader> <LoopBody> yes
<LoopBody> <Value> { <WhiteSpace> <Value> }* yes
 (b) Reserved words.
Syntactic unitSyntaxCase sensitive?
<DATA_> {'D'|'d'} {'A'|'a'} {'T'|'t'} {'A'|'a'} '_' no
<LOOP_> {'L'|'l'} {'O'|'o'} {'O'|'o'} {'P'|'p'} '_' no
<GLOBAL_> {'G'|'g'} {'L'|'l'} {'O'|'o'} {'B'|'b'} {'A'|'a'} {'L'|'l'} '_' no
<SAVE_> {'S'|'s'} {'A'|'a'} {'V'|'v'} {'E'|'e'} '_' no
<STOP_> {'S'|'s'} {'T'|'t'} {'O'|'o'} {'P'|'p'} '_' no
 (c) Tags and values.
Syntactic unitSyntaxCase sensitive?
<Tag> '_'{ <NonBlankChar>}+ no
<Value> { '.' | '?' | <Numeric> | <CharString> | <TextField> } yes
 (d) Numeric values.
Syntactic unitSyntaxCase sensitive?
<Numeric> { <Number> | <Number> '(' <UnsignedInteger> ')' } no
<Number> {<Integer> | <Float> } no
<Integer> {{ '+' | '-' }? <UnsignedInteger> no
<Float> { <Integer><Exponent> | { {'+'|'-'} ? { {<Digit>} * '.' <UnsignedInteger> } | { <Digit>} + '.' } } {<Exponent>} ? } } no
<Exponent> { {'e' | 'E' } | {'e' | 'E' } { '+' | '-' } } <UnsignedInteger> no
<UnsignedInteger> { <Digit> }+ no
<Digit> { '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' } no
 (e) Character strings and text fields.
Syntactic unitSyntaxCase sensitive?
<CharString> <UnquotedString> | <SingleQuotedString> | <DoubleQuotedString> yes
<eol><UnquotedString> <eol><OrdinaryChar> {<NonBlankChar>}* yes
<noteol><UnquotedString> <noteol>{<OrdinaryChar>|';'} {<NonBlankChar>}* yes
<SingleQuotedString><WhiteSpace> <single_quote>{<AnyPrintChar>}* <single_quote> <WhiteSpace> yes
<DoubleQuotedString><WhiteSpace> <double_quote> {<AnyPrintChar>}* <double_quote> <WhiteSpace> yes
<TextField> { <SemiColonTextField> } yes
<eol><SemiColonTextField> <eol>';' { {<AnyPrintChar>}* <eol> {{<TextLeadChar> {<AnyPrintChar>}*}? <eol>}* } ';' yes
Syntactic unitSyntaxCase sensitive?
<Comments> { '#' {<AnyPrintChar>}* <eol>}+ yes
 (g) Character sets.
Syntactic unitSyntaxCase sensitive?
<OrdinaryChar>' { '!'|'%'|'&'|'('|')'|'*'|'+'|','|'-'|'.'|'/'|'0'|'1'|'2'|'3'|'4'|'5'| '6'|'7'|'8'|'9'|':'|'<'|'='|'>'|'?'|'@'|'A'|'B'|'C'|'D'|'E'|'F'|'G'|'H'| 'I'|'J'|'K'|'L'|'M'|'N'|'O'|'P'|'Q'|'R'|'S'|'T'|'U'|'V'|'W'|'X'|'Y'|'Z'| '\'|'^'|''|'a'|'b'|'c'|'d'|'e'|'f'|'g'|'h'|'i'|'j'|'k'|'l'|'m'|'n'|'o'| 'p'|'q'|'r'|'s'|'t'|'u'|'v'|'w'|'x'|'y'|'z'|'{'|'|'|'}'|'~' } yes
<NonBlankChar> <OrdinaryChar>|<double_quote>|'#'|'$'|<single_quote>|'_' |';'|'['|']' yes <TextLeadChar> <OrdinaryChar>|<double_quote>|'#'|'$'|<single_quote>|'_'|<SP>|<HT>|'['|']' yes
<AnyPrintChar> <OrdinaryChar>|<double_quote>|'#'|'$'|<single_quote>|'_'|<SP>|<HT>|';'|'['|']' yes (36) Productions can be used to produce documents, or equivalently to check a document to see if it is valid in this grammar. The angle brackets delimit names for the syntactic units (the non-terminal symbols') being defined. The curly braces enclose alternatives separated by vertical bars and/or followed by a plus sign for one or more', an asterisk for zero or more' or a question mark for zero or one'. (37) In most cases, each production has a single non-terminal symbol in the syntactic unit being defined. However, in some cases, both the syntactic unit and the syntax begin or end with some common symbol. This indicates that a specific context is required in order for the rule to be applied. This is done because the initial semicolon of a semicolon-delimited text field only has meaning at the beginning of a line, and quoted strings may contain their initial quoting character provided the embedded quoting character is not immediately followed by white space. This context-sensitive' notation is unusual in defining computer languages (although very common in the full specifications of many computer and non-computer languages). This context-sensitive notation greatly simplifies the definitions and is simple to implement. The formal definitions are elaborated below. (38) In the present revision, the production for <TextField> is a trivial equivalence to <SemiColonTextField>. The redundancy is retained to permit possible future extensions to text fields, in particular the possible introduction of a bracket-delimited text value. #### 2.2.7.2.2. Explanation of the formal syntax | top | pdf | Comment: Readers not familiar with the conventions used in describing language grammars may wish to consult various lecture notes on the subject available on the web, e.g. Bernstein (2002). (39) In creating a parser for CIF, the normal process is to first perform a lexical scan' to identify tokens' in the CIF. A token' is a grammatical unit, such as a special character, or a tag or a value, or some major grammatical subunit. In the course of a lexical scan, the input stream is reduced to manageable pieces, so that the rest of the parsing may be done more efficiently. The convention followed in this document is to mark the non-terminal' tokens that are built up out of actual strings of characters or which do not have an immediate representation as printable characters by angle brackets, <>, and to indicate the tokens that are actual strings of characters as quoted strings of characters. (40) The precise division between a lexical scan and a full parse is a matter of convenience. A suggested division is presented. Before getting to that point, however, there are some highly machine-dependent matters that need to be resolved. There must be a clear understanding of the character set to be used, and of how files and lines begin and end. The character set will be specified in terms of printable characters and a few control characters from the 7-bit ASCII character set. In addition, we will need some means of specifying the end of a line. (41) The character set in CIF is restricted to the ASCII control characters <HT> (horizontal tab, position 09 in the ASCII character set), <NL> (newline, position 10 in the ASCII character set, also named <LF>) and <CR> (carriage return, position 13 in the ASCII character set), and the printable characters in positions 32–126 of the ASCII character set. These are the characters permitted by STAR with the exception of VT (vertical tab, position 11 in the ASCII character set) and FF (form feed, position 12 in the ASCII character set). In general it is poor practice to use characters that are not common to all national variants of the ISO character set. On systems or in programming languages that do not work in ASCII', the characters themselves may have different numeric values and in some cases there is no access to all the control characters. (42) The <eol> token stands for the system-dependent end of line. Implementation note: CIF implementations may follow common HTML and XML practice in handling <eol>: [On many modern systems,] lines are typically separated by some combination of the characters carriage-return (\#xD) and line-feed (\#xA). To simplify the tasks of applications, the characters passed to an application … must be as if the … [parser] normalized all line breaks in external parsed entities … on input, before parsing, [e.g.] by translating both the two-character sequence \#xD \#xA and any \#xD that is not followed by \#xA to a single \#xA character.' (From the XML specification http://www.w3.org/TR/2000/REC-xml-20001006 .) Because Unix systems use \n (the ASCII LF control character, or \#xA), MS Windows systems use \r\n (the ASCII CR control character, or \#xD, followed by the ASCII LF control character, or \#xA) and classic MacOS systems use \r, a parser which covers a wide range of systems in a reasonable manner could be constructed using a pseudo-production for <eol> such as provided the supporting infrastructure (such as the lexer) deals with the necessary minor adjustment to ensure that each end of line is recognized and that all end-of-line control characters are filtered out from the portions of the text stream that are to be processed by other productions. One case to handle with care is the end-of-document case. It is not uncommon to encounter a last line in a document that is not terminated by any of the above-mentioned control characters. Instead, it may be terminated by the end of the character stream or by a special end-of-text-document control character [e.g. \#x4 (control-D) or \#x1A (control-Z)]. A CIF parser should normalize such unterminated terminal lines to appear to an application as if they had been properly terminated. On the other hand, care should also be taken so that in multiple generations of CIF processing such processing does not result in an ever-growing tail' of empty lines at the end of a CIF document. This discussion is not meant to imply that a parser for a system that uses one of these line-termination conventions must recognize a CIF written using another of these line-termination conventions. This discussion is not meant to imply that parsers on systems that use other line-termination conventions and/or non-ASCII character sets need to handle these ASCII control characters. In processing a valid CIF document, it is always sufficient that a parser be able to recognize the line-termination conventions of text files local to its system environment, and that it be able to recognize the local translations of <SP><HT> and the printable characters used to construct a CIF. However, when circumstances permit, if a parser is able to recognize alien' line terminations, it is permissible for the parser to accept and process the CIF in that form without treating it as an error. In writing CIF documents, the software that emits lines should follow the text-file line-termination conventions of the target system for which it is writing the CIF documents, and not mix conventions from multiple systems. In transmitting a CIF document from system to system, software should be used that causes the document to conform to the line-termination conventions of the target system. In most cases this objective can best be achieved by using text' or ascii' transmission modes, rather than binary' or image' transmission modes. (43) In order to write the grammar, we need a way to refer to the single-quote characters which we use both to quote within the syntax and to quote within a CIF. To avoid system-dependent confusion, we define the following special tokens: TokenMeaning <SP>  ', the printable space character <HT> the horizontal-tab character on the system <eol> the machine-dependent end of line <noteol> the complement of the above; any character that does not indicate the machine-dependent end of line <single_quote> the apostrophe, ' <double_quote> the double-quote character, " (44) There are CIF specifications not definable directly in a context-free Backus–Naur form (BNF). Restrictions in record and data-name lengths, and the parsing of text fields and quoted character strings are best handled in the initial lexical scan. A pure BNF can then be used to parse the tokenized input stream. #### 2.2.7.3. Lexical tokens | top | pdf | (45) We define a comment' to be initiated with the character #. This can be followed by any sequence of characters (which include <SP> or <HT>). The only characters not allowed are those in the production <eol>, which <eol> terminates a comment. A comment is recognized only at the beginning of a line or after blanks, i.e. only after space, tab or <eol>. For this reason we define both comments and tokenized comments'. No portion of the essential machine-readable content within a CIF is conveyed by the comments. Comments are for the convenience of human readers of CIFs and may be freely introduced or removed. Note however the optional structured comment sanctioned in paragraph (34) above, which has the purpose of indicating the file type and revision level to general-purpose file-handling software. (46) We accept as white space all appropriate combinations of spaces, tabs, end of lines and comments, as well as the beginning of the file. White space are the characters able to delimit the lexical tokens. (47) Non-blank characters are composed of all the characters in our set, excluding <SP> and <HT> and <eol> characters. (48) AnyPrintChar characters are composed of all the characters in our set excluding <eol> characters. (49) We define a line of text' to be a line contained within a semicolon-bounded text field. Hence the first character cannot be a semicolon; it may be followed by any number of characters from the set <char> and terminated with a line-termination character. We define the characters in <TextLeadChar> as those in <AnyPrintChar> except for the semicolon. (50) Ordinary characters are all those printable characters that can initiate a non-quoted character string. These exclude the special characters ", #,$, ', [, ] and _, and in some cases ;.

(51) The reserved word data_ (in a case-insensitive form).

(52) The reserved word loop_ (in a case-insensitive form).

(53) The reserved word save_ (in a case-insensitive form).

(54) The reserved word stop_ (in a case-insensitive form).

(55) The reserved word global_ (in a case-insensitive form). This is actually a reserved word of STAR, but we define it here so that it may be explicitly excluded as an unquoted string. We do this so that any possible future adoption of STAR features will not invalidate existing CIFs.

(56) Quoted strings need to be recognized in the lexical scan, because their definition is context-sensitive. A string quoted by single quotes may contain a single quote as long as it is not followed by white space. A string quoted by double quotes may contain a double quote as long as it is not followed by white space. Formally we express this with context-sensitive productions. In practice, it requires a one-character look-ahead to decide to continue the scan if the opening quote is encountered, but the following character is not space, tab or end of line. When processing a semicolon-delimited text field, the column position has to be remembered to decide whether a semicolon should be recognized.

For a semicolon-delimited text string, failure to provide trailing white space is an error. The <WhiteSpace> on the left-hand side must evaluate to the same string instance on the right-hand side and the parse must terminate on the first valid match reading left to right.

(57) Tags and values are appropriate lexical tokens. The special values of  .' and  ?' represent data that are inapplicable or unknown, respectively.

(i) No string that matches the production for <LOOP_> is accepted as a non-quoted string.

(ii) No string that matches the production for <STOP_> is accepted as a non-quoted string.

(iii) No string in which the initial five characters match the production for <DATA_> is accepted as a non-quoted string.

(iv) No string in which the initial five characters match the production for <SAVE_> is accepted as a non-quoted string.

(v) No string that matches the production for <GLOBAL_> is accepted as a non-quoted string.

Unquoted strings are described by a pair of productions to permit the initial letter of an unquoted string to be a semicolon so long as that does not occur at the beginning of a line. The parser is required to evaluate <noteol> to the same string instance on both sides of the production.

#### 2.2.7.3.1. CIF grammar

| top | pdf |

(58) A CIF may be an empty file, or it may contain only comments or white space, or it may contain one or more data blocks. Comments before the first block are acceptable, and there must be white space between blocks.

(59) For a data block, there must be a data heading and zero or more data items or save frames.

(60) A data-block heading consists of the five characters data_ (case-insensitive) immediately followed by at least one non-blank character selected from the set of ordinary characters or the non-quote-mark, non-blank printable characters.

(61) For a save frame, there must be a save-frame heading, some data items and then the reserved word save_.

(62) A save-frame heading consists of the five characters save_ (case-insensitive) immediately followed by at least one non-blank character selected from the set of ordinary characters or the non-quote-mark, non-blank printable characters.

(63) Data come in two forms:

(i) A data-name tag separated from its associated value by a <WhiteSpace>.

(ii) Looped data. The number of values in the body must be a multiple of the number of tags in the header.

| top | pdf |

#### 2.2.7.4.1. Introduction

| top | pdf |

(1) The Crystallographic Information File (CIF) standard is an extensible mechanism for the archival and interchange of information in crystallography and related structural sciences. Ultimately CIF seeks to establish an ontology for machine-readable crystallographic information – that is, a collection of statements providing the relations between concepts and the logical rules for reasoning about them.

Essential components in the development of such an ontology are:

(a) the basic rules of grammar and syntax, described in Sections 2.2.7.1 to 2.2.7.3;

(b) a vocabulary of the tags or data names specifying particular objects;

(c) a taxonomy, or classification scheme relating the specified objects;

(d) descriptions of the attributes and relationships of individual and related objects.

In the CIF framework, the objects of discourse are described in so-called data dictionary files that provide the vocabulary and taxonomic elements. The dictionaries also contain information about the relationships and attributes of data items, and thus encapsulate most of the semantic content that is accessible to software. In practice, different dictionaries exist to service different domains of crystallography and a CIF that conforms to a specific dictionary must be interpreted in terms of the semantic information conveyed in that dictionary.

However, some common semantic features apply across all CIF applications, and the current document outlines the foundations upon which other dictionaries may build more elaborate taxonomies or informational models.

#### 2.2.7.4.2. Definition of terms

| top | pdf |

(2) The definitions of Section 2.2.7.1.2 also hold for this part of the specification.

#### 2.2.7.4.3. Semantics of data items

| top | pdf |

(3) While the STAR File syntax allows the identification and extraction of tags and associated values, the interpretation of the data thus extracted is application-dependent. In CIF applications, formal catalogues of standard data names and their associated attributes are maintained as external reference files called data dictionaries. These dictionary files share the same structure and syntax rules as data CIFs.

(4) At the current revision, two conventions (known as dictionary definition languages or DDLs) are supported for detailing the meaning and associated attributes of data names. These are known as DDL1 (Hall & Cook, 1995) and DDL2 (Westbrook & Hall, 1995), and they differ in the amount of detail they carry about data types, the relationships between specific data items and the large-scale classification of data items.

(5) While it may be formally possible to define the semantics of the data items in a given data file in both DDL1 and DDL2 data dictionaries, in practice different dictionaries are constructed to define the data names appropriate for particular crystallographic applications, and each such dictionary is written in DDL1 or DDL2 formalism according to which appears better able to describe the data model employed. There is thus in practice a bifurcation of CIF into two dialects according to the DDL used in composing the relevant dictionary file. However, the use of aliases may permit applications tuned to one dialect to import data constructed according to the other.

#### 2.2.7.4.4. Data-name semantics

| top | pdf |

(6) Strictly, data names should be considered as void of semantic content – they are tags for locating associated values, and all information concerning the meaning of that value should be sought in an associated dictionary.

(7) However, it is customary to construct data names as a sequence of components elaborating the classification of the item within the logical structure of its associated dictionary. Hence a data name such as _atom_site_fract_x displays a hierarchical arrangement of components corresponding to membership of nested groupings of data elements. The choice of components readily indicates to a human reader that this data item refers to the fractional x coordinate of an atomic site within a crystal unit cell, but it should be emphasized from a computer-programming viewpoint that this is coincidental; the attributes that constrain the value of this data item (and its relationship to others such as _atom_site_fract_y and _atom_site_fract_z) must be obtained from the dictionary and not otherwise inferred.

(8) Comment: In practice data names described in a DDL2 dictionary are constructed with a period character separating their specific function from the name of the category to which they have been assigned. In the absence of a dictionary file, this convention permits the inference that the data item with name _atom_site.fract_x will appear in the same looped list as other items with names beginning _atom_site., and that all such items belong to the same category.

#### 2.2.7.4.5. Name space

| top | pdf |

(9) The intention of the maintainers of public CIF dictionaries is to formulate a single authoritative set of data names for each CIF dialect (i.e. DDL1 and DDL2), thus facilitating the reliable archive and interchange of crystallographic data. However, it is also permissible for users to introduce local data names into a CIF. Two mechanisms exist to reduce the danger of collision of data names that are not incorporated into public dictionaries.

(10) The character string [local] (including the literal bracket characters) is reserved for local use. That is, no public dictionary will define a data name that includes this string. This allows experimentation with data items in a strictly local context, i.e. in cases where the CIF is not intended for interchange with any other user.

(11) Where CIFs including local data items are expected to enjoy a public circulation, authors may register a reserved prefix for their sole use. The registry is available on the web at http://www.iucr.org/iucr-top/cif/spec/reserved.html .

A reserved prefix, e.g. foo, must be used in the following ways:

(i) If the data file contains items defined in a DDL1 dictionary, the local data names assigned under the reserved prefix must contain it as their first component, e.g. _foo_atom_site_my_item.

(ii) If the data file contains items defined in a DDL2 dictionary, then the reserved prefix must be:

 (a) the first component of data names in a category defined for local use, e.g. _foo_my_category.my_item. (b) the first component following the period character in a data name describing a new item in a category already defined in a public dictionary, e.g. _atom_site.foo_ my_item.

(12) There is no syntactic property identifying such a reserved prefix, so that software validating or otherwise handling such local data names must scan the entire registry and match registered prefixes against the indicated components of data names. Note that reserved prefixes may not themselves contain underscore characters.

#### 2.2.7.4.6. Note on handling of units

| top | pdf |

(13) The published specification for CIF version 1.0 permitted data values expressed in different units to be tagged by variant data names (Hall et al., 1991, p. 657):

… Many numeric fields contain data for which the units must be known. Each CIF data item has a default units code which is stated in the CIF Dictionary. If a data item is not stored in the default units, the units code is appended to the data name. For example, the default units for a crystal cell dimension are ångströms. If it is necessary to include this data item in a CIF with the units of picometres, the data name of _cell_length_a is replaced by _cell_length_a_pm. Only those units defined in the CIF Dictionary are acceptable. The default units, except for the ångström, conform to the SI Standard adopted by the IUCr.

This approach is deprecated and has not been supported by any official CIF dictionary published subsequent to version 1.0 of the core. All data values must be expressed in the single unit assigned in the associated dictionary.

A small number of archived CIFs exist with variant data names as permitted by the above clause. If it is necessary to validate them against versions of the core dictionary subsequent to version 1.0, the formal compatibility dictionary cif_compat.dic (ftp://ftp.iucr.org/cifdics/cif_compat.dic ) may be used for the purpose. No other use should be made of this dictionary.

#### 2.2.7.4.7. Data-value semantics

| top | pdf |

(14) The STAR syntax permits retrieval of data by simply requesting a specific data name within a specific data block. Prior knowledge about data type (e.g. text or numbers), whether the item is looped or whether the item exists in the file at all is unnecessary. However, applications in general need to know data type, valid ranges of values and relationships between data items, and a program designer needs to know the purpose of the data item (i.e. what physical quantity or internal book-keeping function it represents). While such semantic information may be defined informally for local data items (ones not intended for exchange between different users or software applications), formal descriptions of the semantics associated with data values are catalogued in data dictionary files. Currently two formalisms (dictionary definition languages) for describing data-value attributes are supported; full specifications of these formalisms (known as DDL1 and DDL2) are provided in Chapters 2.5 and 2.6 .

#### 2.2.7.4.7.1. Data typing

| top | pdf |

(15) Four base data types are supported in CIF. These are:

(i) numb: a value interpretable as a decimal base number and supplied as an integer, a floating-point number or in scientific notation;

(ii) char: a value to be interpreted as character or text data (where the value contains white-space characters, it must be quoted);

(iii) uchar: a value to be interpreted as character or text data but in a case-insensitive manner (i.e. the values FOO and foo are to be taken as identical);

(iv) null: a special data type associated with items for which no definite value may be stored in computer memory. It is the type associated with the special character literal values ? (query mark) and . (full point), which may appear as values for any data item within a data file (see Section 2.2.7.4.8 below). It is also the type assigned to items defined in dictionary files that may not occur in data files.

(16) Comment: Many applications distinguish between multi-line text fields and character-string values that fit within a single line of text. While this is a convenient practical distinction for coding purposes, formally both manifestations should be regarded as having the same base type, which might be char' or uchar'. Applications are at liberty to choose whether to define specific multi-line text subtypes, and whether to permit casting between subtypes of a base type. The examples of character-string delimiters in Section 2.2.7.1.4(20) are predicated on an approach that handles all subtypes of character or text data equivalently.

(17) Where the attributes of a data value are not available in a dictionary listing, it may be assumed that a character string interpretable as a number should be taken to represent an item of type numb'. However, an explicit dictionary declaration of type will override such an assumption.

#### 2.2.7.4.7.2. Subtyping

| top | pdf |

(18) The base data types detailed in the previous section are very general and need to be refined for practical application. Refinement of types is to some extent application-dependent, and different subtypes are supported for data items defined by DDL1 and DDL2 dictionary files. The following notes indicate some considerations, but the relevant dictionary files and documentation should be consulted in each case.

(19) DDL1 dictionaries. Values of type numb' may include a standard uncertainty in the final digit(s) of the number where the associated item definition includes the attribute (or _type_conditions su, a synonym introduced to DDL1 in 2005). For example, a value of 34.5(12) means 34.5 with a standard uncertainty of 1.2; it may also be expressed in scientific notation as 3.45E1(12).

(20) DDL2 dictionaries. DDL2 provides a number of tags that may be used in a dictionary file to specify subtypes for data items defined by that dictionary alone. Examples of the subtypes specified for the macromolecular CIF dictionary are:

 code identifying code strings or single words ucode identifying code strings or single words (case-insensitive) uchar1 single-character codes (case-insensitive) uchar3 three-character codes (case-insensitive) line character strings forming a single line of text uline character strings forming a single line of text (case-insensitive) text multi-line text int integers float floating-point real numbers yyyy-mm-dd dates symop symmetry operations any any type permitted

#### 2.2.7.4.8. Special generic values

| top | pdf |

(21) The unquoted character literals ? (query mark) and . (full point) are special and are valid expressions for any data type.

(22) The value ? means that the actual value of a requested data item is unknown.

(23) The value . means that the actual value of a requested data item is inapplicable. This is most commonly used in a looped list where a data value is required for syntactic integrity.

#### 2.2.7.4.9. Embedded data semantics

| top | pdf |

(24) The attributes of data items defined in CIF dictionaries serve to direct crystallographic applications in the retrieval, storage and validation of relevant data. In principle, a CIF might include as data items suitably encoded fields representing data suitable for manipulation by text processing, image, spreadsheet, database or other applications. It would be useful to have a formal mechanism allowing a CIF to invoke appropriate content handlers for such data fields; this is under investigation for the next CIF version specification.

#### 2.2.7.4.10. CIF conventions for special characters in text

| top | pdf |

(25) The one existing example of embedded semantics is the text character markup introduced in the CIF version 1.0 specification and summarized in paragraphs (30)–(37) below. The specification is silent on which fields should be interpreted according to these markup conventions, but the published examples suggest that they may be used in any character field in a CIF data file except as prohibited by a dictionary directive. It is intended that the next CIF version specification shall formally declare where such markup may be used.

#### 2.2.7.4.11. Handling of long lines

| top | pdf |

(26) The restriction in line length within CIF requires techniques to handle without semantic loss the content of lines of text exceeding the limit (2048 characters in this revision, 80 characters in the initial CIF specification). The line-folding protocol defined here provides a general mechanism for wrapping lines of text within CIFs to any extent within the overall line-length limit. A specific application where this would be useful is the conversion of lines longer than 80 characters to the CIF version 1.0 limit. This 80-character limit is used in the examples below for illustrative purposes.

These techniques are applied only to the contents of text fields and to comments.

In order to permit such folding, a special semantics is defined for use of the backslash. It is important to understand that this does not change the syntax of CIF version 1.0. All existing CIFs conforming to the CIF version 1.0 specification can be viewed as having exactly the same semantics as they now have. Use of these transformational semantics is optional, but recommended.

In order to avoid confusion between CIFs that have undergone these transformations and those that have not, the special comment beginning with a hash mark immediately followed by a backslash ( #\) as the last non-blank characters on a line is reserved to mark the beginning of comments created by folding long-line comments, and the special text field beginning with the sequence line termination, semicolon, backslash ( <eol>;\) as the only non-blank characters on a line is reserved to mark the beginning of text fields created by folding long-line text fields.

The backslash character is used to fold long lines in character strings and comments. Consider a comment which extends beyond column 80. In order to provide a comment with the same meaning which can be fitted into 80-character lines, prefix the comment with the special comment consisting of a hash mark followed by a backslash ( #\) and the line terminator. Then on new lines take appropriate fragments of the original comment, beginning each fragment with a hash mark and ending all but the last fragment with a backslash. In doing this conversion, check for an original line that ends with a backslash followed only by blanks or tabs. To preserve that backslash in the conversion, add another backslash after it. If the next lexical token (not counting blanks or tabs) is another comment, to avoid fusing this comment with the next comment, be sure to insert a line with just a hash mark.

Similarly, for a character string that extends beyond column 80,

(i) first convert it to be a text field delimited by line termination–semicolon ( <eol>;) sequences,

(ii) then change the initial line termination–semicolon ( <eol>;) sequence to line termination–semicolon–backslash–line termination ( <eol>;<eol>),

(iii) and break all subsequent lines that do not fit within 80 columns with a trailing backslash. In the course of doing the translation,

 (a) check for any original text lines that end with a backslash followed only by blanks or tabs; (b) to preserve that backslash in the conversion, add another backslash after it, and then an empty line.

(More formally, the line folding should be done separately and directly on single-line non-semicolon-delimited character strings to allow for recognition of the fact that no terminal line termination is intended – see below.)

In order to understand this scheme, suppose the CIF fragment (1) below were considered to have long lines. They could be transformed into (2) as follows:

(1) Initial CIF

(2) Transformed CIF

In making the transformation from the backslash-folded form to long lines, it is very important to strip trailing blanks before attempting to recognize a backslash as the last character. When reassembling text-field lines, no reassembly should be done except in text fields that begin with the special sequence described above, line termination–semicolon–backslash–line termination, ( <eol>;<eol>), so that text fields that happen to contain backslashes but which were not created by folding long lines are not changed. It is also important to remove the trailing backslashes when reassembling long lines. The final line termination–semicolon sequence of a text field takes priority over the reassembly process and ends it, but a trailing backslash on the last line of a text field very nicely conveys the information that no trailing line termination is intended to be included within the character string.

Similarly, when reassembling long-line comments, the reassembly begins with a comment of the form hash–backslash–line termination. The initial hash mark is retained and then a forward scan is made through line terminations and blanks for the next comment, from which the initial hash mark is stripped and then the contents of the comment are appended. If that comment ends with a backslash, the trailing backslash is stripped and the process repeats. Note that the process will be ended by intervening tags, values, data blocks or other non-white-space information, and that the process will not start at all without the special hash–backslash–line termination comment.

Since there are very few, if any, CIFs that contain text fields and comments beginning this way, in most cases it is reasonable to adopt the policy of doing this processing unless it is disabled.

Here is another example of folding. The following three text fields would be equivalent: and but the following example would be a two-line value where the first line had the value C:\foldername\file\ and the second had the value name:

Note that backslashes should not be used to fold lines outside of comments and text fields. That would introduce extraneous characters into the CIF and violate the basic syntax rules. In any case, such action is not necessary.

#### 2.2.7.4.12. Dictionary compliance

| top | pdf |

(27) Dictionary files containing the definitions and attribute sets for the data items contained in a CIF should be identified within the CIF by some or all of the data items corresponding to DDL1 dictionaries or for DDL2 dictionaries. Where no such information is provided, it may be assumed that the file should conform against the core CIF dictionary.

(28) The _audit_conform data items may be looped in cases where more than one dictionary is used to define the items in a CIF and they may include dictionaries of local data items provided such dictionary files have been prepared in accordance with the rules of the appropriate DDL.

(29) A detailed protocol exists for locating, merging and overlaying multiple dictionary files (McMahon et al., 2000) (see Section 3.1.9 ).

#### 2.2.7.4.13. CIF markup conventions

| top | pdf |

(30) If permitted by the relevant dictionary and if no other indication is present, the contents of a text or character field are assumed to be interpretable as text in English or some other human language. Certain special codes are used to indicate special characters or accented letters not available in the ASCII character set, as listed below.

#### 2.2.7.4.14. Greek letters

| top | pdf |

(31) In general, the corresponding letter of the Latin alphabet, prefixed by a backslash character. The complete set is:

 A \a \A alpha N \n \N nu B \b \B beta O \o \O omicron X \c \C chi \p \P pi \d \D delta \q \Q theta E \e \E epsilon R \r \R rho \f \F phi \s \S sigma \g \G gamma T \t \T tau H \h \H eta U \u \U upsilon I \i \I iota \w \W omega K \k \K kappa \x \X xi \l \L lambda \y \Y psi M \m \M mu Z \z \Z zeta

#### 2.2.7.4.15. Accented letters

| top | pdf |

(32) Accents should be indicated by using the following codes before the letter to be modified (i.e. use \'e for an acute e):

 \' acute (é) \" umlaut (ü) \= overbar or macron (ā) \ grave (à) \~ tilde (ñ) \. overdot () \^ circumflex (â) \; ogonek (ų) \< hacek or caron () \, cedilla (ç) \> Hungarian umlaut or double accented (ő) \( breve ()

These codes will always be followed by an alphabetic character.

#### 2.2.7.4.16. Other characters

| top | pdf |

(33) Other special alphabetic characters should be indicated as follows:

 \%a a-ring (å) \?i dotless i (ı) \&s German `ss' (ß) \/o o-slash (ø) \/l Polish l (ł) \/d barred d (đ)

Capital letters may also be used in these codes, so an ångström symbol (Å) may be given as \%A.

(34) Superscripts and subscripts should be indicated by bracketing relevant characters with circumflex or tilde characters, thus:

 superscripts Csp^3^ for Csp3 subscripts U~eq~ for Ueq

(35) Some other codes are accepted by convention. These are:

 \% degree (°) \times × -- dash +- --- single bond -+ \db double bond \square \tb triple bond \neq \ddb delocalized double bond \rangle \sim \langle (Note: ~  is the code for subscript) \rightarrow \simeq \leftarrow \infty

Note that \db, \tb and \ddb should always be followed by a space, e.g. C=C is denoted by C\db C.

#### 2.2.7.4.17. Typographic style codes

| top | pdf |

(36) The codes indicated above are designed to refer to special characters not expressible within the CIF character set, and the initial specification did not permit markup for typographic style such as italic or bold-face type. However, in some cases the ability to indicate type style is useful, and in addition to the codes above HTML-like conventions are allowed of surrounding text by <i> </i> to indicate the beginning and end of italic, and by <b> </b> to indicate the beginning and end of bold-face type.

(37) If it is necessary to convey more complex typographic information than is permitted by these special character codes and conventions, the entire text field should be of a richer content type allowing detailed typographic markup.

### References

Bernstein, H. J. (2002). Some comments on parsing for computer programming languages. http://www.bernstein-plus-sons.com/TMM/Parsing .
Hall, S. R. (1991). The STAR File: a new format for electronic data transfer and archiving. J. Chem. Inf. Comput. Sci. 31, 326–333.
Hall, S. R., Allen, F. H. & Brown, I. D. (1991). The Crystallographic Information File (CIF): a new standard archive file for crystallography. Acta Cryst. A47, 655–685.
Hall, S. R. & Cook, A. P. F. (1995). STAR dictionary definition language: initial specification. J. Chem. Inf. Comput. Sci. 35, 819–825.
Hall, S. R. & Spadaccini, N. (1994). The STAR File: detailed specifications. J. Chem. Inf. Comput. Sci. 34, 505–508.
McMahon, B., Westbrook, J. D. & Bernstein, H. J. (2000). Report of the COMCIFS Working Group on Dictionary Maintenance. http://www.iucr.org/iucr-top/cif/spec/dictionaries/maintenance.html .
Westbrook, J. D. & Hall, S. R. (1995). A dictionary description language for macromolecular structure. http://ndbserver.rutgers.edu/mmcif/ddl/ .