International
Tables for
Crystallography
Volume G
Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

International Tables for Crystallography (2006). Vol. G, ch. 2.2, pp. 29-30

Section 2.2.7.2.2. Explanation of the formal syntax

S. R. Hall,a* N. Spadaccini,c I. D. Brown,d H. J. Bernstein,e J. D. Westbrookb and B. McMahonf

2.2.7.2.2. Explanation of the formal syntax

| top | pdf |

Comment: Readers not familiar with the conventions used in describing language grammars may wish to consult various lecture notes on the subject available on the web, e.g. Bernstein (2002[link]).

(39) In creating a parser for CIF, the normal process is to first perform a `lexical scan' to identify `tokens' in the CIF. A `token' is a grammatical unit, such as a special character, or a tag or a value, or some major grammatical subunit. In the course of a lexical scan, the input stream is reduced to manageable pieces, so that the rest of the parsing may be done more efficiently. The convention followed in this document is to mark the `non-terminal' tokens that are built up out of actual strings of characters or which do not have an immediate representation as printable characters by angle brackets, <>, and to indicate the tokens that are actual strings of characters as quoted strings of characters.

(40) The precise division between a lexical scan and a full parse is a matter of convenience. A suggested division is presented. Before getting to that point, however, there are some highly machine-dependent matters that need to be resolved. There must be a clear understanding of the character set to be used, and of how files and lines begin and end. The character set will be specified in terms of printable characters and a few control characters from the 7-bit ASCII character set. In addition, we will need some means of specifying the end of a line.

(41) The character set in CIF is restricted to the ASCII control characters <HT> (horizontal tab, position 09 in the ASCII character set), <NL> (newline, position 10 in the ASCII character set, also named <LF>) and <CR> (carriage return, position 13 in the ASCII character set), and the printable characters in positions 32–126 of the ASCII character set. These are the characters permitted by STAR with the exception of VT (vertical tab, position 11 in the ASCII character set) and FF (form feed, position 12 in the ASCII character set). In general it is poor practice to use characters that are not common to all national variants of the ISO character set. On systems or in programming languages that do not `work in ASCII', the characters themselves may have different numeric values and in some cases there is no access to all the control characters.

(42) The <eol> token stands for the system-dependent end of line.

Implementation note: CIF implementations may follow common HTML and XML practice in handling <eol>:

`[On many modern systems,] lines are typically separated by some combination of the characters carriage-return (\#xD) and line-feed (\#xA). To simplify the tasks of applications, the characters passed to an application … must be as if the … [parser] normalized all line breaks in external parsed entities … on input, before parsing, [e.g.] by translating both the two-character sequence \#xD \#xA and any \#xD that is not followed by \#xA to a single \#xA character.'

(From the XML specification http://www.w3.org/TR/2000/REC-xml-20001006 .)

Because Unix systems use \n (the ASCII LF control character, or \#xA), MS Windows systems use \r\n (the ASCII CR control character, or \#xD, followed by the ASCII LF control character, or \#xA) and classic MacOS systems use \r, a parser which covers a wide range of systems in a reasonable manner could be constructed using a pseudo-production for <eol> such as [Scheme scheme10] provided the supporting infrastructure (such as the lexer) deals with the necessary minor adjustment to ensure that each end of line is recognized and that all end-of-line control characters are filtered out from the portions of the text stream that are to be processed by other productions. One case to handle with care is the end-of-document case. It is not uncommon to encounter a last line in a document that is not terminated by any of the above-mentioned control characters. Instead, it may be terminated by the end of the character stream or by a special end-of-text-document control character [e.g. \#x4 (control-D) or \#x1A (control-Z)]. A CIF parser should normalize such unterminated terminal lines to appear to an application as if they had been properly terminated. On the other hand, care should also be taken so that in multiple generations of CIF processing such processing does not result in an ever-growing `tail' of empty lines at the end of a CIF document.

This discussion is not meant to imply that a parser for a system that uses one of these line-termination conventions must recognize a CIF written using another of these line-termination conventions.

This discussion is not meant to imply that parsers on systems that use other line-termination conventions and/or non-ASCII character sets need to handle these ASCII control characters.

In processing a valid CIF document, it is always sufficient that a parser be able to recognize the line-termination conventions of text files local to its system environment, and that it be able to recognize the local translations of <SP><HT> and the printable characters used to construct a CIF.

However, when circumstances permit, if a parser is able to recognize `alien' line terminations, it is permissible for the parser to accept and process the CIF in that form without treating it as an error.

In writing CIF documents, the software that emits lines should follow the text-file line-termination conventions of the target system for which it is writing the CIF documents, and not mix conventions from multiple systems. In transmitting a CIF document from system to system, software should be used that causes the document to conform to the line-termination conventions of the target system. In most cases this objective can best be achieved by using `text' or `ascii' transmission modes, rather than `binary' or `image' transmission modes.

(43) In order to write the grammar, we need a way to refer to the single-quote characters which we use both to quote within the syntax and to quote within a CIF. To avoid system-dependent confusion, we define the following special tokens:

TokenMeaning
<SP> ` ', the printable space character
<HT> the horizontal-tab character on the system
<eol> the machine-dependent end of line
<noteol> the complement of the above; any character that does not indicate the machine-dependent end of line
<single_quote> the apostrophe, '
<double_quote> the double-quote character, "

(44) There are CIF specifications not definable directly in a context-free Backus–Naur form (BNF). Restrictions in record and data-name lengths, and the parsing of text fields and quoted character strings are best handled in the initial lexical scan. A pure BNF can then be used to parse the tokenized input stream.

References

Bernstein, H. J. (2002). Some comments on parsing for computer programming languages. http://www.bernstein-plus-sons.com/TMM/Parsing .








































to end of page
to top of page