Tables for
Volume G
Definition and exchange of crystallographic data
Edited by S. R. Hall and B. McMahon

International Tables for Crystallography (2006). Vol. G, ch. 5.3, pp. 499-501

Section 5.3.2. Syntax checker

B. McMahon^a*

^a International Union of Crystallography, 5 Abbey Square, Chester CH1 2HU, England
Correspondence e-mail:

5.3.2. Syntax checker


A CIF must conform to a subset of the syntax rules of a general STAR File (Chapter 2.1), but with the additional restrictions and conventions described in Chapter 2.2. The syntax is rather simple, and robust subroutines to create CIFs may easily be written by computer programmers. However, the use of ASCII character sets, deliberately expressive data names and simple layout conventions both permit and encourage users to edit the files with general text editors that cannot guarantee to retain syntactic integrity. Consequently, there is a definite use for a simple program that can check whether a file conforms to the specified syntax.

Programmable text editors such as emacs may be supplied with rules that check syntax as a file is edited. A simple rule set (known as a mode file) has been developed (Winn, 1998) to indicate the different components of a CIF, as a first step towards a syntax-checking emacs mode.

The Star.vim utility of Section 5.2.4 provides similar functionality for editing in the vim environment, although it is not capable of validation directly; nevertheless, the appearance of unexpected or irregularly highlighted text can draw the user's attention to syntactic problems, a feature that is also useful in more extended editors such as enCIFer (Section[link]).

vcif

A simple syntax checker for CIF is the program vcif (McMahon, 1998), which scans a text file and outputs informative messages about apparent errors. While conservative CIF parsing software will quit upon finding an error, vcif will attempt to read to the end of the file and list all clearly distinguished errors. However, its interpretation of errors depends on a close adherence to the CIF syntax specification and makes no assumption about the intended purpose of the character strings it reads. In consequence, a single logical error such as failing to terminate a multiple-line text string may cause the program to report many other apparent errors as it proceeds out of phase through the rest of the file.

How to use vcif

The program may be run under Unix or DOS by typing

vcif filename

where filename is the name of the file to test. If filename is given as the hyphen character -, the program will read standard input. Standard input will also be read if no file name is supplied; this allows the program to be used in a pipeline of commands.
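
The input convention described above (a named file, or standard input when the name is a hyphen or absent) can be illustrated with a minimal Python sketch. This is not the source code of vcif; the function name `open_input` and the choice of ASCII decoding with replacement of undecodable bytes are assumptions made for illustration only.

```python
import sys


def open_input(name=None):
    """Return a readable text stream for `name`.

    A missing name or a single hyphen selects standard input, which is
    what allows a checker to sit in the middle of a pipeline.
    """
    if name is None or name == "-":
        return sys.stdin
    # Hypothetical choice: decode as ASCII and replace anything else,
    # so that stray bytes do not abort the read.
    return open(name, "r", encoding="ascii", errors="replace")
```

A script built on this sketch could then be called either with a file name or at the end of a pipe, in the same way as the command shown above.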

A number of options may be supplied to the program to modify its behaviour. Without these options (i.e. invoked as above) a brief but informative message is written to the standard output channel for each occurrence of what the program perceives to be a syntax error.

For example, for the incorrect sample file of Fig.[link](a), the output is listed in Fig.[link](b).


Figure. (a) An example CIF with a number of syntax errors and (b) the report of the errors produced by vcif.

Note that the sequence number of the line in which the error occurs is printed. The summary error message is output on a single line (longer lines have been wrapped and indented in Fig.[link] for legibility). Where the type of error necessarily affects only a single line, the program can recover and correctly identify errors on subsequent lines. Where possible, unexpected character strings are printed to help the user to identify the error. No attempt is made to assign any meaning to the data names or the data values in the file. Hence the same logical error (the detachment of a standard uncertainty in parentheses from its parent value) is indicated variously as an unexpected text string or as an extraneous loop item, depending on where it occurs in the file. Indeed, in the case of the incorrect number of loop elements, the program makes no attempt to identify which data value or values in the loop might be in error: it simply counts the number of values in a loop and complains when this is not a multiple of the number of data names declared in the loop header.

Options to vcif

A number of options may be supplied as command-line arguments to modify the output from vcif.

A more complete account is given of each error on its first occurrence when the program is invoked with the `-v' option. The output listing explains in more detail what the breach of syntax is and sometimes suggests how misunderstandings of the file structure result in such breaches (Fig.[link]).


Figure. Verbose error listing from vcif when run with the `-v' option on the example of Fig.[link].

Each error message is prefaced by the word `ERROR' (or occasionally another phrase such as `WARNING' or `STAR ERROR'). Three chevrons preface a printout of the beginning of the troublesome line. Then an expanded description of the error is given, prefaced by three asterisks, on the first occurrence of each distinct error. In this mode, only the first 20 errors are listed (the assumption is that this mode is best suited to novices, who should identify and correct each error in turn and would not want to be swamped by large numbers of error messages arising from a single error). More errors may be reported by using the `-e' command-line option.

The quiet option (vcif -q) outputs no error messages but instead returns to the calling environment an integer giving the total number of errors found. This option allows scripts or external programs to use vcif as a silent test of whether a file has any syntax errors. A related option, vcif -b, likewise counts errors and returns the result as an integer to the calling environment, but additionally outputs a list of all the data-block codes in the file. While adding nothing to the syntax-checking function of the program, this provides a useful small utility for simply listing data-block names.
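
As a hedged illustration of how a script might use the quiet option, the following Python sketch runs `vcif -q` on a file and reads the returned integer from the process exit status. It assumes that a `vcif` executable is available on the search path; the function name `count_syntax_errors` and its `command` parameter are hypothetical and are introduced only so that a stand-in program can be substituted.

```python
import subprocess


def count_syntax_errors(path, command=("vcif", "-q")):
    """Run the checker quietly and return the integer it hands back.

    `command` is a hypothetical hook for substituting another program
    when vcif is not installed; the default follows the `vcif -q`
    invocation described in the text.
    """
    completed = subprocess.run([*command, path], check=False)
    return completed.returncode
```

A calling script might treat a return value of zero as a clean file and any other value as a reason to run the checker again without `-q` to see the messages. On POSIX systems an exit status carries only eight bits, so it is safer to rely on the zero or non-zero distinction than on the exact count for files with very many errors.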

Although intended for use with the restricted STAR File syntax permitted for CIF (Chapter 2.2), vcif may also be used with the `-s' option to check the syntax of CIF dictionary files, which may include save frames. The program does not, however, handle nested loop structures.

The program will flag as an error any line longer than 80 characters (the original limit in the CIF version 1.0 specification; see Chapter 2.2), but this behaviour may be overridden with the `-l' option. If this option is used, only lines longer than the specified number of characters will be reported, and the reports of such lines will be prefaced with the word `WARNING'. Likewise, the `-w' option may be used to override the CIF version 1.0 restriction of data names and data-block codes to 32 characters.
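
The line-length test lends itself to a short sketch. The Python function below is an illustration of the idea only, not the implementation used by vcif; the name `long_lines` and the `limit` parameter (standing in for the number supplied with `-l') are assumptions.

```python
def long_lines(text, limit=80):
    """Return (line_number, length) for each line longer than `limit`.

    The default of 80 is the CIF version 1.0 limit quoted in the text;
    a caller may pass a larger value to relax the test.
    """
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]
```

Lengths are counted without the line terminator, so a line of exactly 80 characters is accepted under the default limit.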

Other options allow the program to write extensive debugging information to a user-specified file, indicating its internal state upon processing each token of input, and to list either a brief summary of how it may be used or its current version number.

Limitations of vcif

Because the program is testing certain properties of character strings within logical lines of a file, it stores a line at a time for further internal processing. If a line contains a null character (an ASCII character with integer value zero), this will be taken as the termination of the string currently being processed, according to the normal conventions in the C programming language for marking the end of a text string. In this case, subsequent error messages may not reflect the real problem. The null character, of course, is not allowed in a CIF.
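
The effect of an embedded null character can be imitated in a few lines of Python. The sketch below is an assumption-laden illustration rather than a description of vcif internals: `c_string_view` mimics what a C string routine would see of a stored line, and `lines_with_null` is a hypothetical helper for screening a file before it is checked.

```python
def c_string_view(raw_line: bytes) -> bytes:
    """Return the part of a line visible to a null-terminated string routine."""
    return raw_line.split(b"\0", 1)[0]


def lines_with_null(data: bytes):
    """Return the 1-based numbers of lines that contain a null byte."""
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if b"\0" in line
    ]
```

Everything after the first null byte on a line is invisible in the truncated view, which is why the messages that follow may not reflect the real problem.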

vcif also interprets syntax rules literally, so a misplaced semicolon might mean that a large section of the file is regarded as a text field and too many or too few error messages are generated. This can make a correct interpretation of the causative errors difficult for a novice user.
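
A literal reading of the text-field rule can be sketched as a two-state scan in which every line beginning with a semicolon toggles between "inside" and "outside" a text field. The Python function below is a deliberate simplification for illustration (it ignores every other CIF token and is not how vcif is written); the name `text_field_spans` is hypothetical.

```python
def text_field_spans(text):
    """Return (start, end) line numbers of semicolon-delimited text fields.

    A field still open at the end of the input is reported with end=None.
    """
    spans = []
    start = None
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith(";"):
            if start is None:
                start = number                 # opening semicolon
            else:
                spans.append((start, number))  # closing semicolon
                start = None
    if start is not None:
        spans.append((start, None))
    return spans
```

If the closing semicolon of one field is omitted, the opening semicolon of the next field is taken as the terminator, and every later field is read out of phase: text is treated as data, data as text, and the last field may never be closed.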


McMahon, B. (1998). vcif: a utility to validate the syntax of a Crystallographic Information File.
Winn, M. (1998). cif.el: an Emacs mode for CIF. Daresbury Laboratory, Warrington, England.
