International Tables for Crystallography Volume H Powder diffraction Edited by C. J. Gilmore, J. A. Kaduk and H. Schenk © International Union of Crystallography 2018
International Tables for Crystallography (2018). Vol. H, ch. 3.8, pp. 329-331
Section 3.8.4. Data visualization
^{a}Department of Chemistry, University of Glasgow, University Avenue, Glasgow, G12 8QQ, UK
It is important when dealing with large data sets to have suitable visualization tools. These tools are also a valuable resource for exploring smaller data sets. This methodology provides four primary aids:
These aids give graphical views of the data that are semi-independent and thus can be used to check for consistency and discrepancies in the clustering. They are also interactive. No one method is optimal, and a combination of mathematical and visualization techniques is required, techniques that often need tuning for each individual application (Barr, Cunningham et al., 2009; Barr, Dong & Gilmore, 2009).
3.8.4.2. Secondary visualization using parallel coordinates, the grand tour and minimum spanning trees
In the MMDS and PCA methods, p is set to 3 [equation (3.8.16)] in order to work in three dimensions; the X matrix can then be used to plot each pattern as a single point in a 3D graph. However, this reduces the dimensionality of the data to three, and the question arises as to the validity of this: are three dimensions sufficient? The use of parallel-coordinates plots coupled with the grand tour can assist here, as well as giving us an alternative view of the data.
A parallel-coordinates plot is a graphical data-analysis technique for plotting multivariate data. Usually orthogonal axes are used when doing this, but in parallel-coordinates plots orthogonality is abandoned and replaced with a set of N equidistant parallel axes, one for each variable and labelled X1, X2, X3,…, XN (Inselberg, 1985, 2009; Wegman, 1990). Each data point is plotted on every axis, and its positions on adjacent axes are joined by line segments, so that the data become a set of lines, one for each data point. The lines are given the colours of the cluster to which they belong as defined by the current dendrogram. A parallel-coordinates display can be interpreted as a generalization of a two-dimensional scatterplot, and it allows the display of an arbitrary number of dimensions. The method can also be used to validate the clustering itself without using dendrograms. Using this technique it is possible to determine whether the clustering shown by the MMDS (or PCA) plot in three dimensions continues in higher dimensions.
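The construction described above can be sketched in a few lines of Python. This is only an illustration, not the PolySNAP implementation: the function name `parallel_coordinates`, the per-axis rescaling to the range 0 to 1 and the sample coordinates are all assumptions made for the example.

```python
def parallel_coordinates(points):
    """Return one polyline per data point as a list of (axis_index, height) vertices."""
    n_axes = len(points[0])
    # Range of each variable; assumed here to be used to rescale every axis to [0, 1].
    lows = [min(p[k] for p in points) for k in range(n_axes)]
    highs = [max(p[k] for p in points) for k in range(n_axes)]
    polylines = []
    for p in points:
        line = []
        for k in range(n_axes):
            span = highs[k] - lows[k]
            # A constant variable is drawn at mid-height to avoid dividing by zero.
            height = 0.5 if span == 0 else (p[k] - lows[k]) / span
            line.append((k, height))
        polylines.append(line)
    return polylines


# Hypothetical 6D coordinates for four patterns (illustrative values only).
coords = [
    [0.0, 0.2, 0.1, 0.3, 0.1, 0.0],
    [0.1, 0.3, 0.0, 0.2, 0.2, 0.1],
    [0.9, 0.8, 0.7, 0.9, 0.8, 1.0],
    [1.0, 0.9, 0.8, 1.0, 0.9, 0.9],
]
# Hypothetical cluster labels; a plotting routine would use these to colour the lines.
clusters = [0, 0, 1, 1]

lines = parallel_coordinates(coords)
```

Each entry of `lines` is the polyline for one pattern; drawing the polylines with one colour per value in `clusters` gives the kind of display described in the text. In this made-up example the two clusters stay separated on all six axes.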
Fig. 3.8.3 shows a typical example for a set of 80 organic samples partitioned into four clusters (Barr, Dong & Gilmore, 2009). The plot shows that the clustering looks realistic when viewed in this way and that it is maintained when the data are examined in six dimensions.
The grand tour is a method of animating the parallel-coordinates plot to examine it from all possible viewpoints. Consider a 3D data plot using orthogonal axes: a grand tour takes 2D sections through these data and displays them in parallel-coordinates plots in a way that explores the entire space continuously. Exploring the entire space is important because the data can then be seen from all points of view, and continuity allows the user to follow the data without abrupt discontinuities. This concept was devised by Asimov (1985) and further developed by Wegman (1990). In more than three dimensions it becomes a generalized rotation of all the coordinate axes. A d-dimensional tour is a continuous geometric transformation of a d-dimensional coordinate system such that all possible orientations of the coordinate axes are eventually achieved. The algorithm for generating a smooth and complete view of the data is described by Asimov (1985).
To apply the grand tour, the restriction of p = 3 in the MMDS calculation is relaxed to p = 6, so that there is now a 6D data set with six orthogonal axes. The choice of six is somewhat arbitrary: more can be used, but six is sufficient to see whether the clustering is maintained without generating unduly complex plots or requiring extensive computing resources. The data are plotted as a parallel-coordinates plot. The grand-tour method is then applied by a continuous geometric transformation of the 6D coordinate system such that all possible orientations of the axes are achieved. Each orientation is reproduced as a parallel-coordinates plot using six axes.
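A continuous rotation of a 6D coordinate system can be illustrated with the following Python sketch. It is a simplified stand-in and not a reproduction of the algorithm of Asimov (1985): the rotation is built as a product of plane rotations, one for each pair of axes, whose angles advance at mutually incommensurate speeds. The function names, the choice of speeds (square roots of primes) and the sample points are assumptions made for the example.

```python
import math
from itertools import combinations

# Assumed angular speeds: square roots of distinct primes, one per axis pair.
# Fifteen primes cover the 15 axis pairs of a 6D space.
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]


def givens(d, i, j, angle):
    """d x d matrix rotating by `angle` in the plane spanned by axes i and j."""
    m = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
    m[i][i] = m[j][j] = math.cos(angle)
    m[i][j] = -math.sin(angle)
    m[j][i] = math.sin(angle)
    return m


def matmul(a, b):
    """Product of two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[r][k] * b[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def tour_rotation(d, t):
    """Rotation matrix at tour time t (supports d <= 6 with the speeds above)."""
    rot = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
    for n, (i, j) in enumerate(combinations(range(d), 2)):
        rot = matmul(rot, givens(d, i, j, math.sqrt(PRIMES[n]) * t))
    return rot


def rotate(points, rot):
    """Coordinates of every point after applying the rotation matrix."""
    d = len(rot)
    return [[sum(rot[r][c] * p[c] for c in range(d)) for r in range(d)]
            for p in points]


# Two hypothetical 6D points (illustrative values only).
points = [[0.1, 0.2, 0.0, 0.3, 0.1, 0.0],
          [0.9, 0.8, 0.7, 0.9, 0.8, 1.0]]

# Three successive frames of the tour; a small time step keeps the motion smooth.
frames = [rotate(points, tour_rotation(6, 0.05 * step)) for step in range(3)]
```

Each frame would be drawn as a six-axis parallel-coordinates plot. Because every frame is a pure rotation, the distances between points are unchanged; only the viewpoint varies from frame to frame.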
Figs. 3.8.9(j) and (k) show an example from the clustering of the 13 aspirin samples using PXRD data. Fig. 3.8.9(j) shows the default parallel-coordinates plot. Fig. 3.8.9(k) shows alternative views of the data taken from the grand tour. In Fig. 3.8.9(j) there appears to be considerable overlap between clusters in the 4th, 5th and 6th dimensions (X4, X5 and X6), but the alternative views given in Fig. 3.8.9(k) show that the clustering is actually well defined in all six dimensions (Barr, Dong & Gilmore, 2009).
The minimum spanning tree (MST) displays the MMDS plot as a tree whose points are the data from the MMDS calculation (in three dimensions) and whose weights are the distances between these points. The minimum-spanning-tree problem is that of joining the points with a minimum total edge weight. (As an example, airlines use minimum spanning trees to work out their basic route systems: the best set of routes taking into account airport hubs, passenger numbers, fuel costs etc. is the minimum spanning tree.) Because a tree is used, each point is only allowed a maximum of three connections to other points.
To construct the tree, Kruskal's (1956) algorithm can be used: the edges are considered in order of increasing weight, and each edge is added to the tree if it does not create a cycle and is discarded otherwise. This process continues until the spanning tree is complete. An example is shown in Fig. 3.8.7 for the 13-sample aspirin data. A complete tree for this data set using three dimensions and the MMDS-derived coordinates is shown in Fig. 3.8.7(a). This has 12 links between the 13 data points. Reducing the number of links to 10 gives Fig. 3.8.7(b).
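A minimal Python sketch of Kruskal's algorithm for points in three dimensions is given below. It uses Euclidean distances as edge weights and a union-find structure to detect cycles; it does not impose the limit on the number of connections per point mentioned above. The function name `kruskal_mst` and the sample coordinates are hypothetical, and the final pruning step assumes that links are removed longest first, which the text does not state.

```python
import math
from itertools import combinations


def kruskal_mst(points):
    """Return the minimum spanning tree as a list of (weight, i, j) edges."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root of i's component, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by increasing Euclidean distance.
    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    tree = []
    for weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # The edge joins two separate components, so it cannot create a cycle.
            parent[root_i] = root_j
            tree.append((weight, i, j))
            if len(tree) == len(points) - 1:
                break  # a spanning tree of n points has n - 1 edges
    return tree


# Hypothetical 3D coordinates: two tight groups far apart (illustrative values only).
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
          (10.0, 0.0, 0.0), (10.0, 1.0, 0.0)]

tree = kruskal_mst(points)

# Assumption: links are removed longest first when the number of links is reduced.
pruned = sorted(tree)[:-1]
```

For these five points the tree has four links, and dropping the longest one leaves the two groups as separate subtrees.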
References
Asimov, D. (1985). The grand tour: a tool for viewing multidimensional data. SIAM J. Sci. Stat. Comput. 6, 128–143.
Barr, G., Cunningham, G., Dong, W., Gilmore, C. J. & Kojima, T. (2009). High-throughput powder diffraction V: the use of Raman spectroscopy with and without X-ray powder diffraction data. J. Appl. Cryst. 42, 706–714.
Barr, G., Dong, W. & Gilmore, C. J. (2009). PolySNAP3: a computer program for analysing and visualizing high-throughput data from diffraction and spectroscopic sources. J. Appl. Cryst. 42, 965–974.
Inselberg, A. (1985). The plane with parallel coordinates. Vis. Comput. 1, 69–91.
Inselberg, A. (2009). Parallel Coordinates. Visual multidimensional geometry and its applications. New York: Springer.
Kruskal, J. B. (1956). On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Am. Math. Soc. 7, 48–50.
Wegman, E. J. (1990). Hyperdimensional data analysis using parallel coordinates. J. Am. Stat. Assoc. 85, 664–675.