International Tables for Crystallography Volume H: Powder Diffraction. Edited by C. J. Gilmore, J. A. Kaduk and H. Schenk. © International Union of Crystallography 2018
International Tables for Crystallography (2018). Vol. H, ch. 3.8, pp. 325–343.
https://doi.org/10.1107/97809553602060000953

Chapter 3.8. Clustering and visualization of powder-diffraction data

^{a}Department of Chemistry, University of Glasgow, University Avenue, Glasgow G12 8QQ, UK

The use of clustering to classify powder X-ray diffraction patterns is presented, together with the associated visualization techniques. The methods provide an easy way to handle large volumes of data.
In high-throughput crystallography, crystallization experiments using robotics coupled with automatic sample changers and two-dimensional (2D) detectors can generate and measure over 1000 powder-diffraction patterns on a series of related compounds, often polymorphs or salts, in a day (Storey et al., 2004). It is also possible to measure spectroscopic data simultaneously, especially Raman (Alvarez et al., 2009). The analysis of these patterns poses a difficult statistical problem: the data need to be classified by putting the samples into clusters based on diffraction-pattern similarity, so that unusual samples can be readily identified. At the same time, suitable visualization tools are required to help in the data-classification process; the techniques of classification and visualization go hand-in-hand. Interestingly, the techniques developed for large data sets with poor-quality data also have great value when looking at smaller data sets, and the visualization tools developed for high-throughput studies are especially useful when looking at phase transitions, mixtures etc.
In this chapter the methods for comparing whole patterns will be described. The mathematics of cluster analysis will then be explained, followed by a discussion of the associated visualization tools. Examples using small data sets from pharmaceuticals, inorganics and phase transitions will be given; the techniques used can be readily scaled up for handling large, high-throughput data sets. The same methods also work for spectroscopic data, and the use of such information with and without powder X-ray diffraction (PXRD) data will be discussed. Finally, the use of visualization tools in quality control is demonstrated.
Comparing 1D diffraction patterns or spectra cannot be done by simply using the peaks and their relative intensities for a number of reasons:
In order to use the information contained within the full profile, algorithms are required that utilize each measured data point in the analysis. We use two correlation coefficients for the purpose of comparing PXRD patterns: the Pearson and the Spearman coefficients.
Consider two diffraction patterns, i and j, each with n measured points ((x_{1}, y_{1}), …, (x_{n}, y_{n})), where x_{k} and y_{k} are the intensities of patterns i and j at point k. These are transformed to ranks R(x_{k}) and R(y_{k}). The Spearman test (Spearman, 1904) then gives a correlation coefficient (Press et al., 2007)

R = \frac{\sum_{k=1}^{n} [R(x_k) - \bar{R}(x)][R(y_k) - \bar{R}(y)]}{\left\{\sum_{k=1}^{n} [R(x_k) - \bar{R}(x)]^2 \sum_{k=1}^{n} [R(y_k) - \bar{R}(y)]^2\right\}^{1/2}},

where \bar{R}(x) and \bar{R}(y) are the mean ranks, \bar{R}(x) = \bar{R}(y) = (n+1)/2. R can lie between −1.0 and +1.0.
Pearson's r is a parametric linear correlation coefficient widely used in crystallography. It has a similar form to Spearman's test, except that the data values themselves, and not their ranks, are used:

r = \frac{\sum_{k=1}^{n} (x_k - \bar{x})(y_k - \bar{y})}{\left[\sum_{k=1}^{n} (x_k - \bar{x})^2 \sum_{k=1}^{n} (y_k - \bar{y})^2\right]^{1/2}},

where \bar{x} and \bar{y} are the means of the intensities taken over the full diffraction pattern. Again, r can lie between −1.0 and +1.0.
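The two coefficients can be computed directly from the definitions above. A minimal sketch follows, using hand-rolled formulas rather than a library call; the two "patterns" are hypothetical Gaussian-peak profiles, and the ranking step ignores tied intensities for simplicity. Because the second pattern is a monotonic (but nonlinear) distortion of the first, Spearman's R is exactly 1 while Pearson's r falls below 1.

```python
import numpy as np

def pearson(x, y):
    """Pearson's r from the definition: covariance over product of spreads."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def spearman(x, y):
    """Spearman's R: replace intensities by their ranks, then apply Pearson's
    formula. Note: this simple double-argsort ranking does not average ties."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return pearson(rx, ry)

# Two toy "patterns": same peak positions, nonlinearly related intensities.
two_theta = np.linspace(10, 43, 300)
pat_i = np.exp(-((two_theta - 20) / 5) ** 2) + 0.5 * np.exp(-((two_theta - 30) / 5) ** 2)
pat_j = pat_i ** 2   # monotonic distortion: ranks unchanged, values changed

r = pearson(pat_i, pat_j)
R = spearman(pat_i, pat_j)
```

The monotonic distortion leaves the rank order intact, so R = 1.0 exactly, while r < 1 reflects the nonlinearity; this is the complementary behaviour exploited in the text.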
Fig. 3.8.1 shows the use of the Pearson and Spearman correlation coefficients (Barr et al., 2004a). In Fig. 3.8.1(a) r = 0.93 and R = 0.68. The high parametric coefficient arises from the perfect match of the two biggest peaks, but the much lower Spearman coefficient acts as a warning that there are unmatched regions in the two patterns. In Fig. 3.8.1(b) the situation is reversed: r = 0.79, whereas R = 0.90, and it can be seen that there is a strong measure of association between the two patterns, although there are some discrepancies in the region 15–35°. In Fig. 3.8.1(c) r = 0.66 and R = 0.22; in this case the Spearman test is again warning of missing match regions. Thus, the use of the two coefficients acts as a valuable balance of their respective properties when processing complete patterns. The Spearman coefficient is also robust in the statistical sense, and useful in the case of preferred orientation.
Correlation coefficients are not additive, so it is invalid to average them directly; each coefficient is first transformed into the Fisher Z value

Z = \tfrac{1}{2} \ln[(1 + r)/(1 - r)].

The Z values are averaged and the mean, \bar{Z}, is transformed back to give the combined correlation coefficient

\rho_{ij} = [\exp(2\bar{Z}) - 1]/[\exp(2\bar{Z}) + 1]. \quad (3.8.3)
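The Fisher Z averaging described above can be sketched in a few lines of standard-library Python; the pair of coefficient values in the example is illustrative only (taken loosely from the Fig. 3.8.1(a) discussion).

```python
import math

def fisher_z(r):
    # Fisher transform: Z = arctanh(r) = 0.5 * ln((1 + r) / (1 - r))
    return 0.5 * math.log((1 + r) / (1 - r))

def inverse_fisher_z(z):
    # Back-transform: r = (e^{2z} - 1) / (e^{2z} + 1) = tanh(z)
    return math.tanh(z)

def combine(coeffs):
    """Average correlation coefficients in Z-space, then transform back."""
    z_mean = sum(fisher_z(r) for r in coeffs) / len(coeffs)
    return inverse_fisher_z(z_mean)

# e.g. combining a Pearson and a Spearman coefficient for one pattern pair
rho = combine([0.93, 0.68])
```

Because the Z transform stretches values near ±1, the combined coefficient differs from the naive arithmetic mean of 0.805; equal inputs are returned unchanged.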
Before performing pattern matching, some data preprocessing may be necessary. In order not to produce artefacts, this should be minimized. Typical preprocessing activities are:
After preprocessing, which needs to be carried out in an identical way for each sample, the following steps are carried out:
Using equation (3.8.3), a set of n patterns is matched with every other to give a symmetric (n × n) correlation matrix ρ with unit diagonal. The matrix ρ can be converted to a Euclidean distance matrix, d, of the same dimensions via

d_{ij} = (1 - \rho_{ij})/2,

or a distance-squared matrix, D, via

D_{ij} = d_{ij}^2 = [(1 - \rho_{ij})/2]^2; \quad (3.8.8)

for each entry i, j in d, 0 ≤ d_{ij} ≤ 1. A correlation coefficient of 1.0 translates to a distance of 0.0, a coefficient of −1.0 to a distance of 1.0, and a coefficient of zero to a distance of 0.5. There are other methods of generating a distance matrix from ρ (see, for example, Gordon, 1981, 1999), but we have found this one to be both simple and as effective as any other.
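The conversion from correlations to distances can be sketched as follows; the small 3 × 3 correlation matrix is hypothetical.

```python
import numpy as np

def distance_matrix(rho):
    """Convert a correlation matrix (unit diagonal) to distances via
    d_ij = (1 - rho_ij) / 2, so rho = 1 -> 0, rho = 0 -> 0.5, rho = -1 -> 1."""
    d = (1.0 - np.asarray(rho, dtype=float)) / 2.0
    np.fill_diagonal(d, 0.0)   # guard against rounding on the diagonal
    return d

rho = np.array([[ 1.0,  0.9, -0.2],
                [ 0.9,  1.0,  0.1],
                [-0.2,  0.1,  1.0]])
d = distance_matrix(rho)
```

The corresponding distance-squared matrix is simply `d ** 2`.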
For other purposes a dissimilarity matrix, s, is also needed, whose elements are defined via

s_{ij} = d_{ij}/d^{max},

where d^{max} is the maximum distance in matrix d. A dissimilarity matrix, δ, scaled in the same way, is also generated for use in the silhouette calculations of Section 3.8.5.1. In some cases it can be advantageous to use I^{1/2} in the distance-matrix generation; this can enhance the sensitivity of the clustering to weak peaks (Butler et al., 2019).
Cluster analysis uses d (or s, or δ) to partition the patterns into groups based on the similarity of their diffraction profiles. Associated with clustering are a number of important ancillary techniques, all of which will be discussed here. A flowchart of these methods is shown in Fig. 3.8.4.
Using d and s, agglomerative, hierarchical cluster analysis is now carried out, in which the patterns are put into clusters as defined by their distances from each other. [Gordon (1981, 1999) and Everitt et al. (2001) provide excellent and detailed introductions to the subject. Note that the two editions of Gordon's monograph are quite distinct and complementary.] The method begins with a situation in which each pattern is considered to be in a separate cluster. It then searches for the two patterns with the shortest distance between them, and joins them into a single cluster. This continues in a stepwise fashion until all the patterns form a single cluster. When two clusters (C_{i} and C_{j}) are merged, there is the problem of defining the distance between the newly formed cluster and any other cluster C_{k}. There are a number of different ways of doing this, and each gives rise to a different clustering of the patterns, although often the differences are quite small. A general algorithm has been proposed by Lance & Williams (1967), and is summarized in a simplified form by Gordon (1981). The distance from the new cluster formed by merging C_{i} and C_{j} to any other cluster C_{k} is given by

d(C_k, C_i \cup C_j) = \alpha_i d(C_k, C_i) + \alpha_j d(C_k, C_j) + \beta d(C_i, C_j) + \gamma |d(C_k, C_i) - d(C_k, C_j)|.

There are many possible clustering methods. Table 3.8.1 defines six commonly used ones in terms of the parameters \alpha_i, \alpha_j, \beta and \gamma. All these methods can be used with powder data; in general, the group-average-link or single-link formalism is the most effective, although the differences between the methods are often slight.
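The Lance–Williams recurrence can be demonstrated with a deliberately naive O(n³) implementation. Only two parameter sets are included here (single link and complete link, for which the recurrence reduces to the minimum and maximum of the two distances, respectively), and the 3 × 3 distance matrix is hypothetical; this is a sketch of the update rule, not production clustering code.

```python
import numpy as np

# Lance-Williams parameters (alpha_i, alpha_j, beta, gamma):
METHODS = {
    "single":   (0.5, 0.5, 0.0, -0.5),   # equivalent to min(d_ki, d_kj)
    "complete": (0.5, 0.5, 0.0,  0.5),   # equivalent to max(d_ki, d_kj)
}

def agglomerate(d, method="single"):
    """Naive agglomerative clustering; returns the merge history as
    (cluster_a, cluster_b, merge_distance) tuples."""
    ai, aj, beta, gamma = METHODS[method]
    d = np.asarray(d, dtype=float).copy()
    active = list(range(len(d)))
    merges = []
    while len(active) > 1:
        # Find the closest pair of active clusters.
        dij, i, j = min(((d[p, q], p, q)
                         for idx, p in enumerate(active)
                         for q in active[idx + 1:]), key=lambda t: t[0])
        merges.append((i, j, dij))
        # Lance-Williams update: distance from merged (i, j) to every other k.
        for k in active:
            if k not in (i, j):
                d[i, k] = d[k, i] = (ai * d[k, i] + aj * d[k, j]
                                     + beta * dij
                                     + gamma * abs(d[k, i] - d[k, j]))
        active.remove(j)   # the merged cluster keeps label i
    return merges

# Toy distance matrix: patterns 0 and 1 are near-identical, 2 is distant.
d0 = np.array([[0.00, 0.05, 0.60],
               [0.05, 0.00, 0.55],
               [0.60, 0.55, 0.00]])
merges = agglomerate(d0, "single")
```

For the single-link parameters the updated distance from the merged cluster {0, 1} to pattern 2 is min(0.60, 0.55) = 0.55, which is what the recurrence produces.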

The results of cluster analysis are usually displayed as a dendrogram, a typical example of which is shown in Fig. 3.8.6(a), where a set of 13 powder patterns is analysed using the centroid method. Each pattern begins at the bottom of the plot as a separate cluster, and these amalgamate in a stepwise fashion, linked by horizontal tie bars. The height of a tie bar represents a similarity measure as measured by the relevant distance. As an indication of the differences that can be expected between the various algorithms used for dendrogram generation, Fig. 3.8.6(e) shows the same data analysed using the single-link method: the resulting clustering is slightly different, the similarity measures are larger and, in consequence, the tie bars are higher on the graph. [For further examples see Barr et al. (2004b,c) and Barr, Dong, Gilmore & Faber (2004).]
An estimate of the number of clusters present in the data set is needed. In terms of the dendrogram, this is equivalent to `cutting the dendrogram', i.e. placing a horizontal line across it such that all the clusters defined by tie bars above this line remain independent and unlinked. The estimation of the number of clusters is an unsolved problem in classification methods. It is easy to see why: the problem depends on how similar the patterns need to be in order to be classed as the same, and how much variability is allowed within a cluster. We use two approaches: (a) eigenvalue analysis of the matrices ρ and A, and (b) methods based on cluster analysis.
Eigenvalue analysis is a well established technique: the eigenvalues of the relevant matrix are sorted in descending order, and the number of clusters is estimated as the number of eigenvalues needed to account for a fixed percentage (typically 95%) of the data variability. This is shown graphically via a scree plot, an example of which is shown in Fig. 3.8.2.
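The 95% criterion that underlies the scree plot can be sketched as follows; the block-structured correlation matrix is a contrived example with two obvious clusters.

```python
import numpy as np

def n_components_from_scree(matrix, threshold=0.95):
    """Number of eigenvalues needed to account for `threshold` of the total
    variability, i.e. the quantity read off a scree plot."""
    evals = np.linalg.eigvalsh(matrix)[::-1]      # sort in descending order
    evals = np.clip(evals, 0.0, None)             # guard against tiny negatives
    cumulative = np.cumsum(evals) / evals.sum()
    return int(np.searchsorted(cumulative, threshold) + 1)

# Hypothetical correlation matrix for 4 patterns forming 2 tight clusters:
rho = np.array([[1.00, 0.98, 0.02, 0.02],
                [0.98, 1.00, 0.02, 0.02],
                [0.02, 0.02, 1.00, 0.98],
                [0.02, 0.02, 0.98, 1.00]])
n_clusters = n_components_from_scree(rho)
```

Two large eigenvalues (one per block) account for 99% of the variability, so the estimate is 2; an identity matrix, with no cluster structure, would require all its eigenvalues.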
We carry out eigenvalue analysis on the following:
The most detailed study on cluster counting is that of Milligan & Cooper (1985), and is summarized by Gordon (1999). From this we have selected three tests that seem to operate effectively with powder data:
The results of tests (4)–(6) depend on the clustering method being used. To reduce the bias towards a given dendrogram method, these tests are carried out on four different clustering methods: the single-link, the group-average, the sum-of-squares and the complete-link methods. Thus there are 12 semi-independent estimates of the number of clusters from clustering methods, and three from eigenanalysis, making 15 in all.
A composite algorithm is used to combine these estimates. The maximum and minimum values of the number of clusters (c_{max} and c_{min}, respectively) given by the eigenanalysis results [(1)–(3) above] define the primary search range; tests (4)–(6) are then used in this range to find local maxima or minima as appropriate. The results are averaged, any outliers are removed and a weighted mean of the remaining indicators is taken; this is used as the final estimate of the number of clusters. Confidence levels for c are also defined by the estimates of the maximum and minimum cluster numbers after any outliers have been removed.
A typical set of results for the PXRD data from 23 powder patterns for doxazosin (an antihypertension drug) in which five polymorphs are present, as well as two mixtures of polymorphs, is shown in Fig. 3.8.2(a) and (b) (see also Table 3.8.2). The scree plot arising from the eigenanalysis of the correlation matrix indicates that 95% of the variability can be accounted for by five components, and this is shown in Fig. 3.8.2(a). Eigenvalues from other matrices indicate that four clusters are appropriate. A search for local optima in the CH, γ and C tests is then initiated in the range 2–8 possible clusters. Four different clustering methods are tried, and the results indicate a range of 4–7 clusters. There are no outliers, and the final weighted mean value of 5 is calculated. As Fig. 3.8.2(b) shows, the optimum points for the C and γ tests are often quite weakly defined (Barr et al., 2004b).

This is, in its essentials, the particle-in-a-box problem. Each powder pattern is represented as a single sphere, and these spheres are placed in a cubic box of unit dimensions such that the positions of the spheres reproduce as closely as possible the distance matrix, d, generated from correlating the patterns. The spheres have an arbitrary orientation in the box.
To do this, the (n × n) distance matrix d is used in conjunction with metric multidimensional scaling (MMDS) to define a set of p underlying dimensions that yield a Euclidean distance matrix, d^{calc}, whose elements are equivalent to or closely approximate the elements of d.
The method works as follows (Cox & Cox, 2000; Gower, 1966; Gower & Dijksterhuis, 2004).
The matrix d has zero diagonal elements, and so is not positive semidefinite. A positive semidefinite matrix, A (n × n), can be constructed, however, by computing

A = -\tfrac{1}{2} (I_n - n^{-1} i_n i_n') D (I_n - n^{-1} i_n i_n'),

where I_{n} is an (n × n) identity matrix, i_{n} is an (n × 1) vector of unities and D is defined in equation (3.8.8). The matrix (I_n - n^{-1} i_n i_n') is called a centring matrix, since A has been derived from D by centring the rows and columns.
The eigenvectors v_{1}, v_{2}, …, v_{n} and the corresponding eigenvalues λ_{1}, λ_{2}, …, λ_{n} of A are then obtained. A total of p eigenvalues of A will be positive and the remaining (n − p) will be zero. For the p nonzero eigenvalues a set of coordinates can be defined via the matrix X (n × p),

X = V_p \Lambda_p^{1/2}, \quad (3.8.16)

where V_{p} is the (n × p) matrix of eigenvectors corresponding to the p positive eigenvalues and Λ_{p} is the (p × p) diagonal matrix of those eigenvalues.
If p = 3, then we are working in three dimensions, and the X(n × 3) matrix can be used to plot each pattern as a single point in a 3D graph. This assumes that the dimensionality of the problem can be reduced in this way while still retaining the essential features of the data. As a check, a distance matrix d^{calc} can be calculated from X(n × 3) and correlated with the observed matrix d using both the Pearson and Spearman correlation coefficients. In general the MMDS method works well, and correlation coefficients greater than 0.95 are common. For large data sets this can reduce to ∼0.7, which is still sufficiently high to suggest the viability of the procedure. Parallel coordinates based on the MMDS analysis can also be used, and this is discussed in Sections 3.8.4.2.1 and 3.8.4.2.2.
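The MMDS construction above is classical metric multidimensional scaling, which can be sketched in a few lines. The four-point check at the end is a hypothetical sanity test: for exactly Euclidean input of dimension ≤ p, the recovered coordinates reproduce the input distances to machine precision, mirroring the d^{calc} versus d comparison described in the text.

```python
import numpy as np

def mmds(d, p=3):
    """Classical (metric) multidimensional scaling.

    d : (n, n) symmetric distance matrix with zero diagonal.
    Returns an (n, p) coordinate matrix X whose pairwise Euclidean
    distances approximate the entries of d."""
    d = np.asarray(d, dtype=float)
    n = len(d)
    D = d ** 2                                    # squared-distance matrix
    H = np.eye(n) - np.ones((n, n)) / n           # centring matrix
    A = -0.5 * H @ D @ H                          # double-centred matrix
    evals, evecs = np.linalg.eigh(A)
    order = np.argsort(evals)[::-1][:p]           # p largest eigenvalues
    L = np.clip(evals[order], 0.0, None)          # clip numerical negatives
    return evecs[:, order] * np.sqrt(L)           # X = V_p * Lambda_p^{1/2}

# Hypothetical check: recover the geometry of 4 points on a unit square.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
X = mmds(d, p=2)
d_calc = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
```

The recovered coordinates may be rotated or reflected relative to the originals, but the distances, which are all that the clustering uses, are preserved.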
There are occasions in which the underlying dimensionality of the data is 1 or 2, and in these circumstances the data project onto a plane or a line in an obvious way without any problems.
An example of an MMDS plot is shown in Fig. 3.8.6(b), which is linked to the dendrogram in Fig. 3.8.6(a).
It is also possible to carry out principal-component analysis (PCA) on the correlation matrix. The eigenvalues of the correlation matrix can be used to estimate the number of clusters present via a scree plot, as shown in Fig. 3.8.2(a), and the eigenvectors can be used to generate a score plot, which is an X(n × 3) matrix and can be used as a visualization tool in exactly the same way as the MMDS method to indicate which patterns belong to which class. Score plots traditionally use two components, with the data thus projected onto a plane; we use 3D plots in which three components are represented. In general, we find that the MMDS representation of the data is nearly always superior to the PCA representation for powder and spectroscopic data.
It is possible to use the MMDS plot (or, alternatively, PCA score plots) to assist in the choice of clustering method, since the two methods operate semiindependently. The philosophy here is to choose a technique that results in the tightest, most isolated clusters as follows:
Similar techniques can be used to identify the most representative sample in a cluster. This is defined as the sample that has the minimum mean distance from every other sample in the cluster, i.e. for cluster J containing m patterns, the most representative sample, i, is defined as that which gives

\min_i \left[ \frac{1}{m-1} \sum_{j \in J,\, j \neq i} d_{ij} \right].

The most representative sample is useful in visualization and can, with care, be used to create a database of known phases (Barr et al., 2004b).
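The selection of the most representative sample can be sketched as follows; the distance matrix is a toy example and the cluster membership is assumed to be known from a previous clustering step.

```python
import numpy as np

def most_representative(d, members):
    """Index of the cluster member with the minimum mean distance to all
    other members of the same cluster. `d` is the full distance matrix."""
    members = list(members)
    best, best_mean = None, np.inf
    for i in members:
        mean_d = np.mean([d[i, j] for j in members if j != i])
        if mean_d < best_mean:
            best, best_mean = i, mean_d
    return best

# Toy distance matrix: patterns 0-2 form a cluster, pattern 3 is elsewhere.
d = np.array([[0.00, 0.10, 0.20, 0.90],
              [0.10, 0.00, 0.15, 0.80],
              [0.20, 0.15, 0.00, 0.85],
              [0.90, 0.80, 0.85, 0.00]])
rep = most_representative(d, [0, 1, 2])
```

Here pattern 1 has the smallest mean distance to the other members (0.125 versus 0.15 and 0.175), so it is chosen as the representative.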
Amorphous samples are an inevitable consequence of high-throughput experiments, and need to be handled correctly if they are not to lead to erroneous indications of clustering. To identify amorphous samples the total background for each pattern is estimated and its intensity integrated; the integrated intensity of the non-background signal is then calculated. If the ratio falls below a preset limit (usually 5%, but this may vary with the type of samples under study) the sample is treated as amorphous. The distance matrix is then modified so that each amorphous sample is given a distance and dissimilarity of 1.0 from every other sample, and a correlation coefficient of zero. This automatically excludes the samples from the clustering until the last amalgamation steps, and also limits their effect on the estimation of the number of clusters (Barr et al., 2004b). Of course, the question of amorphous samples is not a binary (yes/no) one: there are usually varying degrees of amorphous content, which further complicates matters.
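A sketch of the amorphous-sample test follows, assuming a background estimate is already available from a separate background-fitting step; the background model, peak shapes and 5% limit here are all illustrative.

```python
import numpy as np

def is_amorphous(intensity, background, limit=0.05):
    """Flag a pattern as amorphous when the integrated non-background signal
    falls below `limit` of the integrated background. `background` would come
    from a background-estimation step (assumed to exist)."""
    signal = np.clip(intensity - background, 0.0, None).sum()
    return bool(signal / background.sum() < limit)

# Hypothetical 2theta grid, smooth background and two test patterns.
x = np.linspace(10, 43, 331)
background = 10.0 / (1.0 + 0.05 * (x - 10))             # decaying background
weak_peak = 3.0 * np.exp(-((x - 25.0) / 0.25) ** 2)      # barely crystalline
strong_peaks = (80.0 * np.exp(-((x - 20.0) / 0.25) ** 2)
                + 60.0 * np.exp(-((x - 30.0) / 0.25) ** 2))

amorphous = is_amorphous(background + weak_peak, background)
crystalline = is_amorphous(background + strong_peaks, background)
```

The first pattern's integrated peak area is well under 5% of the background integral and is flagged as amorphous; the second is comfortably above the limit.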
It is important when dealing with large data sets to have suitable visualization tools. These tools are also a valuable resource for exploring smaller data sets. This methodology provides four primary aids:
These aids give graphical views of the data that are semi-independent and thus can be used to check for consistency and discrepancies in the clustering. They are also interactive. No one method is optimal: a combination of mathematical and visualization techniques is required, and these techniques often need tuning for each individual application (Barr, Cunningham et al., 2009; Barr, Dong & Gilmore, 2009).
3.8.4.2. Secondary visualization using parallel coordinates, the grand tour and minimum spanning trees
In the MMDS and PCA methods p = 3 [equation (3.8.16)] so that we work in three dimensions; the X matrix can then be used to plot each pattern as a single point in a 3D graph. However, this reduces the dimensionality of the data to three, and the question arises as to the validity of this: are three dimensions sufficient? The use of parallel-coordinates plots coupled with the grand tour can assist here, as well as giving an alternative view of the data.
A parallel-coordinates plot is a graphical data-analysis technique for plotting multivariate data. Usually orthogonal axes are used for this, but in parallel-coordinates plots orthogonality is abandoned and replaced with a set of N equidistant parallel axes, one for each variable, labelled X1, X2, X3, …, XN (Inselberg, 1985, 2009; Wegman, 1990). Each observation is plotted as a point on each axis and these points are joined by a line, so that the data become a set of lines. The lines are given the colours of the clusters to which they belong as defined by the current dendrogram. A parallel-coordinates display can be interpreted as a generalization of a two-dimensional scatterplot, and it allows the display of an arbitrary number of dimensions. The method can also be used to validate the clustering itself without using dendrograms. Using this technique it is possible to determine whether the clustering shown by the MMDS (or PCA) plot in three dimensions continues in higher dimensions.
Fig. 3.8.3 shows a typical example for a set of 80 organic samples partitioned into four clusters (Barr, Dong & Gilmore, 2009). The plot shows that the clustering looks realistic when viewed in this way and that it is maintained when the data are examined in six dimensions.
The grand tour is a method of animating the parallel-coordinates plot to examine it from all possible viewpoints. Consider a 3D data plot using orthogonal axes: a grand tour takes 2D sections through these data and displays them in parallel-coordinates plots in a way that explores the entire space in a continuous fashion. The former is important because the data can be seen from all points of view, and the latter allows the user to follow the data without abrupt discontinuities. This concept was devised by Asimov (1985) and further developed by Wegman (1990). In more than three dimensions it becomes a generalized rotation of all the coordinate axes. A d-dimensional tour is a continuous geometric transformation of a d-dimensional coordinate system such that all possible orientations of the coordinate axes are eventually achieved. The algorithm for generating a smooth and complete view of the data is described by Asimov (1985).
To do this, the restriction of p = 3 in the MMDS calculation is relaxed to 6, so that there is now a 6D data set with six orthogonal axes. The choice of six is somewhat arbitrary: more can be used, but six is sufficient to see whether the clustering is maintained without generating unduly complex plots or requiring extensive computing resources. The data are plotted as a parallel-coordinates plot. The grand-tour method is then applied by a continuous geometric transformation of the 6D coordinate system such that all possible orientations of the axes are achieved. Each orientation is reproduced as a parallel-coordinates plot using six axes.
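A single "frame" of such a tour can be sketched by applying a random orthogonal rotation to the 6D coordinates (a real grand tour, following Asimov, interpolates smoothly between such orientations rather than jumping; this sketch only demonstrates that the rotations preserve the inter-point distances on which the clustering depends). The 6D coordinates here are random stand-ins for MMDS output.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(dim):
    """A uniformly random orthogonal matrix, via QR decomposition of a
    Gaussian matrix with a sign fix on the diagonal of R."""
    q, r = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q * np.sign(np.diag(r))

X = rng.normal(size=(20, 6))     # stand-in for 6D MMDS coordinates
Q = random_rotation(6)
X_rot = X @ Q                    # the same data viewed in a rotated frame
```

Each rotated frame would then be rendered as a six-axis parallel-coordinates plot; because rotations are isometries, well separated clusters remain separated in every frame, which is exactly what the grand tour lets the user verify.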
Figs. 3.8.9(j) and (k) show an example from the clustering of the 13 aspirin samples using PXRD data. Fig. 3.8.9(j) shows the default parallel-coordinates plot, and Fig. 3.8.9(k) shows an alternative view of the data taken from the grand tour. In Fig. 3.8.9(j) there appears to be considerable overlap between clusters in the 4th, 5th and 6th dimensions (X4, X5 and X6), but the alternative view given in Fig. 3.8.9(k) shows that the clustering is actually well defined in all six dimensions (Barr, Dong & Gilmore, 2009).
The minimum spanning tree (MST) displays the MMDS plot as a tree whose nodes are the points from the MMDS calculation (in three dimensions) and whose edge weights are the distances between these points. The minimum-spanning-tree problem is that of joining the points with a minimum total edge weight. (As an example, airlines use minimum spanning trees to work out their basic route systems: the best set of routes taking into account airport hubs, passenger numbers, fuel costs etc. is the minimum spanning tree.) Because a tree is used, each point is only allowed a maximum of three connections to other points.
To do this Kruskal's (1956) algorithm can be used, in which the edges are examined in order of increasing weight; an edge is added to the tree if it does not create a cycle, and is otherwise discarded. This process continues until the tree is complete. An example is shown in Fig. 3.8.7 for the 13-sample aspirin data. A complete tree for this data set using three dimensions and the MMDS-derived coordinates is shown in Fig. 3.8.7(a). This has 12 links between the 13 data points. Reducing the number of links to 10 gives Fig. 3.8.7(b).
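Kruskal's algorithm can be sketched with a union-find structure; the 4 × 4 distance matrix is hypothetical, standing in for the inter-point distances from an MMDS calculation.

```python
import numpy as np

def kruskal_mst(d):
    """Minimum spanning tree of the complete graph given by distance matrix d.
    Edges are examined in order of increasing weight and kept only if they do
    not close a cycle, which is tracked with a union-find structure."""
    n = len(d)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    edges = sorted((d[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                        # no cycle: accept this edge
            parent[rj] = ri
            tree.append((i, j, w))
        if len(tree) == n - 1:              # tree is complete
            break
    return tree

# Toy distances: two tight pairs (0,1) and (2,3), loosely connected.
d = np.array([[0.00, 0.10, 0.90, 0.80],
              [0.10, 0.00, 0.85, 0.95],
              [0.90, 0.85, 0.00, 0.20],
              [0.80, 0.95, 0.20, 0.00]])
tree = kruskal_mst(d)
```

The resulting tree keeps the two cheap within-pair edges and the single cheapest edge joining the pairs; deleting that longest edge, as in the reduced-links display of Fig. 3.8.7(b), splits the tree back into its natural clusters.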
Other techniques exist to validate the clusters, and these are discussed here.
Silhouettes (Rousseeuw, 1987; Kaufman & Rousseeuw, 1990) are a property of every member of a cluster and define a coefficient of cluster membership. To compute them, the dissimilarity matrix, δ, is used. If pattern i belongs to cluster C_{r}, which contains n_{r} patterns, we define

a_i = \frac{1}{n_r - 1} \sum_{j \in C_r,\, j \neq i} \delta_{ij}.

This defines the average dissimilarity of pattern i to all the other patterns in cluster C_{r}. Further define

b_i = \min_{s \neq r} \frac{1}{n_s} \sum_{j \in C_s} \delta_{ij},

the minimum average dissimilarity of pattern i to the members of any other cluster. The silhouette for pattern i is then

s_i = \frac{b_i - a_i}{\max(a_i, b_i)}.

Clearly −1 ≤ s_{i} ≤ 1. It is not possible to define silhouettes for clusters with only one member (singleton clusters). Silhouettes are displayed such that each cluster is represented as a histogram of frequency plotted against silhouette values, so that one can look for outliers or poorly connected patterns.
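The silhouette definitions translate directly into code; the dissimilarity matrix and cluster labels below are a toy example, and singleton clusters are assigned a value of zero.

```python
import numpy as np

def silhouettes(delta, labels):
    """Silhouette s_i = (b_i - a_i) / max(a_i, b_i) from a dissimilarity
    matrix `delta` and integer cluster labels."""
    labels = np.asarray(labels)
    s = np.zeros(len(labels))
    for i in range(len(labels)):
        same = (labels == labels[i])
        same[i] = False                    # exclude pattern i itself
        if not same.any():
            continue                       # singleton cluster: leave s_i = 0
        a = delta[i, same].mean()          # mean dissimilarity within cluster
        b = min(delta[i, labels == c].mean()        # best alternative cluster
                for c in set(labels.tolist()) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Toy dissimilarities: patterns {0, 1} and {2, 3} form two clean clusters.
delta = np.array([[0.0, 0.1, 0.8, 0.9],
                  [0.1, 0.0, 0.7, 0.8],
                  [0.8, 0.7, 0.0, 0.2],
                  [0.9, 0.8, 0.2, 0.0]])
s = silhouettes(delta, [0, 0, 1, 1])
```

All four silhouettes come out well above 0.5, the kind of values that indicate a well defined clustering in the discussion that follows.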
From our experience with powder data collected in reflection mode on both organic and inorganic samples (Barr et al., 2004b), we conclude that for any given pattern
The use of silhouettes in defining the details of the clustering is shown for the aspirin data in Fig. 3.8.8. The silhouettes for the red cluster corresponding to the dendrogram in Fig. 3.8.6(a) are shown in Fig. 3.8.8(a) and those for the corresponding orange cluster are shown in Fig. 3.8.8(b). Both sets of silhouettes have values < 0.5, which indicates that the clustering is not optimally defined. When the cut line is moved to give the dendrogram in Fig. 3.8.6(c), the silhouettes for the red cluster are shown in Fig. 3.8.8(c). The entry centred on a silhouette value of 0.15 is pattern 3. This implies that pattern 3 is only loosely connected to the cluster and this is demonstrated in Fig. 3.8.8(d) where pattern 3 and the most representative pattern for the cluster (No. 9) are superimposed. Although there is a general sense of similarity there are significant differences and the combined correlation coefficient is only 0.62. In Fig. 3.8.8(e), the silhouettes for the orange cluster are shown. They imply that this is a single cluster without outliers. The silhouettes for the green cluster corresponding to the dendrogram in Fig. 3.8.6(c) are shown in Fig. 3.8.8(f). The clustering is poorly defined here.
In standard clustering methods a set of n diffraction patterns is partitioned into c disjoint clusters. Cluster membership is defined via a membership matrix U(n × c), where the individual coefficients, u_{ik}, represent the membership of pattern i of cluster k. The coefficients are equal to unity if i belongs to cluster k and zero otherwise, i.e.

u_{ik} \in \{0, 1\} \quad \text{and} \quad \sum_{k=1}^{c} u_{ik} = 1.

If these constraints are relaxed, such that

0 \le u_{ik} \le 1

and

\sum_{k=1}^{c} u_{ik} = 1,

then fuzzy clusters are generated, in which there is the possibility that a pattern can belong to more than one cluster (see, for example, Everitt et al., 2001; Sato et al., 1966). Such a situation is quite feasible in the case of powder diffraction, for example when mixtures are involved. It is described in detail by Barr et al. (2004b).
All these techniques have been incorporated into the PolySNAP computer program (Barr et al., 2004a,b,c; Barr, Dong, Gilmore & Faber, 2004; Barr, Dong & Gilmore, 2009), which was developed from the SNAPD software (Barr, Gilmore & Paisley, 2004). PolySNAP has subsequently been incorporated into the Bruker DIFFRAC.EVA program (Bruker, 2018), and the following sections are based on its use.
All the elements for clustering and visualization are now in place. Fig. 3.8.4 shows this as a flowchart. Hitherto we have looked at elements of the aspirin data to demonstrate how methods work; we now examine the aspirin data in detail as a single analysis.
In this example we use 13 powder patterns from commercial aspirin samples collected in reflection mode on a Bruker D8 diffractometer. Since these samples include fillers, the active pharmaceutical ingredient (API) and other formulations, it is not surprising that peak widths are high: ∼0.5° full width at half maximum (FWHM). The data-collection range was 10–43° in 2θ using Cu Kα radiation. The 13 powder data sets are shown in Fig. 3.8.5 arranged into groups based on similarity. We have already described the methods of analysis and have shown typical results in Figs. 3.8.6 to 3.8.8, and now present detailed examples. The correlation matrix derived from equation (3.8.3) is shown in Fig. 3.8.9(a), colour coded to reflect the values of the coefficients; the darker the shade, the higher the correlation. The resulting dendrogram and MMDS plot are shown in Figs. 3.8.9(b) and (c), respectively. Four clusters are identified in the dendrogram and these have been appropriately coloured. Other visualization tools are now shown. In Fig. 3.8.9(d) the pie chart is displayed; the number of rows can be adjusted to reflect the arrangement of the samples in a multiple sample holder. Fig. 3.8.9(e) shows the default minimum spanning tree with 12 links. In Fig. 3.8.9(f) the scree plot indicates that three clusters will account for more than 95% of the data variability. The steep initial slope is a clear indication of good cluster estimation. The silhouettes are shown in Figs. 3.8.9(g)–(i). These were discussed in Section 3.8.5.1. In Fig. 3.8.9(j) the default parallel-coordinates plot for the same data is shown, and in Fig. 3.8.9(k) there is another view taken from the grand tour. These two plots validate the clustering and also indicate that there is no significant error introduced into the MMDS plot by truncating it into three dimensions.

[Caption to Fig. 3.8.7: The use of minimum spanning trees (MSTs). (a) The MST with 12 links. (b) The MST with 10 links; three clusters are now present.]
As a demonstration of the handling of data from amorphous samples, five patterns for amorphous samples were included in the aspirin data and the clustering calculation was repeated. The results are shown in Fig. 3.8.10. Fig. 3.8.10(a) shows the dendrogram. It can be seen that the amorphous samples are positioned as isolated clusters on the right-hand end. They also appear as an isolated cluster in the MMDS plot and the parallel-coordinates plots, as shown in Figs. 3.8.10(b) and (c). It could be argued that these samples should be treated as a single, five-membered cluster rather than five individuals, but we have found that this confuses the clustering algorithms, and it is clearer to the user if the data from amorphous samples are presented as separate classes.
Ammonium nitrate exhibits temperature-induced phase transformations. Between 256 and 305 K it crystallizes in the orthorhombic space group Pmmn with a = 5.745, b = 5.438, c = 4.942 Å and Z = 2; from 305 to 357 K it crystallizes in Pbnm with a = 7.14, b = 7.65, c = 5.83 Å and Z = 4; between 357 and 398 K it crystallizes in the tetragonal space group P4̄2₁m with a = 5.719, c = 4.932 Å and Z = 2; and above 398 K it transforms to the cubic space group Pm3̄m with a = 4.40 Å and Z = 1. PXRD data comprising 75 powder patterns, taken at intervals of 3 K starting at 203 K using a D5000 Siemens diffractometer and Cu Kα radiation over a 2θ range of 10–100°, were used (Herrmann & Engel, 1997). Fig. 3.8.11(a) shows the data in the 2θ range 17–45°.
The visualization of these data following cluster analysis is shown in Fig. 3.8.11(b) using an MMDS plot on which has been superimposed a line showing the route followed by the temperature increments. The purple line follows the transition from a mixture of forms IV and V at low temperature (red) through form IV (yellow), form II (blue) and finally form I at high temperature (green). This is an elegant and concise representation of the data in a single diagram.
Since mixtures are so common in high-throughput experiments, and indeed in many situations with multiple data sets, it is useful to have a method of automatic quantitative analysis. The quality of data that results from high-throughput crystallography makes it unlikely that an accuracy better than 5–10% can be achieved but, nonetheless, the identification of mixtures can be carried out by whole-profile matching. First a database of N pure phases is created, or, if that is not possible, then the most representative patterns with appropriate safeguards can be used. Assume that there is a sample pattern, S, which is considered to be a mixture of up to N components. S comprises m data points, S_{1}, S_{2}, …, S_{m}. The N patterns can be considered to make up fractions p_{1}, p_{2}, p_{3}, …, p_{N} of the sample pattern. The best possible combination of the database patterns to fit the sample pattern is required. A system of linear equations can be constructed in which x_{11} is measurement point 1 of pattern 1 etc.:

p_1 x_{11} + p_2 x_{21} + \ldots + p_N x_{N1} = S_1,
p_1 x_{12} + p_2 x_{22} + \ldots + p_N x_{N2} = S_2,
\vdots
p_1 x_{1m} + p_2 x_{2m} + \ldots + p_N x_{Nm} = S_m.
Writing these in matrix form, we get

X p = S, \quad (3.8.29)

where X is the (m × N) matrix of database patterns, p is the vector of fractions and S is the sample pattern. A solution for p that minimizes

\| X p - S \|^2

is required. Since m \gg N, the system is heavily overdetermined, and least squares or singular value decomposition can be used to solve (3.8.29) for the fractional contributions, s_{1}, s_{2}, …, s_{N}, arising from the scattering power of the components of the mixture. The values of s can be used to calculate a weight fraction for each phase provided that the atomic absorption coefficients are known, and this in turn requires the unit-cell dimensions and cell contents, but not the atomic coordinates (Smith et al., 1988; Cressey & Schofield, 1996). The general formula for the weight fraction of component n in a mixture comprising N components is (Leroux et al., 1953)

W_n = \frac{s_n / \mu_n^*}{\sum_{j=1}^{N} s_j / \mu_j^*},

where

\mu_j^* = \mu_j / \rho_j,

and where μ_{j} is the atomic X-ray absorption coefficient and ρ_{j} is the density of component j. For polymorphs, the absorption coefficients are sufficiently close and the method sufficiently approximate that the effects of absorption can be ignored.
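The least-squares step can be sketched with synthetic data; the "database" patterns here are random positive curves rather than real phases, the sample is built as a noiseless 60/40 mixture of the first two, and the absorption correction for converting scattering fractions to weight fractions is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical database of three pure-phase patterns, one per column of X,
# each sampled at m points (m >> N, so the system is heavily overdetermined).
m, N = 500, 3
X = np.abs(rng.normal(size=(m, N)))

# Sample pattern: a 60/40 mixture of phases 0 and 1, none of phase 2.
p_true = np.array([0.6, 0.4, 0.0])
S = X @ p_true

# Solve X p = S in the least-squares sense, minimizing |X p - S|^2.
p, *_ = np.linalg.lstsq(X, S, rcond=None)
fractions = p / p.sum()    # normalized scattering-power fractions
```

With noiseless input the fractions are recovered exactly; with real high-throughput data the same solve yields the 5–10% accuracy discussed above, and singular value decomposition can be substituted when X is poorly conditioned.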
As an example, a set of 19 patterns from set 78 of the ICDD database for inorganic compounds (ICDD, 2018) was imported into DIFFRAC.EVA. To this was added some simulated mixture data generated by adding the patterns for lanthanum strontium copper oxide and caesium thiocyanate in the proportions 80/20, 60/40, 50/50, 40/60 and 20/80. Two calculations were performed: an analysis without the purephase database and a second where the pure phases of lanthanum strontium copper oxide and caesium thiocyanate were present.
The results are shown in Fig. 3.8.12. In the MMDS plot the green spheres represent pure lanthanum strontium copper oxide and the yellow spheres pure caesium thiocyanate. The red spheres represent mixtures of the two, and form an arc between the green and yellow clusters. The distance of each mixture sphere from the lanthanum strontium copper oxide and caesium thiocyanate spheres gives a semi-quantitative representation of its composition. Running the analysis in quantitative mode gives the pie charts also shown in Fig. 3.8.12; they reproduce the relative proportions of the components exactly.
For further details of this method with organic samples, see Dong et al. (2008).
There is no reason why the methodology described in this chapter cannot be used for other 1D data sets, e.g. Raman, IR, NMR and near-IR spectroscopies, although different data preprocessing is usually required. Raman spectroscopy is well suited to high-throughput screening: good-quality spectra can be collected in a few minutes, and sample preparation is straightforward and flexible, although the resulting spectra are not always as distinct as the PXRD equivalents (Mehrens et al., 2005; Boccaleri et al., 2007).
As an example we show the results of cluster analysis carried out on samples of carbamazepine, cimetidine, furosemide, mefenamic acid, phenylbutazone and sulfamerazine using Raman spectroscopy. A total of 74 samples were measured at room temperature on a LabRam HR800/HTS-Multiwell spectrometer equipped with a backscattering light path, a diode laser (785 nm, 300 mW) as the excitation source and an air-cooled charge-coupled device detector. A 20× super-long-working-distance objective lens was used to collect the backscattered light. The spectra were acquired with a 5.84 cm^{−1} spectral width and at least 30 s exposure (Kojima et al., 2006). The spectra had backgrounds subtracted, but no other corrections were applied.
The initial clustering is shown in Fig. 3.8.13(a) with the default cut level in the dendrogram. There are six clusters: labelling from the left-hand side, the red cluster contains three polymorphs of carbamazepine; the orange contains cimetidine; the green contains three polymorphs of furosemide; the light blue contains three polymorphs of mefenamic acid; the dark blue contains phenylbutazone; and finally the purple cluster contains sulfamerazine. The MMDS plot gives a complementary visualization of the data that supports the clustering.
It is also possible to use derivative data in place of the original spectra for clustering. The results of doing this for the 74 Raman spectra, without initial background subtraction but followed by the generation of first-derivative data, are shown in Fig. 3.8.14. The clusters are well defined, but now the carbamazepine data have split into two clusters. These correspond to forms I and III of carbamazepine, although the differences between the Raman spectra of these polymorphs are small (O'Brien et al., 2004). At the same time, furosemide and mefenamic acid are each split into two groups. This is probably the best description of the data in terms of clustering, with cluster membership corresponding to the chemical differences between the samples. The dendrogram also has the feature that the tie bars between samples are higher, i.e. the similarities are lower, reflecting the fact that the use of first derivatives accentuates small differences in the data.
It is interesting to note that, in general, PXRD works less well with derivative data. The reason for this is not clear, but the presence of partially overlapping peaks and the associated issues of peak shape are possibly partly responsible.
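The derivative preprocessing is easy to sketch. Here, as an illustration only, a Pearson-correlation distance is computed between the first derivatives of two spectra sampled on a common grid; the function names are ours, and this single measure stands in for the combined similarity measures used by the methods in this chapter.

```python
import numpy as np

def first_derivative(y):
    """First derivative of a spectrum sampled on a uniform grid."""
    return np.gradient(np.asarray(y, dtype=float))

def derivative_distance(ya, yb):
    """Correlation distance between the first derivatives of two spectra.
    Differentiation accentuates small differences between similar patterns,
    which is why the tie bars in the dendrogram rise (see text)."""
    da, db = first_derivative(ya), first_derivative(yb)
    return 1.0 - np.corrcoef(da, db)[0, 1]
```

Identical spectra give a distance of zero; spectra with displaced peaks give distances approaching one (uncorrelated derivatives) or more (anti-correlated).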
It is now common to collect more than one data type, and some instruments exist for collecting spectroscopic and PXRD data on the same samples, for example the Bruker D8 Screenlab, which combines PXRD and Raman measurements for high-throughput screening (Boccaleri et al., 2007).
A technique for combining the results of more than one data type is needed. One method would be to take individual distance matrices from each data type and generate an average distance matrix using equation (3.8.3), but this leaves open the question of how best to define the associated weights in an optimal, objective way. Should, for example, PXRD be given a higher weight than Raman data? The individual differences scaling method (INDSCAL) of Carroll & Chang (1970) provides an unbiased solution to this problem by, as the name suggests, scaling the differences between individual distance matrices.
In this method, let D_{k} be the squared-distance matrix of dimension (n × n) for data type k, with a total of K data types. For example, if we have PXRD, Raman and differential scanning calorimetry (DSC) data for each of n samples, then K = 3. A group-average matrix G (which we will specify in two dimensions) is required that best represents the combination of the K data types. To do this, the D matrices are first put into inner-product form by the double-centring operation to give

B_{k} = −(1/2)JD_{k}J,

where J = I − 11′/n is the centring matrix, I is the identity matrix and 1 is a column vector of ones. The inner-product matrices thus generated are matched to the weighted form of the group average, G, which is unknown. To do this the function

S = Σ_{k=1}^{K} |B_{k} − GW_{k}G′|^{2} (3.8.7)

is minimized. The diagonal weight matrices, W_{k}, are scaled such that

Σ_{k=1}^{K} W_{k} = KI.

The INDSCAL method employs an iterative technique to solve equation (3.8.7) in which one set of parameters is kept fixed whilst the other is determined by least-squares refinement. An initial estimate for G is taken either from the average of the D matrices or as a random matrix. This is then used to estimate the weight matrices, and the whole process is repeated until a minimum value of S is obtained. The algorithm derived by Carroll and Chang was used in the example below. When random matrices are used to generate the initial G matrix, the INDSCAL procedure is repeated 100 times and the solution with the minimum value of S is kept. In practice, there is very little difference in the results of these two procedures. The resulting G matrix is used as a standard distance matrix in the usual way to generate dendrograms, MMDS plots etc. The method has the property that where the data types show samples to be very similar this is reinforced, whereas where there are considerable variations the differences are accentuated in the final G matrix.
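The fitting step can be sketched as follows. This is a minimal illustration, not the Carroll–Chang implementation: the weight normalization and the 100 random restarts are omitted, and a general-purpose minimizer (SciPy's L-BFGS-B) stands in for the alternating least-squares refinement; the function names are ours.

```python
import numpy as np
from scipy.optimize import minimize

def double_centre(D):
    """Inner-product form B = -(1/2) J D J of a squared-distance matrix,
    with J = I - 11'/n the centring matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return -0.5 * J @ D @ J

def indscal(Ds, dim=2):
    """Fit B_k ~ G W_k G' (W_k diagonal) to the double-centred matrices
    derived from the squared-distance matrices in Ds."""
    Bs = [double_centre(np.asarray(D, dtype=float)) for D in Ds]
    K, n = len(Bs), Bs[0].shape[0]

    def unpack(x):
        G = x[:n * dim].reshape(n, dim)
        W = x[n * dim:].reshape(K, dim)   # row k holds diag(W_k)
        return G, W

    def stress(x):
        G, W = unpack(x)
        # (G * w) @ G.T is G diag(w) G' via column-wise broadcasting
        return sum(np.sum((B - (G * w) @ G.T) ** 2) for B, w in zip(Bs, W))

    # start G from the spectral decomposition of the average inner-product matrix
    vals, vecs = np.linalg.eigh(sum(Bs) / K)
    G0 = vecs[:, -dim:] * np.sqrt(np.maximum(vals[-dim:], 0.0))
    x0 = np.concatenate([G0.ravel(), np.ones(K * dim)])
    res = minimize(stress, x0, method='L-BFGS-B')
    G, W = unpack(res.x)
    return G, W, res.fun
```

The returned G plays the role of the group-average configuration from which a standard distance matrix, and hence dendrograms and MMDS plots, can be generated.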
For a fuller description of the INDSCAL method with examples see Gower & Dijksterhuis (2004), Section 13.2, and for a useful geometric interpretation see Husson & Pagès (2006).
We now present an example of the INDSCAL method applied to data collected on sulfathiazole using PXRD and Raman spectroscopy (Barr, Cunningham et al., 2009). A flowchart is shown in Fig. 3.8.15. Three polymorphs of sulfathiazole were prepared and PXRD data were collected on a Bruker C2 GADDS system. Each sample was run for 2 min over a 3–30° range in 2θ using Cu Kα radiation. Raman data were collected on a Bruker SENTINEL. The Raman probe was integrated into the PXRD instrument.

A flowchart for the INDSCAL method using Raman and PXRD data. Note that any combination of any 1D data can be used here. 
The only data preprocessing performed was background removal. Fig. 3.8.16(a) shows the resulting dendrogram (with the default cut level) and Fig. 3.8.16(b) shows the corresponding MMDS plot. Each sample is identified by a four-digit code: the first two digits give the well number, and the last digit indicates whether the sample is form 2, 3 or 4 of sulfathiazole. It can be seen that the clustering is only partly successful: form 4 (red) is correctly clustered, but form 3 (orange) is spread over five clusters and form 2 over three clusters.
Fig. 3.8.16(c) shows the clustering from the Raman spectra. The results are poor: most of form 2 is correctly clustered, but forms 4 and 3 are intermixed, and the MMDS plot in Fig. 3.8.16(d) is diffuse with little structure.
The INDSCAL method is now applied starting from random G matrices and the results are shown in Fig. 3.8.16(e) and (f) with the dendrogram cut level at its default value. The clustering is almost correct; all the samples are placed in the correct groups except that there are two outliers coloured in blue. Fig. 3.8.16(g) shows the Raman patterns for these samples: they are primarily background with very little usable signal.
Quality control (Gilmore, Barr & Paisley, 2009) is designed for situations where the stability of a material is being monitored over time, for example as part of a production-line system, or for periodic equipment alignment. A set of reference patterns is collected that represents acceptable measurements – any measurement sufficiently close to these references represents a good measurement. Various sample patterns are then imported and compared with the reference patterns, and any that vary significantly from the ideal are noted and highlighted.
The results are best displayed graphically using a variant of the MMDS method, an example of which is shown in Fig. 3.8.17. The reference patterns define a green shaded surface; acceptable sample patterns, coloured red, appear within it, and potentially problematic sample patterns appear outside it. The volume of the green shape is defined by intersecting spheres around each reference sample, and these can be altered to allow more- or less-stringent quality control.
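The intersecting-spheres acceptance test reduces to a simple computation: a sample's MMDS coordinates are acceptable if they lie within a chosen radius of at least one reference point. The sketch below is illustrative (our function name, and a single common radius rather than per-reference radii):

```python
import numpy as np

def within_reference_volume(refs, sample, radius):
    """True if the sample's MMDS coordinates lie inside the union of
    spheres of the given radius centred on the reference points.
    refs: (r x 3) array of reference coordinates; sample: length-3 vector."""
    refs = np.asarray(refs, dtype=float)
    dists = np.linalg.norm(refs - np.asarray(sample, dtype=float), axis=1)
    return bool(np.min(dists) <= radius)
```

Tightening or relaxing `radius` corresponds to shrinking or enlarging the green volume in Fig. 3.8.17, i.e. to more- or less-stringent quality control.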
These calculations can be carried out using MATLAB (http://www.mathworks.co.uk/products/matlab/) or the open-source R software (http://www.r-project.org/; Crawley, 2007), with graphics using the GGobi software (Cook & Swayne, 2007). There are four commercial packages for handling powder data: DIFFRAC.EVA and PolySNAP 3, both from Bruker (http://www.bruker.com/; Gilmore, Barr & Paisley, 2004; Barr, Dong & Gilmore, 2009), Jade from Materials Data Inc. (http://www.materialsdata.com) and HighScore from Malvern PANalytical (http://www.panalytical.com/).
Acknowledgements
Data were kindly provided by Gordon Cunningham at Glasgow University, Arnt Kern of Bruker AXS in Karlsruhe and Michael Herrmann at the Fraunhofer Institute, Germany.
References
Alvarez, A. J., Singh, A. & Myerson, A. S. (2009). Polymorph screening: comparing a semi-automated approach with a high throughput method. Cryst. Growth Des. 9, 4181–4188.
Asimov, D. (1985). The grand tour: a tool for viewing multidimensional data. SIAM J. Sci. Stat. Comput. 6, 128–143.
Barr, G., Cunningham, G., Dong, W., Gilmore, C. J. & Kojima, T. (2009). High-throughput powder diffraction V: the use of Raman spectroscopy with and without X-ray powder diffraction data. J. Appl. Cryst. 42, 706–714.
Barr, G., Dong, W. & Gilmore, C. J. (2004a). High-throughput powder diffraction. II. Applications of clustering methods and multivariate data analysis. J. Appl. Cryst. 37, 243–252.
Barr, G., Dong, W. & Gilmore, C. J. (2004b). High-throughput powder diffraction. IV. Cluster validation using silhouettes and fuzzy clustering. J. Appl. Cryst. 37, 874–882.
Barr, G., Dong, W. & Gilmore, C. J. (2004c). PolySNAP: a computer program for analysing high-throughput powder diffraction data. J. Appl. Cryst. 37, 658–664.
Barr, G., Dong, W. & Gilmore, C. J. (2009). PolySNAP3: a computer program for analysing and visualizing high-throughput data from diffraction and spectroscopic sources. J. Appl. Cryst. 42, 965–974.
Barr, G., Dong, W., Gilmore, C. & Faber, J. (2004). High-throughput powder diffraction. III. The application of full-profile pattern matching and multivariate statistical analysis to round-robin-type data sets. J. Appl. Cryst. 37, 635–642.
Barr, G., Gilmore, C. J. & Paisley, J. (2004). SNAP-1D: a computer program for qualitative and quantitative powder diffraction pattern analysis using the full pattern profile. J. Appl. Cryst. 37, 665–668.
Boccaleri, E., Carniato, F., Croce, G., Viterbo, D., van Beek, W., Emerich, H. & Milanesio, M. (2007). In situ simultaneous Raman/high-resolution X-ray powder diffraction study of transformations occurring in materials at non-ambient conditions. J. Appl. Cryst. 40, 684–693.
Bruker (2018). DIFFRAC.EVA: software to evaluate X-ray diffraction data. Version 4.3. https://www.bruker.com/eva.
Butler, B. M., Sila, A., Nyambura, K. D., Gilmore, C. J., Kourkoumelis, N. & Hillier, S. (2019). Pre-treatment of soil X-ray powder diffraction data for cluster analysis. Geoderma, 337, 413–424.
Caliński, T. & Harabasz, J. (1974). A dendrite method for cluster analysis. Commun. Stat. 3, 1–27.
Carroll, J. D. & Chang, J. J. (1970). Analysis of individual differences in multidimensional scaling via an n-way generalization of `Eckart–Young' decomposition. Psychometrika, 35, 283–319.
Cook, D. & Swayne, D. F. (2007). Interactive and Dynamic Graphics for Data Analysis with R and GGobi. New York: Springer.
Cox, T. F. & Cox, M. A. A. (2000). Multidimensional Scaling, 2nd ed. Boca Raton: Chapman & Hall/CRC.
Crawley, M. J. (2007). The R Book. Chichester: Wiley.
Cressey, G. & Schofield, P. F. (1996). Rapid whole-pattern profile-stripping method for the quantification of multiphase samples. Powder Diffr. 11, 35–39.
Dong, W., Gilmore, C., Barr, G., Dallman, C., Feeder, N. & Terry, S. (2008). A quick method for the quantitative analysis of mixtures. 1. Powder X-ray diffraction. J. Pharm. Sci. 97, 2260–2276.
Donoho, D. L. & Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. J. Am. Stat. Assoc. 90, 1200–1224.
Everitt, B. S., Landau, S. & Leese, M. (2001). Cluster Analysis, 4th ed. London: Arnold.
Gilmore, C. J., Barr, G. & Paisley, W. (2004). High-throughput powder diffraction. I. A new approach to qualitative and quantitative powder diffraction pattern analysis using full pattern profiles. J. Appl. Cryst. 37, 231–242.
Goodman, L. A. & Kruskal, W. H. (1954). Measures of association for cross-classifications. J. Am. Stat. Assoc. 49, 732–764.
Gordon, A. D. (1981). Classification, 1st ed., pp. 46–49. London: Chapman and Hall.
Gordon, A. D. (1999). Classification, 2nd ed. Boca Raton: Chapman and Hall/CRC.
Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53, 325–328.
Gower, J. C. & Dijksterhuis, G. B. (2004). Procrustes Problems. Oxford University Press.
Herrmann, M. J. & Engel, W. (1997). Phase transitions and lattice dynamics of ammonium nitrate. Propellants Explos. Pyrotech. 22, 143–147.
Husson, F. & Pagès, J. (2006). INDSCAL model: geometrical interpretation and methodology. Comput. Stat. Data Anal. 50, 358–378.
ICDD (2018). The Powder Diffraction File. International Centre for Diffraction Data, 12 Campus Boulevard, Newtown Square, Pennsylvania 19073-3273, USA.
Inselberg, A. (1985). The plane with parallel coordinates. Vis. Comput. 1, 69–91.
Inselberg, A. (2009). Parallel Coordinates. Visual Multidimensional Geometry and its Applications. New York: Springer.
Kaufman, L. & Rousseeuw, P. J. (1990). Finding Groups in Data. New York: Wiley.
Kojima, T., Onoue, S., Murase, N., Katoh, F., Mano, T. & Matsuda, Y. (2006). Crystalline form information from multiwell plate salt screening by use of Raman microscopy. Pharm. Res. 23, 806–812.
Kruskal, J. B. (1956). On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Am. Math. Soc. 7, 48–50.
Lance, G. N. & Williams, W. T. (1967). A general theory of classificatory sorting strategies: 1. Hierarchical systems. Comput. J. 9, 373–380.
Leroux, J., Lennox, D. H. & Kay, K. (1953). Direct quantitative X-ray analysis by diffraction absorption technique. Anal. Chem. 25, 740–743.
Mehrens, S. M., Kale, U. J. & Qu, X. (2005). Statistical analysis of differences in the Raman spectra of polymorphs. J. Pharm. Sci. 94, 1354–1367.
Milligan, G. W. & Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159–179.
Nelder, J. A. & Mead, R. (1965). A simplex method for function minimization. Comput. J. 7, 308–313.
O'Brien, L. E., Timmins, P., Williams, A. C. & York, P. (2004). Use of in situ FT-Raman spectroscopy to study the kinetics of the transformation of carbamazepine polymorphs. J. Pharm. Biomed. Anal. 36, 335–340.
Ogden, R. T. (1997). Essential Wavelets for Statistical Applications and Data Analysis, pp. 144–148. Boston: Birkhäuser.
Press, W. H., Teukolsky, S. A., Vetterling, W. T. & Flannery, B. P. (2007). Numerical Recipes, 3rd ed. Cambridge University Press.
Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65.
Sato, M., Sato, Y. & Jain, L. C. (1997). Fuzzy Clustering Models and Applications. New York: Physica-Verlag.
Smith, D. K., Johnson, G. G. & Wims, A. M. (1988). Use of full diffraction spectra, both experimental and calculated, in quantitative powder diffraction analysis. Aust. J. Phys. 41, 311–321.
Smrčok, Ľ., Ďurík, M. & Jorík, V. (1999). Wavelet denoising of powder diffraction patterns. Powder Diffr. 14, 300–304.
Spearman, C. (1904). The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101.
Storey, R., Docherty, R., Higginson, P., Dallman, C., Gilmore, C., Barr, G. & Dong, W. (2004). Automation of solid form screening procedures in the pharmaceutical industry – how to avoid the bottlenecks. Crystallogr. Rev. 10, 45–56.
Wegman, E. J. (1990). Hyperdimensional data analysis using parallel coordinates. J. Am. Stat. Assoc. 85, 664–675.