International
Tables for
Crystallography
Volume H
Powder diffraction
Edited by C. J. Gilmore, J. A. Kaduk and H. Schenk

International Tables for Crystallography (2018). Vol. H, ch. 3.8, pp. 325-343
https://doi.org/10.1107/97809553602060000953

Chapter 3.8. Clustering and visualization of powder-diffraction data

C. J. Gilmore, G. Barr and W. Dong*

Department of Chemistry, University of Glasgow, University Avenue, Glasgow G12 8QQ, UK
*Correspondence e-mail: chris@chem.gla.ac.uk

The use of clustering to classify powder X-ray diffraction patterns is presented, together with the associated visualization techniques. The methods provide a straightforward way to handle large volumes of data.

3.8.1. Introduction

In high-throughput crystallography, crystallization experiments using robotics coupled with automatic sample changers and two-dimensional (2D) detectors can generate and measure over 1000 powder-diffraction patterns on a series of related compounds, often polymorphs or salts, in a day (Storey et al., 2004[link]). It is also possible to simultaneously measure spectroscopic data, especially Raman (Alvarez et al., 2009[link]). The analysis of these patterns poses a difficult statistical problem: a need to classify the data by putting the samples into clusters based on diffraction-pattern similarity so that unusual samples can be readily identified. At the same time, suitable visualization tools to help in the data-classification process are required; the techniques of classification and visualization go hand-in-hand. Interestingly, the techniques developed for large data sets with poor-quality data also have great value when looking at smaller data sets, and the visualization tools developed for high-throughput studies are especially useful when looking at phase transitions, mixtures etc.

In this chapter the methods for comparing whole patterns will be described. The mathematics of cluster analysis will then be explained, followed by a discussion of the associated visualization tools. Examples using small data sets from pharmaceuticals, inorganics and phase transitions will be given; the techniques used can be readily scaled up for handling large, high-throughput data sets. The same methods also work for spectroscopic data and the use of such information with and without powder X-ray diffraction (PXRD) data will be discussed. Finally, the use of visualization tools in quality control is demonstrated.

3.8.2. Comparing 1D diffraction patterns

Comparing 1D diffraction patterns or spectra cannot be done by simply using the peaks and their relative intensities for a number of reasons:

  • (1) The accurate determination of peak positions may be difficult, especially in cases where peak overlap occurs or there is significant peak asymmetry.

  • (2) The hardware and the way in which the sample is prepared can also affect the d-spacing (or 2θ value) that is recorded for the peak. Shoulders to main peaks and broad peaks can also be problematic.

  • (3) There is a subjective element to deciding how many peaks there are in the pattern, especially for weak peaks and noisy data.

  • (4) Weak peaks may be discarded. This can affect the quantitative analysis of mixtures if one component diffracts weakly or is present only in small amounts.

  • (5) Differences in sample preparation and instrumentation can lead to significant differences in the powder-diffraction patterns of near-identical samples.

  • (6) Preferred orientation may be present: this is a very difficult and common problem.

  • (7) The reduction of the pattern to point functions can also make it difficult to design effective algorithms.

In order to use the information contained within the full profile, algorithms are required that utilize each measured data point in the analysis. We use two correlation coefficients for the purpose of comparing PXRD patterns: the Pearson and the Spearman coefficients.

3.8.2.1. Spearman's rank order coefficient

Consider two diffraction patterns, i and j, each with n measured points ((x1, y1), …, (xn, yn)). These are transformed to ranks R(xk) and R(yk). The Spearman test (Spearman, 1904[link]) then gives a correlation coefficient (Press et al., 2007[link]),[{R_{ij}} = {{\displaystyle\sum\limits_{k = 1}^n {R({x_k})R({y_k}) - n{{\left(\displaystyle{{{n + 1}\over 2}} \right)}^2}} } \over {{{\left({\displaystyle\sum\limits_{k = 1}^n {R{{({x_k})}^2} - n{{\left(\displaystyle{{{n + 1}\over 2}} \right)}^2}} } \right)}^{1/2}}{{\left({\displaystyle\sum\limits_{k = 1}^n {R{{({y_k})}^2} - n{{\left(\displaystyle{{{n + 1}\over 2}} \right)}^2}} } \right)}^{1/2}}}}, \eqno(3.8.1)]where [ - 1\, \le \,{R_{ij}}\, \le \,1].

3.8.2.2. Pearson's r coefficient

Pearson's r is a parametric linear correlation coefficient widely used in crystallography. It has a similar form to Spearman's test, except that the data values themselves, and not their ranks, are used:[{r_{ij}} = {{\sum\limits_{k = 1}^n {\left({{x_k} - \overline x } \right)} \left({{y_k} - \overline y } \right)} \over {{{\left [{\sum\limits_{k = 1}^n {{{\left({{x_k} - \,\overline x } \right)}^2}} \sum\limits_{k = 1}^n {{{\left({{y_k} - \,\overline y } \right)}^2}} } \right]}^{1/2}}}}, \eqno(3.8.2)]where [\overline x] and [\overline y] are the means of intensities taken over the full diffraction pattern. Again, r can lie between −1.0 and +1.0.
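As a concrete illustration of the two coefficients, the following Python sketch (the helper names are ours and purely illustrative) computes r directly from equation (3.8.2) and R by applying the same formula to mid-ranks, so that tied intensities, which are common once weak data are thresholded, are handled as the rank formulation requires.

```python
import numpy as np

def ranks(a):
    # mid-ranks: tied values share the mean of the ranks they occupy
    order = np.argsort(a, kind="mergesort")
    r = np.empty(len(a), float)
    r[order] = np.arange(1, len(a) + 1)
    for v in np.unique(a):
        mask = a == v
        r[mask] = r[mask].mean()
    return r

def pearson_r(x, y):
    # Pearson's r, equation (3.8.2): correlation of the raw intensities
    x = np.asarray(x, float); y = np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def spearman_R(x, y):
    # Spearman's R: the same correlation applied to the ranks
    return pearson_r(ranks(np.asarray(x, float)), ranks(np.asarray(y, float)))
```

For a strictly monotonic but nonlinear relationship between two profiles R is exactly 1 while r falls below 1; it is this kind of imbalance between the two coefficients that is exploited in Fig. 3.8.1.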

Fig. 3.8.1[link] shows the use of the Pearson and Spearman correlation coefficients (Barr et al., 2004a[link]). In Fig. 3.8.1[link](a) r = 0.93 and R = 0.68. The high parametric coefficient arises from the perfect match of the two biggest peaks, but the much lower Spearman coefficient acts as a warning that there are unmatched regions in the two patterns. In Fig. 3.8.1[link](b) the situation is reversed: r = 0.79, whereas R = 0.90, and it can be seen that there is a strong measure of association between the two patterns, although there are some discrepancies in the region 15–35°. In Fig. 3.8.1[link](c) r = 0.66 and R = 0.22; in this case the Spearman test is again warning of missing match regions. Thus, the use of the two coefficients acts as a valuable balance of their respective properties when processing complete patterns. The Spearman coefficient is also robust in the statistical sense and useful in the case of preferred orientation.

[Figure 3.8.1]

Figure 3.8.1

The use of the Pearson (r) and Spearman (R) correlation coefficients to quantitatively match powder patterns: (a) r = 0.93, R = 0.68; (b) r = 0.79, R = 0.90; (c) r = 0.66, R = 0.22.

3.8.2.3. Combining the correlation coefficients

Correlation coefficients are not additive, so it is invalid to average them directly; they need to be transformed into the Fisher Z value to give[{\rho _{ij}}\, = \,\tanh \left [{\left({{{\tanh }^{ - 1}}{R_{ij}}\, + \,{{\tanh }^{ - 1}}{r_{ij}}} \right)/2} \right]. \eqno(3.8.3)]
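A minimal sketch of this combination rule, with the inverse hyperbolic tangent written as math.atanh (illustrative naming):

```python
import math

def combine_fisher(r_pearson, r_spearman):
    # equation (3.8.3): transform each coefficient to Fisher's Z (= artanh),
    # average the Z values, then transform back with tanh
    z = (math.atanh(r_spearman) + math.atanh(r_pearson)) / 2.0
    return math.tanh(z)
```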

3.8.2.4. Full-profile qualitative pattern matching

Before performing pattern matching, some data pre-processing may be necessary. In order not to produce artefacts, this should be minimized. Typical pre-processing activities are:

  • (1) The data are normalized such that the maximum peak intensity is 1.0.

  • (2) The patterns need to be interpolated if necessary to have common increments in 2θ. High-order polynomials using Neville's algorithm can be used for this (Press et al., 2007[link]).

  • (3) If backgrounds are large they should be removed. High-throughput data are often very noisy because of low counting times and the sample itself. If this is the case, smoothing of the data can be carried out. The SURE (Stein's Unbiased Risk Estimate) thresholding procedure (Donoho & Johnstone, 1995[link]; Ogden, 1997[link]) employing wavelets is ideal for this task since it does not introduce potentially damaging artefacts, for example ringing around peaks (Barr et al., 2004a[link]; Smrčok et al., 1999[link]).

After pre-processing, which needs to be carried out in an identical way for each sample, the following steps are carried out:

  • (1) The intersecting 2θ range of the two data sets is calculated, and each of the pattern correlation coefficients is calculated using only this region.

  • (2) A minimum intensity is set, below which profile data are set to zero. This reduces the contribution of background noise to the matching process without reducing the discriminating power of the method. We usually set this to 0.1Imax as a default, where Imax is the maximum measured intensity.

  • (3) The Pearson correlation coefficient is calculated.

  • (4) The Spearman R is computed in the same way.

  • (5) An overall ρ value is calculated using (3.8.3)[link].

  • (6) A shift in 2θ values between patterns is often observed, arising from equipment settings and data-collection protocols. Three possible simple corrections are[\Delta \left({2\theta } \right) = {a_0} + {a_1}\cos \theta, \eqno(3.8.4)]which corrects for the zero-point error via the a0 term and, via the a1 cos θ term, for varying sample heights in reflection mode, or[\Delta \left({2\theta } \right) = {a_0} + {a_1}\sin \theta, \eqno(3.8.5)]which corrects for transparency errors, for example, and[\Delta \left({2\theta } \right) = {a_0} + {a_1}\sin 2\theta, \eqno(3.8.6)]which provides transparency coupled with thick specimen error corrections, where a0 and a1 are constants that can be determined by shifting patterns to maximize their overlap as measured by ρ. It is difficult to obtain suitable expressions for the derivatives [\partial {\rho_{ij}}/\partial {a_0}] and [\partial {\rho_{ij}}/\partial {a_1}] for use in the optimization, so we use the downhill simplex method (Nelder & Mead, 1965[link]), which does not require their calculation.
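The shift correction in step (6) can be sketched as follows. This illustrative Python fragment (all function names are ours) applies the zero-point/sample-height correction of equation (3.8.4) and, for brevity, scores the overlap with the Pearson coefficient only and replaces the downhill simplex with a coarse grid search over a0 and a1:

```python
import numpy as np

def pearson(x, y):
    # plain Pearson r between two equally sampled profiles
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def best_shift(two_theta, pat_i, pat_j, a0_grid, a1_grid):
    """Apply delta(2theta) = a0 + a1*cos(theta) (equation 3.8.4) to pattern j
    and return (correlation, a0, a1) for the best alignment with pattern i.
    A coarse grid search stands in for the simplex optimization of the text."""
    theta = np.radians(two_theta / 2.0)  # 2theta in degrees -> theta in radians
    best = (-2.0, 0.0, 0.0)
    for a0 in a0_grid:
        for a1 in a1_grid:
            shifted = two_theta + a0 + a1 * np.cos(theta)
            # resample the shifted pattern back onto the common 2theta grid
            pj = np.interp(two_theta, shifted, pat_j)
            c = pearson(pat_i, pj)
            if c > best[0]:
                best = (c, a0, a1)
    return best
```

For two patterns differing only by a small zero-point error, the recovered a0 is (to the grid resolution) the negative of that error, with the correlation restored to near unity.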

3.8.2.5. Generation of the correlation and distance matrices

Using equation (3.8.3)[link], a correlation matrix is generated in which a set of n patterns is matched with every other to give a symmetric (n × n) correlation matrix ρ with unit diagonal. The matrix ρ can be converted to a Euclidean distance matrix, d, of the same dimensions via[{\bf{d}} = 0.5\left({1.0 - \boldrho } \right) \eqno(3.8.7)]or a distance-squared matrix,[{\bf{D}} = 0.25{\left({1 - \boldrho } \right)^2}. \eqno(3.8.8)]For each entry i, j in d, [0.0 \le {d_{ij}} \le 1.0]. A correlation coefficient of 1.0 translates to a distance of 0.0, a coefficient of −1.0 to 1.0, and zero to 0.5. There are other methods of generating a distance matrix from ρ (see, for example, Gordon, 1981[link], 1999[link]), but we have found this to be both simple and as effective as any other.

For other purposes a similarity matrix s is also needed, whose elements are defined via[s_{ij} = 1 - d_{ij}/d^{\max } ,\eqno(3.8.9)]where dmax is the maximum distance in matrix d. A dissimilarity matrix, δ, is also generated with elements[\delta _{ij} = d_{ij}/d^{\max}. \eqno(3.8.10)]In some cases it can be advantageous to use I1/2 in the distance-matrix generation; this can enhance the sensitivity of the clustering to weak peaks (Butler et al., 2019[link]).
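The conversions of equations (3.8.7)–(3.8.10) amount to a few array operations; a sketch (illustrative naming, assuming NumPy):

```python
import numpy as np

def matrices_from_rho(rho):
    """Derive the distance, squared-distance, similarity and dissimilarity
    matrices of equations (3.8.7)-(3.8.10) from a correlation matrix rho."""
    rho = np.asarray(rho, float)
    d = 0.5 * (1.0 - rho)          # distance, equation (3.8.7)
    D = 0.25 * (1.0 - rho) ** 2    # squared distance, equation (3.8.8)
    dmax = d.max()
    s = 1.0 - d / dmax             # similarity, equation (3.8.9)
    delta = d / dmax               # dissimilarity, equation (3.8.10)
    return d, D, s, delta
```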

3.8.3. Cluster analysis

Cluster analysis uses d (or s, or δ) to partition the patterns into groups based on the similarity of their diffraction profiles. Associated with clustering are a number of important ancillary techniques, all of which will be discussed here. A flowchart of these methods is shown in Fig. 3.8.4[link].

3.8.3.1. Dendrograms

Using d and s, agglomerative, hierarchical cluster analysis is now carried out, in which the patterns are put into clusters as defined by their distances from each other. [Gordon (1981[link], 1999[link]) and Everitt et al. (2001[link]) provide excellent and detailed introductions to the subject. Note that the two editions of Gordon's monograph are quite distinct and complementary.] The method begins with a situation in which each pattern is considered to be in a separate cluster. It then searches for the two patterns with the shortest distance between them, and joins them into a single cluster. This continues in a stepwise fashion until all the patterns form a single cluster. When two clusters (Ci and Cj) are merged, there is the problem of defining the distance between the newly formed cluster [C_i\cup C_j] and any other cluster Ck. There are a number of different ways of doing this, and each one gives rise to a different clustering of the patterns, although often the difference can be quite small. A general algorithm has been proposed by Lance & Williams (1967[link]), and is summarized in a simplified form by Gordon (1981[link]). The distance from the new cluster formed by merging Ci and Cj to any other cluster Ck is given by[\eqalignno{d({{C_i} \cup {C_j},\,{C_k}}) &= {\alpha _i}d({{C_i},{C_k}}) + {\alpha _j}d({{C_j},{C_k}}) + \beta d({{C_i},{C_j}})&\cr&\quad + \gamma \left| {d({{C_i},{C_k}}) - d({{C_j},{C_k}})} \right|. &(3.8.11)}]There are many possible clustering methods. Table 3.8.1[link] defines six commonly used clustering methods, defined in terms of the parameters α, β and γ. All these methods can be used with powder data; in general, the group-average-link or single-link formalism is the most effective, although differences between the methods are often slight.

Table 3.8.1
Six commonly used clustering methods

For each method, the coefficients αi, β and γ in equation (3.8.11)[link] are given.

Method                  αi                          β                       γ
Single link ½ 0 −½
Complete link ½ 0 ½
Average link ni/(ni + nj) 0 0
Weighted-average link ½ 0 0
Centroid ni/(ni + nj) −ninj/(ni + nj)^2 0
Sum of squares (ni + nk)/(ni + nj + nk) −nk/(ni + nj + nk) 0
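The Lance–Williams update of equation (3.8.11) can be turned into a compact, if naive, agglomeration loop. The sketch below (our own illustrative code, O(n^3) and intended only to make the recurrence concrete) implements the single-link and complete-link rows of Table 3.8.1; the size-dependent methods would additionally need the cluster sizes ni:

```python
import numpy as np

# Lance-Williams coefficients (alpha_i, alpha_j, beta, gamma) for two of the
# fixed-coefficient methods of Table 3.8.1
METHODS = {
    "single":   (0.5, 0.5, 0.0, -0.5),
    "complete": (0.5, 0.5, 0.0,  0.5),
}

def agglomerate(d, method="single"):
    """Naive hierarchical clustering driven by the Lance-Williams update of
    equation (3.8.11). Returns the merges as (cluster_a, cluster_b, distance),
    with clusters labelled by frozensets of pattern indices."""
    ai, aj, beta, gamma = METHODS[method]
    d = np.asarray(d, float).copy()
    active = {k: frozenset([k]) for k in range(len(d))}
    merges = []
    while len(active) > 1:
        keys = sorted(active)
        # find the closest pair of currently active clusters
        i, j = min(((p, q) for p in keys for q in keys if p < q),
                   key=lambda pq: d[pq[0], pq[1]])
        merges.append((active[i], active[j], d[i, j]))
        # Lance-Williams: distance from the merged cluster to every other
        for k in keys:
            if k in (i, j):
                continue
            d[i, k] = d[k, i] = (ai * d[i, k] + aj * d[j, k]
                                 + beta * d[i, j]
                                 + gamma * abs(d[i, k] - d[j, k]))
        active[i] = active[i] | active[j]
        del active[j]
    return merges
```

With these coefficients the update reproduces the familiar behaviour: single link propagates the minimum of the two old distances, complete link the maximum.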

The results of cluster analysis are usually displayed as a dendrogram, a typical example of which is shown in Fig. 3.8.6[link](a), where a set of 13 powder patterns is analysed using the centroid method. Each pattern begins at the bottom of the plot as a separate cluster, and these amalgamate in stepwise fashion linked by horizontal tie bars. The height of the tie bar represents a similarity measure as measured by the relevant distance. As an indication of the differences that can be expected in the various algorithms used for dendrogram generation, Fig. 3.8.6[link](e) shows the same data analysed using the single-link method. The resulting clustering is slightly different: the similarity measures are larger, and, in consequence, the tie bars are higher on the graph. [For further examples see Barr et al. (2004b[link],c[link]) and Barr, Dong, Gilmore & Faber (2004)[link].]

3.8.3.2. Estimating the number of clusters

An estimate of the number of clusters present in the data set is needed. In terms of the dendrogram, this is equivalent to `cutting the dendrogram', i.e. the placement of a horizontal line across it such that all the clusters as defined by tie lines above this line remain independent and unlinked. The estimation of the number of clusters is an unsolved problem in classification methods. It is easy to see why: the problem depends on how similar the patterns need to be in order to be classed as the same, and how much variability is allowed within a cluster. We use two approaches: (a) eigenvalue analysis of the matrices ρ and A, and (b) tests based on cluster analysis.

Eigenvalue analysis is a well established technique: the eigenvalues of the relevant matrix are sorted in descending order, and the number of eigenvalues needed to account for a fixed percentage (typically 95%) of the data variability is selected. This is shown graphically via a scree plot, an example of which is shown in Fig. 3.8.2[link].
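As a sketch of the scree criterion (illustrative code, not a published algorithm), the eigenvalue count can be read off the cumulative spectrum:

```python
import numpy as np

def n_components_scree(matrix, fraction=0.95):
    """Count how many eigenvalues, sorted in descending order, are needed to
    account for a fixed fraction of the total variability: the quantity read
    off a scree plot."""
    evals = np.linalg.eigvalsh(np.asarray(matrix, float))[::-1]
    evals = np.clip(evals, 0.0, None)   # guard against tiny negative round-off
    cum = np.cumsum(evals) / evals.sum()
    return int(np.searchsorted(cum, fraction) + 1)
```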

[Figure 3.8.2]

Figure 3.8.2

Four different methods of estimating the number of clusters present in a set of 23 powder patterns for the drug doxazosin. A total of five polymorphs are present, as well as two mixtures of these polymorphs. (a) A scree plot from the eigenvalue analysis of the correlation matrix; (b) the use of the C test (the coefficients have been multiplied by 100.0), which gives an estimate of five clusters using its local maximum. The γ test estimates that there are seven clusters and the CH test has a local maximum at seven clusters. Numerical details are given in Table 3.8.2[link].

We carry out eigenvalue analysis on the following:

  • (1) Matrix ρ.

  • (2) Matrix A, as described in Section 3.8.3.3[link].

  • (3) A transformed form of ρ in which ρ is standardized to give ρs in which the rows and columns have zero mean and unit variance. The matrix [{\boldrho _s}\boldrho _s^T] is then computed and subjected to eigenanalysis. It tends to give a lower estimate of cluster numbers than (1).

The most detailed study on cluster counting is that of Milligan & Cooper (1985[link]), and is summarized by Gordon (1999[link]). From this we have selected three tests that seem to operate effectively with powder data:

  • (4) The Calinški & Harabasz (1974[link]) (CH) test:[{\rm CH}(c) = {{B/({c - 1})} \over {W/({n - c})}}. \eqno(3.8.12)]A centroid is defined for each cluster. W denotes the total within-cluster sum of squared distances about the cluster centroids, and B is the total between-cluster sum of squared distances. Parameter c is the number of clusters chosen to maximize CH.

  • (5) A variant of Goodman & Kruskal's (1954[link]) γ test, as described by Gordon (1999[link]). The dissimilarity matrix is used. A comparison is made between all the within-cluster dissimilarities and all the between-cluster dissimilarities. Such a comparison is marked as concordant if the within-cluster dissimilarity is less than the between-cluster dissimilarity, and discrepant otherwise. Equalities, which are unusual, are disregarded. If S+ is the number of concordant and S− the number of discrepant comparisons, then[\gamma \left(c \right) = {{{S_ + } - {S_ - }} \over {{S_ + } + {S_ - }}}. \eqno(3.8.13)]A maximum in γ is sought by an appropriate choice of cluster numbers.

  • (6) The C test (Milligan & Cooper, 1985[link]). This chooses the value of c that minimizes[{\rm C}\left(c \right) = {{W(c) - {W_{\min }}} \over {{W_{\max }} - {W_{\min }}}}. \eqno(3.8.14)]W(c) is the sum of all the within-cluster dissimilarities. If the partition has a total of r such dissimilarities, then Wmin is the sum of the r smallest dissimilarities and Wmax is the sum of the r largest.
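The γ test of equation (3.8.13) is straightforward to state in code; the following illustrative sketch evaluates it for one candidate partition (a full cluster-count search would call it once for each cut of the dendrogram):

```python
import numpy as np

def goodman_kruskal_gamma(delta, labels):
    """Gamma statistic of equation (3.8.13): every within-cluster
    dissimilarity is compared with every between-cluster dissimilarity; a
    comparison is concordant when the within value is the smaller, discrepant
    when it is the larger, and equalities are disregarded."""
    delta = np.asarray(delta, float)
    labels = np.asarray(labels)
    n = len(labels)
    within, between = [], []
    for i in range(n):
        for j in range(i + 1, n):
            (within if labels[i] == labels[j] else between).append(delta[i, j])
    s_plus = sum(1 for w in within for b in between if w < b)
    s_minus = sum(1 for w in within for b in between if w > b)
    return (s_plus - s_minus) / (s_plus + s_minus)
```

A perfect partition of tight, well separated clusters gives γ = 1; systematically mixing the clusters drives γ towards −1.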

The results of tests (4)–(6) depend on the clustering method being used. To reduce the bias towards a given dendrogram method, these tests are carried out on four different clustering methods: the single-link, the group-average, the sum-of-squares and the complete-link methods. Thus there are 12 semi-independent estimates of the number of clusters from clustering methods, and three from eigenanalysis, making 15 in all.

A composite algorithm is used to combine these estimates. The maximum and minimum values of the number of clusters (cmax and cmin, respectively) given by the eigenanalysis results [(1)–(3) above] define the primary search range; tests (4)–(6) are then used in the range [\max ({c_{\min }} - 3,0) \le c \le \min ({c_{\max }} + 3,n)] to find local maxima or minima as appropriate. The results are averaged, any outliers are removed, and a weighted mean of the remaining indicators is taken; this is used as the final estimate of the number of clusters. Confidence levels for c are also defined by the estimates of the maximum and minimum cluster numbers after any outliers have been removed.

A typical set of results for the PXRD data from 23 powder patterns for doxazosin (an anti-hypertension drug) in which five polymorphs are present, as well as two mixtures of polymorphs, is shown in Fig. 3.8.2[link](a) and (b) (see also Table 3.8.2[link]). The scree plot arising from the eigenanalysis of the correlation matrix indicates that 95% of the variability can be accounted for by five components, and this is shown in Fig. 3.8.2[link](a). Eigenvalues from other matrices indicate that four clusters are appropriate. A search for local optima in the CH, γ and C tests is then initiated in the range 2–8 possible clusters. Four different clustering methods are tried, and the results indicate a range of 4–7 clusters. There are no outliers, and the final weighted mean value of 5 is calculated. As Fig. 3.8.2[link](b) shows, the optimum points for the C and γ tests are often quite weakly defined (Barr et al., 2004b[link]).

Table 3.8.2
Estimate of the number of clusters for the 23 sample data set for doxazosin

There are five polymorphs present, plus two mixtures of these polymorphs. The maximum estimate is 7; the minimum estimate is 4; the combined weighted estimate of the number of clusters is 6, and the median value is 5. The dendrogram cut level is set to give 5 clusters, and the lower and upper confidence limits are 4 and 7, respectively.

Method                                                  No. of clusters
Principal-component analysis (non-transformed matrix) 5
Principal-component analysis (transformed matrix) 4
Multidimensional metric scaling 4
γ statistic using single linkage 7
CH statistic using single linkage 7
C statistic using single linkage
γ statistic using group averages 7
CH statistic using group averages 5
C statistic using group averages
γ statistic using sum of squares
CH statistic using sum of squares 5
C statistic using sum of squares
γ statistic using complete linkage
CH statistic using complete linkage 5
C statistic using complete linkage

3.8.3.3. Metric multidimensional scaling

This is, in its essentials, the particle-in-a-box problem. Each powder pattern is represented as a single sphere, and these spheres are placed in a cubic box of unit dimensions such that the positions of the spheres reproduce as closely as possible the distance matrix, d, generated from correlating the patterns. The spheres have an arbitrary orientation in the box.

To do this, the (n × n) distance matrix d is used in conjunction with metric multidimensional scaling (MMDS) to define a set of p underlying dimensions that yield a Euclidean distance matrix, dcalc, whose elements are equivalent to or closely approximate the elements of d.

The method works as follows (Cox & Cox, 2000[link]; Gower, 1966[link]; Gower & Dijksterhuis, 2004[link]).

The matrix d has zero diagonal elements, and so is not positive semidefinite. A positive semidefinite matrix, A (n × n), can be constructed, however, by computing[{\bf A} = - {1 \over 2}\left [{\bf I}_n - {1 \over n}{\bf i}_n{\bf i}_n^\prime \right]{\bf{D}}\left [{{{\bf{I}}_n} - {1 \over n}{{\bf{i}}_n}{\bf{i}}_n^\prime} \right] ,\eqno(3.8.15)]where In is an (n × n) identity matrix, in is an (n × 1) vector of unities and D is defined in equation (3.8.8)[link]. The matrix [ [{{{\bf{I}}_n} - ({1 /n}){{\bf{i}}_n}{\bf{i}}_n^\prime}]] is called a centring matrix, since A has been derived from D by centring the rows and columns.

The eigenvectors v1, v2,…, vn and the corresponding eigenvalues λ1, λ2,…, λn are then obtained. A total of p eigenvalues of A are positive and the remaining (n − p) will be zero. For the p non-zero eigenvalues a set of coordinates can be defined via the matrix X(n × p),[{\bf{X}} = {\bf{V}}{\Lambda ^{1/2}}, \eqno(3.8.16)]where V is the (n × p) matrix of the corresponding eigenvectors and Λ is the diagonal matrix of the p positive eigenvalues.

If p = 3, then we are working in three dimensions, and the X(n × 3) matrix can be used to plot each pattern as a single point in a 3D graph. This assumes that the dimensionality of the problem can be reduced in this way while still retaining the essential features of the data. As a check, a distance matrix dcalc can be calculated from X(n × 3) and correlated with the observed matrix d using both the Pearson and Spearman correlation coefficients. In general the MMDS method works well, and correlation coefficients greater than 0.95 are common. For large data sets this can reduce to ∼0.7, which is still sufficiently high to suggest the viability of the procedure. Parallel coordinates based on the MMDS analysis can also be used, and this is discussed in Sections 3.8.4.2.1[link] and 3.8.4.2.2[link].
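The centring and eigendecomposition of equations (3.8.15) and (3.8.16) correspond to classical scaling and can be sketched in a few lines of NumPy (illustrative naming; the clip guards against round-off in the nominally zero eigenvalues):

```python
import numpy as np

def mmds(D, p=3):
    """Classical (metric) multidimensional scaling: double-centre the
    squared-distance matrix D as in equation (3.8.15), eigendecompose, and
    keep the p largest non-negative eigenvalues as in equation (3.8.16)."""
    D = np.asarray(D, float)
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n       # centring matrix
    A = -0.5 * J @ D @ J
    evals, evecs = np.linalg.eigh(A)          # ascending order
    idx = np.argsort(evals)[::-1][:p]         # take the p largest
    lam = np.clip(evals[idx], 0.0, None)
    return evecs[:, idx] * np.sqrt(lam)       # X = V Lambda^(1/2)
```

When the data are genuinely p-dimensional the pairwise distances computed from X reproduce the input distances exactly (up to rotation and reflection of the configuration), which is the check described above.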

There are occasions in which the underlying dimensionality of the data is 1 or 2, and in these circumstances the data project onto a plane or a line in an obvious way without any problems.

An example of an MMDS plot is shown in Fig. 3.8.6[link](b), which is linked to the dendrogram in Fig. 3.8.6[link](a).

3.8.3.4. Principal-component analysis

It is also possible to carry out principal-component analysis (PCA) on the correlation matrix. The eigenvalues of the correlation matrix can be used to estimate the number of clusters present via a scree plot, as shown in Fig. 3.8.2[link](a). The eigenvectors can be used to generate a score plot, which is an X(n × 3) matrix that can be used as a visualization tool in exactly the same way as the MMDS method to indicate which patterns belong to which class. Score plots traditionally use two components, with the data thus projected onto a plane; we use 3D plots in which three components are represented. In general, we find that the MMDS representation of the data is nearly always superior to PCA for powder and spectroscopic data.

3.8.3.5. Choice of clustering method

It is possible to use the MMDS plot (or, alternatively, PCA score plots) to assist in the choice of clustering method, since the two methods operate semi-independently. The philosophy here is to choose a technique that results in the tightest, most isolated clusters as follows:

  • (1) The MMDS formalism is used to derive a set of three-dimensional coordinates stored in matrix X(n × 3).

  • (2) The number of clusters, c, is estimated as described in Section 3.8.3.2[link].

  • (3) Each of six dendrogram methods (see Table 3.8.1[link]) is employed in turn, stopping when c clusters have been generated. Each entry in X can now be assigned to a cluster.

  • (4) A sphere is drawn around each point in X and the average between-cluster overlap of the spheres is calculated for each of the N clusters C1 to CN. If the total number of overlaps is m, this can be written as[S = {1 \over m}\sum\limits_{i = 1}^n \,\sum\limits_{j = 1,\, j \ne i}^n \int_V s_{i \in {C_i}}\,s_{j \in {C_j}}\,{\rm d}s. \eqno(3.8.17)]If the clusters are well defined then S should be a minimum. Conversely, poorly defined clusters will tend to have large values of S. In the algorithm used in PolySNAP (Barr, Dong & Gilmore, 2009[link]) and DIFFRAC.EVA (Bruker, 2018[link]), the sphere size depends on the number of diffraction patterns.

  • (5) The tightness of each cluster is also estimated by computing the mean within-cluster distance. This should also be a minimum for well defined, tight clusters.

  • (6) The mean within-cluster distance from the centroid of the cluster can also be computed, which should also be a minimum.

  • (7) Steps (4)–(6) are repeated using coordinates derived from PCA 3D score plots.

  • (8) Tests (4)–(7) are combined in a weighted, suitably scaled mean to give an overall figure of merit (FOM); the minimum is used to select which dendrogram method to use (Barr et al., 2004b[link]).

3.8.3.6. The most representative sample

Similar techniques can be used to identify the most representative sample in a cluster. This is defined as the sample that has the minimum mean distance from every other sample in the cluster, i.e. for cluster J containing m patterns, the most representative sample, i, is defined as that which gives[\min_i \left({\sum\limits_{j = 1,\, i,j \in J}^m {d({i,j} )/m} } \right). \eqno(3.8.18)]The most representative sample is useful in visualization and can, with care, be used to create a database of known phases (Barr et al., 2004b[link]).
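Equation (3.8.18) reduces to an argmin over the row means of the within-cluster block of d; an illustrative sketch:

```python
import numpy as np

def most_representative(d, members):
    """Return the member of a cluster with the minimum mean distance to the
    other members, as in equation (3.8.18)."""
    d = np.asarray(d, float)
    members = list(members)
    # restrict d to the within-cluster block and average each row
    mean_dist = d[np.ix_(members, members)].mean(axis=1)
    return members[int(np.argmin(mean_dist))]
```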

3.8.3.7. Amorphous samples

Amorphous samples are an inevitable consequence of high-throughput experiments, and need to be handled correctly if they are not to lead to erroneous indications of clustering. To identify amorphous samples the total background for each pattern is estimated and its intensity integrated; the integrated intensity of the non-background signal is then calculated. If the ratio falls below a preset limit (usually 5%, but this may vary with the type of samples under study) the sample is treated as amorphous. The distance matrix is then modified so that each amorphous sample is given a distance and dissimilarity of 1.0 from every other sample, and a correlation coefficient of zero. This automatically excludes the samples from the clustering until the last amalgamation steps, and also limits their effect on the estimation of the number of clusters (Barr et al., 2004[link]b). Of course, the question of amorphous samples is not a binary (yes/no) one: there are usually varying degrees of amorphous content, which further complicates matters.
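A minimal sketch of the amorphous test, assuming the background curve has already been estimated by some separate procedure, and taking the ratio as non-background signal over total integrated intensity (the precise denominator is not spelled out above, so that choice is ours):

```python
import numpy as np

def is_amorphous(intensity, background, limit=0.05):
    """Flag a pattern as amorphous when the integrated non-background signal
    falls below `limit` (default 5%) of the integrated total intensity."""
    intensity = np.asarray(intensity, float)
    background = np.asarray(background, float)
    signal = np.clip(intensity - background, 0.0, None).sum()
    return bool(signal / intensity.sum() < limit)
```

In a full implementation the flagged samples would then have their rows and columns of the distance matrix set to 1.0 (and correlations to zero), as described above, rather than being deleted outright.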

3.8.4. Data visualization

3.8.4.1. Primary data visualization

It is important when dealing with large data sets to have suitable visualization tools. These tools are also a valuable resource for exploring smaller data sets. This methodology provides four primary aids:

  • (1) A pie chart is produced for each sample, corresponding to the sample wells used in the data-collection process, in which each well is given a colour as defined by the dendrogram. If mixtures of known phases are detected, the pie charts give the relative proportions of the pure samples as estimated by quantitative analysis (see Section 3.8.7[link]).

  • (2) The dendrogram gives the clusters, the degree of association within the clusters and the differential between a given cluster and its neighbours. Different colours are used to distinguish each cluster. The cut line is also drawn along with the associated confidence levels. The dendrogram is the primary visualization tool.

  • (3) The MMDS method reproduces the data as a 3D plot in which each point represents a single powder pattern. The colour for each point is taken from the dendrogram. The most representative sample for each cluster is marked with a cross.

  • (4) Similarly, the eigenvalues from principal-component analysis can be used to generate a 3D score plot in which each point also represents a powder pattern. Just as in the MMDS formalism, the colour for each point is taken from the dendrogram, and the most representative sample is marked with a cross.

These aids give graphical views of the data that are semi-independent and thus can be used to check for consistency and discrepancies in the clustering. They are also interactive. No one method is optimal, and a combination of mathematical and visualization techniques is required, techniques that often need tuning for each individual application (Barr, Cunningham et al., 2009[link]; Barr, Dong & Gilmore, 2009[link]).

3.8.4.2. Secondary visualization using parallel coordinates, the grand tour and minimum spanning trees

In the MMDS and PCA methods, p = 3 in equation (3.8.16)[link], so that we work in three dimensions; the X matrix can then be used to plot each pattern as a single point in a 3D graph. However, this reduces the dimensionality of the data to three, and the question arises as to the validity of this: are three dimensions sufficient? The use of parallel-coordinates plots coupled with the grand tour can assist here, as well as giving us an alternative view of the data.

3.8.4.2.1. Parallel-coordinates plots


A parallel-coordinates plot is a graphical data-analysis technique for plotting multivariate data. Multivariate data are usually plotted on orthogonal axes; in a parallel-coordinates plot orthogonality is abandoned and replaced with a set of N equidistant parallel axes, one for each variable, labelled X1, X2, X3,…, XN (Inselberg, 1985[link], 2009[link]; Wegman, 1990[link]). Each observation is plotted as a point on every axis and the points belonging to a given observation are joined by a line. The data now become a set of lines. The lines are given the colours of the cluster to which they belong as defined by the current dendrogram. A parallel-coordinates display can be interpreted as a generalization of a two-dimensional scatterplot, and it allows the display of an arbitrary number of dimensions. The method can also be used to validate the clustering itself without using dendrograms. Using this technique it is possible to determine whether the clustering shown by the MMDS (or PCA) plot in three dimensions continues in higher dimensions.
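The per-axis rescaling that underlies such a plot can be sketched as follows; the helper name `parallel_coordinates` and the synthetic 6D clusters are illustrative, and drawing the resulting polylines would be done with any 2D plotting library:

```python
import numpy as np

def parallel_coordinates(x):
    """Rescale each column of an (n x d) data matrix to [0, 1] so that
    each row can be drawn as a polyline across d equidistant parallel
    axes labelled X1..Xd."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant axes
    y = (x - lo) / span                      # one vertex per axis per row
    axes = np.arange(x.shape[1])             # horizontal axis positions
    return axes, y

rng = np.random.default_rng(0)
# Two synthetic 6D clusters, as might come from a six-dimensional MMDS run
data = np.vstack([rng.normal(0, 1, (40, 6)), rng.normal(5, 1, (40, 6))])
axes, lines = parallel_coordinates(data)
```

Each row of `lines` is one pattern; colouring the rows by dendrogram cluster and plotting them against `axes` gives the display described above.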

Fig. 3.8.3[link] shows a typical example for a set of 80 organic samples partitioned into four clusters (Barr, Dong & Gilmore, 2009[link]). The plot shows that the clustering looks realistic when viewed in this way and that it is maintained when the data are examined in six dimensions.

[Figure 3.8.3]

Figure 3.8.3

Example of a parallel-coordinates plot in six dimensions, with axes labeled X1, X2, …, X6, for a set of 80 organic PXRD samples partitioned into four clusters. The plot shows that the clustering looks realistic and that it is maintained when the data are examined in six dimensions.

3.8.4.2.2. The grand tour


The grand tour is a method of animating the parallel-coordinates plot to examine it from all possible viewpoints. Consider a 3D data plot using orthogonal axes: a grand tour takes 2D sections through these data and displays them in parallel-coordinates plots in a way that explores the entire space continuously. Completeness is important because the data can then be seen from all points of view, and continuity allows the user to follow the data without abrupt jumps. This concept was devised by Asimov (1985[link]) and further developed by Wegman (1990[link]). In more than three dimensions it becomes a generalized rotation of all the coordinate axes. A d-dimensional tour is a continuous geometric transformation of a d-dimensional coordinate system such that all possible orientations of the coordinate axes are eventually achieved. The algorithm for generating a smooth and complete view of the data is described by Asimov (1985[link]).

To do this, the restriction of p = 3 in the MMDS calculation is relaxed to 6, so that there is now a 6D data set with six orthogonal axes. The choice of six is somewhat arbitrary – more can be used, but six is sufficient to see whether the clustering is maintained without generating unduly complex plots and requiring extensive computing resources. The data are plotted as a parallel-coordinates plot. The grand-tour method is then applied by a continuous geometric transformation of the 6D coordinate system such that all possible orientations of the axes are achieved. Each orientation is reproduced as a parallel-coordinates plot using six axes.
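One way to realise such a continuous and (eventually) complete rotation is Asimov's torus method, sketched below for six dimensions; the step size and the choice of irrational angular-speed ratios are illustrative assumptions, not the exact parameters of any particular implementation:

```python
import numpy as np
from itertools import combinations

def tour_step(q, t, speeds):
    """Advance a d-dimensional orthogonal frame q by composing one small
    Givens rotation per axis pair. Incommensurate angular speeds ensure
    that all orientations are eventually visited (Asimov's torus method)."""
    d = q.shape[0]
    for (i, j), w in zip(combinations(range(d), 2), speeds):
        theta = w * t
        c, s = np.cos(theta), np.sin(theta)
        g = np.eye(d)
        g[i, i] = g[j, j] = c
        g[i, j], g[j, i] = -s, s
        q = g @ q
    return q

d = 6                                                  # six MMDS dimensions
speeds = np.sqrt(np.arange(2, 2 + d * (d - 1) // 2))   # irrational ratios
q = np.eye(d)
q = tour_step(q, t=0.01, speeds=speeds)
# The frame stays orthogonal, so each projected view is undistorted
ortho_err = np.abs(q @ q.T - np.eye(d)).max()
```

Applying `q` to the 6D MMDS coordinates at each animation step and redrawing the parallel-coordinates plot gives the animated tour described in the text.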

Figs. 3.8.9[link](j) and (k) show an example from the clustering of the 13 aspirin samples using PXRD data. Fig. 3.8.9[link](j) shows the default parallel-coordinates plot. Fig. 3.8.9[link](k) shows alternative views of the data taken from the grand tour. In Fig. 3.8.9[link](j) there appears to be considerable overlap between clusters in the 4th, 5th and 6th dimensions (X4, X5 and X6), but the alternative view given in Fig. 3.8.9[link](k) show that the clustering is actually well defined in all six dimensions (Barr, Dong & Gilmore, 2009[link]).

3.8.4.2.3. Powder data as a tree: the minimum spanning tree


The minimum spanning tree (MST) displays the MMDS plot as a tree whose points are the data from the MMDS calculation (in three dimensions) and whose weights are the distances between these points. The minimum-spanning-tree problem is that of joining the points with a minimum total edge weight. (As an example, airlines use minimum spanning trees to work out their basic route systems: the best set of routes taking into account airport hubs, passenger numbers, fuel costs etc. is the minimum spanning tree.) Because a tree is used, each point is only allowed a maximum of three connections to other points.

To do this Kruskal's (1956[link]) algorithm can be used: the edges are examined in order of increasing weight, and an edge is added if it links two previously unconnected points (i.e. it does not create a cycle); otherwise it is discarded. This process continues until the tree is complete. An example is shown in Fig. 3.8.7[link] for the 13-sample aspirin data. A complete tree for this data set using three dimensions and the MMDS-derived coordinates is shown in Fig. 3.8.7[link](a). This has 12 links between the 13 data points. Reducing the number of links to 10 gives Fig. 3.8.7[link](b).
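A minimal sketch of Kruskal's algorithm applied to a distance matrix of 3D MMDS-style coordinates follows; the union–find helper and the four test points are illustrative:

```python
import numpy as np

def kruskal_mst(dist):
    """Minimum spanning tree from a symmetric distance matrix using
    Kruskal's algorithm: take edges in order of increasing weight and
    keep an edge only if it joins two previously unconnected components."""
    n = dist.shape[0]
    parent = list(range(n))

    def find(a):                      # union-find with path halving
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    edges = sorted((dist[i, j], i, j)
                   for i in range(n) for j in range(i + 1, n))
    tree = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                  # no cycle: accept the edge
            parent[ri] = rj
            tree.append((i, j, w))
    return tree

# Four points as in a small 3D MMDS embedding; the last one is an outlier
pts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [5, 5, 5]], float)
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
tree = kruskal_mst(dist)
total = sum(w for _, _, w in tree)
```

An n-point tree always has n − 1 links (here 3); deleting the longest links, as in Fig. 3.8.7(b), breaks the tree into candidate clusters.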

3.8.5. Further validating and visualizing clusters: silhouettes and fuzzy clustering


Other techniques exist to validate the clusters, and these are discussed here.

3.8.5.1. Silhouettes


Silhouettes (Rousseeuw, 1987[link]; Kaufman & Rousseeuw, 1990[link]) are a property of every member of a cluster and define a coefficient of cluster membership. To compute them, the dissimilarity matrix, δ, is used. If pattern i belongs to cluster Cr, which contains nr patterns, we define[{a_i} = \sum\limits_{\scriptstyle j \in {C_r} \hfill \atop \scriptstyle j \ne i \hfill} \delta _{ij} / {\left({{n_r} - 1} \right)}. \eqno(3.8.19)]This defines the average dissimilarity of pattern i to all the other patterns in cluster Cr. Further define[{b_i} = {{\rm min}_{s \ne r}}\left\{ {\sum\limits_{j \in {C_s}} {{\delta _{ij}}/{n_s}} } \right\} .\eqno(3.8.20)]The silhouette for pattern i is then[{h_i}\, = \,{{{b_i} - {a_i}} \over {\max \left({{a_i},\,{b_i}} \right)}}. \eqno(3.8.21)]Clearly [-1\leq h_i \leq 1.0]. It is not possible to define silhouettes for clusters with only one member (singleton clusters). Silhouettes are displayed such that each cluster is represented as a histogram of frequency plotted against silhouette values so that one can look for outliers or poorly connected patterns.

From our experience with powder data collected in reflection mode on both organic and inorganic samples (Barr et al., 2004b[link]), we conclude that for any given pattern

  • (1) [h_i\,\gt\, 0.5] implies that pattern i is probably correctly classified;

  • (2) [0.2 \,\lt\, h_i \,\lt\, 0.5] implies that pattern i should be inspected, since it may belong to a different or new cluster;

  • (3) [h_i \,\lt\, 0.2] implies that pattern i belongs to a different or new cluster.
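Equations (3.8.19)–(3.8.21) translate directly into code. The sketch below uses an illustrative four-pattern dissimilarity matrix and assigns h_i = 0 to singleton clusters, for which silhouettes are undefined:

```python
import numpy as np

def silhouettes(delta, labels):
    """Silhouette h_i for each pattern, from a dissimilarity matrix delta
    and integer cluster labels, following equations (3.8.19)-(3.8.21)."""
    n = delta.shape[0]
    h = np.zeros(n)
    for i in range(n):
        own = (labels == labels[i])
        if own.sum() < 2:
            continue                                   # singleton: undefined
        # a_i: mean dissimilarity to the other members of the same cluster
        a = delta[i, own].sum() / (own.sum() - 1)
        # b_i: smallest mean dissimilarity to any other cluster
        b = min(delta[i, labels == k].mean()
                for k in set(labels) - {labels[i]})
        h[i] = (b - a) / max(a, b)
    return h

# Two tight, well separated pairs: all silhouettes should be close to 1
delta = np.array([[0.0, 0.1, 1.0, 1.0],
                  [0.1, 0.0, 1.0, 1.0],
                  [1.0, 1.0, 0.0, 0.1],
                  [1.0, 1.0, 0.1, 0.0]])
labels = np.array([0, 0, 1, 1])
h = silhouettes(delta, labels)
```

All four values come out at 0.9, i.e. well above the 0.5 threshold of criterion (1) above, as expected for clearly separated clusters.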

The use of silhouettes in defining the details of the clustering is shown for the aspirin data in Fig. 3.8.8[link]. The silhouettes for the red cluster corresponding to the dendrogram in Fig. 3.8.6[link](a) are shown in Fig. 3.8.8[link](a) and those for the corresponding orange cluster are shown in Fig. 3.8.8[link](b). Both sets of silhouettes have values < 0.5, which indicates that the clustering is not optimally defined. When the cut line is moved to give the dendrogram in Fig. 3.8.6[link](c), the silhouettes for the red cluster are shown in Fig. 3.8.8[link](c). The entry centred on a silhouette value of 0.15 is pattern 3. This implies that pattern 3 is only loosely connected to the cluster and this is demonstrated in Fig. 3.8.8[link](d) where pattern 3 and the most representative pattern for the cluster (No. 9) are superimposed. Although there is a general sense of similarity there are significant differences and the combined correlation coefficient is only 0.62. In Fig. 3.8.8[link](e), the silhouettes for the orange cluster are shown. They imply that this is a single cluster without outliers. The silhouettes for the green cluster corresponding to the dendrogram in Fig. 3.8.6[link](c) are shown in Fig. 3.8.8[link](f). The clustering is poorly defined here.

3.8.5.2. Fuzzy clustering


In standard clustering methods a set of n diffraction patterns is partitioned into c disjoint clusters. Cluster membership is defined via a membership matrix U(n × c), where individual coefficients, uik, represent the membership of pattern i in cluster k. The coefficients are equal to unity if pattern i belongs to cluster k and zero otherwise, i.e.[{u_{ik}} \in \,\left\{ {0,1} \right\}\quad(i = 1,\ldots,n\semi k = 1,\ldots,c). \eqno(3.8.22)]If these constraints are relaxed, such that[0 \le {u_{ik}} \le 1\quad\left({i = 1,\ldots,n\semi k = 1,\ldots,c} \right), \eqno(3.8.23)][0\, \lt\, \textstyle\sum\limits_{i = 1}^n {{u_{ik}} \,\lt\, n\quad \left({k = 1,\ldots,c} \right)}\eqno(3.8.24)]and[\textstyle\sum\limits_{k = 1}^c {u_{ik}^{}} = 1, \eqno(3.8.25)]then fuzzy clusters are generated, in which there is the possibility that a pattern can belong to more than one cluster (see, for example, Everitt et al., 2001[link]; Sato et al., 1966[link]). Such a situation is quite feasible in the case of powder diffraction, for example when mixtures are involved. It is described in detail by Barr et al. (2004b[link]).
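The text does not prescribe a particular fuzzy algorithm; as an illustration, the standard fuzzy c-means membership update (with fuzzifier m = 2, an assumed value) produces a U matrix satisfying constraints (3.8.23)–(3.8.25):

```python
import numpy as np

def fuzzy_memberships(d, m=2.0):
    """Fuzzy c-means-style membership matrix U(n x c) computed from an
    (n x c) matrix of pattern-to-centre distances. Each row sums to 1,
    so a pattern can belong partly to several clusters."""
    eps = 1e-12                                   # guard against zero distance
    # ratio[i, k, j] = d_ik / d_ij
    ratio = (d[:, :, None] + eps) / (d[:, None, :] + eps)
    u = 1.0 / (ratio ** (2.0 / (m - 1.0))).sum(axis=2)
    return u

# Three patterns vs two cluster centres; the last pattern is equidistant,
# as a 50/50 mixture of two phases might be
d = np.array([[0.1, 2.0],
              [2.0, 0.1],
              [1.0, 1.0]])
u = fuzzy_memberships(d)
```

The first two patterns receive near-unity membership of their own cluster, while the equidistant pattern is split 50/50 between the two, exactly the behaviour wanted for mixtures.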

3.8.5.3. The PolySNAP program and DIFFRAC.EVA


All these techniques have been incorporated into the PolySNAP computer program (Barr et al., 2004a[link],b[link],c[link]; Barr, Dong, Gilmore & Faber, 2004[link]; Barr, Dong & Gilmore, 2009[link]), which was developed from the SNAP-D software (Barr, Gilmore & Paisley, 2004[link]). PolySNAP has subsequently been incorporated into the Bruker DIFFRAC.EVA program (Bruker, 2018[link]), and the following sections are based on its use.

3.8.6. Examples


All the elements for clustering and visualization are now in place. Fig. 3.8.4[link] shows this as a flowchart. Hitherto we have looked at elements of the aspirin data to demonstrate how methods work; we now examine the aspirin data in detail as a single analysis.

[Figure 3.8.4]

Figure 3.8.4

Flowchart for the cluster-analysis and data-visualization procedure described in this chapter. The light grey boxes denote data-visualization elements and the dark grey objects are optional data pre-processing operations.

3.8.6.1. Aspirin data


In this example we use 13 powder patterns from commercial aspirin samples collected in reflection mode on a Bruker D8 diffractometer. Since these samples include fillers, the active pharmaceutical ingredient (API) and other formulations, it is not surprising that peak widths are high: ∼0.5° full width at half maximum (FWHM). The data-collection range was 10–43° in 2θ using Cu Kα radiation. The 13 powder data sets are shown in Fig. 3.8.5[link] arranged into groups based on similarity. We have already described the methods of analysis and have shown typical results in Figs. 3.8.6 to 3.8.8[link][link][link], and now present detailed examples. The correlation matrix derived from equation (3.8.3)[link] is shown in Fig. 3.8.9[link](a), colour coded to reflect the values of the coefficients; the darker the shade, the higher the correlation. The resulting dendrogram and MMDS plot are shown in Figs. 3.8.9[link](b) and (c), respectively. Four clusters are identified in the dendrogram and these have been appropriately coloured. Other visualization tools are now shown. In Fig. 3.8.9[link](d) the pie chart is displayed; the number of rows can be adjusted to reflect the arrangement of the samples in a multiple sample holder. Fig. 3.8.9[link](e) shows the default minimum spanning tree with 12 links. In Fig. 3.8.9[link](f) the scree plot indicates that three clusters will account for more than 95% of the data variability. The steep initial slope is a clear indication of good cluster estimation. The silhouettes are shown in Figs. 3.8.9[link](g)–(i). These were discussed in Section 3.8.5.1[link]. In Fig. 3.8.9[link](j) the default parallel-coordinates plot for the same data is shown, and in Fig. 3.8.9[link](k) there is another view taken from the grand tour. These two plots validate the clustering and also indicate that there is no significant error introduced into the MMDS plot by truncating it into three dimensions.

[Figure 3.8.5]

Figure 3.8.5

Powder patterns for 13 commercial aspirin samples partitioned into five sets. The patterns are in highly correlated sets: (a) comprises patterns 1, 3, 5, 6, 9 and 12; (b) comprises patterns 10, 11 and 13; (c) contains patterns 2 and 4; (d) contains pattern 7 and (e) contains pattern 8.

[Figure 3.8.6]

Figure 3.8.6

(a) The initial default dendrogram using the centroid clustering method on 13 PXRD patterns from 13 commercial aspirin samples. (b) The corresponding MMDS plot. It can be seen that both clusters have a natural break in them and should be partitioned into two clusters. (c) The dendrogram cut line is reduced. (d) The corresponding MMDS plot. The red cluster is now partitioned into two; the remaining patterns are a light-blue singleton and a green triplet cluster. (e) The default dendrogram using the single-link method.

[Figure 3.8.7]

Figure 3.8.7

The use of minimum spanning trees (MSTs). (a) The MST with 12 links. (b) The MST with 10 links; three clusters are now present.

[Figure 3.8.8]

Figure 3.8.8

The use of silhouettes in defining the details of the clustering. (a) The silhouettes for the red cluster in the dendrogram from Fig. 3.8.6(a)[link]. (b) The corresponding orange cluster. Both sets of silhouettes have values that are less than 0.5, which indicates that the clustering is not well defined. (c) The silhouettes for the red cluster corresponding to the dendrogram in Fig. 3.8.6(c)[link]. The entry centred on a silhouette value of 0.15 is pattern 3. This implies that pattern 3 is only loosely connected to the cluster and this is demonstrated in part (d), where pattern 3 and the most representative pattern for the cluster (No. 9) are superimposed. Although there is a general sense of similarity there are significant differences and the combined correlation coefficient is only 0.62. (e) The silhouettes for the orange cluster corresponding to the dendrogram in Fig. 3.8.6(c)[link]. The silhouettes imply that this is a single cluster without outliers. (f) The silhouettes for the green cluster corresponding to the dendrogram in Fig. 3.8.6[link](c). The clustering is poorly defined here.

[Figure 3.8.9]

Figure 3.8.9

The complete cluster analysis for the aspirin samples. (a) The correlation matrix, which is the source of all the clustering results. The entries are colour coded: the darker the shade, the higher the correlation. (b) The dendrogram. The colours assigned to the samples are used in all the visualization tools. (c) The corresponding MMDS plot. The clustering defined by the dendrogram is well defined. (d) The pie-chart view. (e) The minimum spanning tree. (f) The scree plot. It indicates that three clusters explain 95% of the variance of the distance matrix derived from (a). (g)–(i) The silhouettes for the red, the orange and the green clusters, respectively. These are discussed in detail in the caption to Fig. 3.8.8[link]. (j) The default parallel-coordinates plot. The clusters are well maintained into the 4th, 5th and 6th dimensions. (k) Another view of the parallel coordinates using the grand tour. The clustering remains well maintained in higher dimensions.

3.8.6.1.1. Aspirin data with amorphous samples included


As a demonstration of the handling of data from amorphous samples, five patterns for amorphous samples were included in the aspirin data and the clustering calculation was repeated. The results are shown in Fig. 3.8.10[link]. Fig. 3.8.10[link](a) shows the dendrogram. It can be seen that the amorphous samples are positioned as isolated clusters on the right-hand end. They also appear as an isolated cluster in the MMDS plot and the parallel-coordinates plots, as shown in Figs. 3.8.10[link](b) and (c). It could be argued that these samples should be treated as a single, five-membered cluster rather than five individuals, but we have found that this confuses the clustering algorithms, and it is clearer to the user if the data from amorphous samples are presented as separate classes.

[Figure 3.8.10]

Figure 3.8.10

The aspirin data including data from five amorphous samples. (a) The resulting dendrogram and (b) the corresponding MMDS plot. (c) The parallel-coordinates plot.

3.8.6.2. Phase transitions in ammonium nitrate


Ammonium nitrate exhibits temperature-induced phase transformations. Between 256 and 305 K it crystallizes in the orthorhombic space group Pmmn with a = 5.745, b = 5.438, c = 4.942 Å and Z = 2; from 305 to 357 K it crystallizes in Pbnm with a = 7.14, b = 7.65, c = 5.83 Å with Z = 4; between 357 and 398 K it crystallizes in the tetragonal space group [P\overline 4 {2_1}m] with a = 5.719, c = 4.932 Å, Z = 2, and above 398 K it transforms to the cubic space group [Pm\overline 3 m] with a = 4.40 Å and Z = 1. PXRD data comprising 75 powder patterns, measured at intervals of 3 K from 203 K upwards on a Siemens D5000 diffractometer with Cu Kα radiation over a 2θ range of 10–100°, were used (Herrmann & Engel, 1997[link]). Fig. 3.8.11[link](a) shows the data in the 2θ range 17–45°.

[Figure 3.8.11]

Figure 3.8.11

Ammonium nitrate phase transitions. (a) The raw powder data measured between 203 and 425 K. Reproduced with permission from Herrmann & Engel (1997[link]). Copyright (1997) John Wiley and Sons. (b) The MMDS plot. The purple line follows the temperature change from 203 to 425 K.

The visualization of these data following cluster analysis is shown in Fig. 3.8.11[link](b) using an MMDS plot on which has been superimposed a line showing the route followed by the temperature increments. The purple line follows the transition from a mixture of forms IV and V at low temperature (red) through form IV (yellow), form II (blue) and finally form I at high temperature (green). This is an elegant and concise representation of the data in a single diagram.

3.8.7. Quantitative analysis with high-throughput PXRD data without Rietveld refinement


Since mixtures are so common in high-throughput experiments, and indeed in many situations with multiple data sets, it is useful to have a method of automatic quantitative analysis. The quality of data that results from high-throughput crystallography makes it unlikely that an accuracy better than 5–10% can be achieved but, nonetheless, the identification of mixtures can be carried out by whole-profile matching. First a database of N pure phases is created, or, if that is not possible, then the most representative patterns with appropriate safeguards can be used. Assume that there is a sample pattern, S, which is considered to be a mixture of up to N components. S comprises m data points, S1, S2, …, Sm. The N patterns can be considered to make up fractions p1, p2, p3, …, pN of the sample pattern. The best possible combination of the database patterns to fit the sample pattern is required. A system of linear equations can be constructed in which x11 is measurement point 1 of pattern 1 etc.:[\eqalignno{x_{11}p_1 + x_{12}p_2 + x_{13}p_3 + &\ldots + x_{1N}p_N = S_1,&\cr x_{21}p_1 + x_{22}p_2 + x_{23}p_3 + &\ldots + x_{2N}p_N = S_2, \cr &\vdots&\cr x_{m1}p_1 + x_{m2}p_2 + x_{m3}p_3 +&\ldots + x_{mN}p_N = S_m.&(3.8.26)}]

Writing these in matrix form, we get[\left [{\matrix{ {{x_{11}}} & {{x_{12}}} & {{x_{13}}} & \ldots & {{x_{1N}}} \cr {{x_{21}}} & {{x_{22}}} & {{x_{23}}} & \ldots & {{x_{2N}}} \cr \vdots & \vdots & \vdots & \ddots& \vdots \cr {{x_{m1}}}& {{x_{m2}}}& {{x_{m3}}} & \cdots & {{x_{mN}}} \cr } } \right]\left [{\matrix{ {{p_1}} \cr {{p_2}} \cr \vdots \cr {{p_N}} \cr } } \right] = \left [{\matrix{ {{S_1}} \cr {{S_2}} \cr \vdots \cr {{S_m}} \cr } } \right] \eqno(3.8.27)]or[{\bf{x}}{\bf{p}}={\bf{S}}. \eqno(3.8.28)]A solution for p that minimizes[{\chi ^2} = {\left| {{\bf{x}} {\bf{p}}-{\bf{S}}} \right|^2} \eqno(3.8.29)]is required. Since [N\ll m], the system is heavily overdetermined, and least-squares or singular value decomposition can be used to solve (3.8.29)[link] for the fractions [p_1,p_2,\ldots, p_N] arising from the scattering power of the components of the mixture. The values of p can be used to calculate a weight fraction for that particular phase provided that the absorption coefficients are known, and this in turn requires the unit-cell dimensions and cell contents, but not the atomic coordinates (Smith et al., 1988[link]; Cressey & Schofield, 1996[link]). The general formula for the weight fraction of component n in a mixture comprising N components is (Leroux et al., 1953[link])[{c_n} = {p_n}{{{\mu ^*}} \over {\mu _n^*}}, \eqno(3.8.30)]where[{\mu ^*} = \textstyle\sum\limits_{j = 1}^N {{c_j}\mu _j^*} \eqno(3.8.31)]and[\mu _j^* = \mu _j/ \rho_j, \eqno(3.8.32)]where μj is the linear X-ray absorption coefficient and ρj is the density of component j. For polymorphs, the absorption coefficients are sufficiently close and the method sufficiently approximate that the effects of absorption can be ignored.
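Equations (3.8.26)–(3.8.29) amount to an ordinary overdetermined least-squares problem. A minimal sketch with synthetic, noise-free data follows; a fit to real profiles would also warrant non-negativity constraints on the fractions p:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_db = 500, 4                 # m data points, N = 4 reference patterns

# Columns of x are the database patterns x_1..x_N (synthetic here)
x = rng.random((m, n_db))
p_true = np.array([0.6, 0.4, 0.0, 0.0])
s = x @ p_true                   # sample pattern: a known 60/40 mixture

# Solve the overdetermined system x p = S in the least-squares sense,
# minimizing chi^2 = |x p - S|^2 as in equation (3.8.29)
p, *_ = np.linalg.lstsq(x, s, rcond=None)
```

With noise-free data the 60/40 composition is recovered exactly; with real data the accuracy limits quoted above (5–10%) apply, and the p values still need the absorption correction of equation (3.8.30) to become weight fractions.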

3.8.7.1. Example: inorganic mixtures


As an example, a set of 19 patterns from set 78 of the ICDD database for inorganic compounds (ICDD, 2018[link]) was imported into DIFFRAC.EVA. To these were added simulated mixture data generated by combining the patterns for lanthanum strontium copper oxide and caesium thiocyanate in the proportions 80/20, 60/40, 50/50, 40/60 and 20/80. Two calculations were performed: an analysis without the pure-phase database and a second in which the pure phases of lanthanum strontium copper oxide and caesium thiocyanate were present.

The results are shown in Fig. 3.8.12[link]. In the MMDS plot the green spheres represent pure lanthanum strontium copper oxide while the yellow spheres represent pure caesium thiocyanate. The red spheres represent mixtures of the two and form an arc between the green and yellow clusters. The distance of the spheres representing mixtures from the lanthanum strontium copper oxide and caesium thiocyanate spheres gives a semi-quantitative representation of the mixture contents. Running the analysis in quantitative mode gives the pie charts also shown in Fig. 3.8.12[link]; they reproduce exactly the relative proportions of the two components.

[Figure 3.8.12]

Figure 3.8.12

Identifying mixtures using lanthanum strontium copper oxide and caesium thiocyanate diffraction data taken from the ICDD Clay Minerals database. The green spheres represent pure phases of lanthanum strontium copper oxide and the yellow pure caesium thiocyanate. The red spheres represent mixtures of the two in the relative proportions of lanthanum strontium copper oxide/caesium thiocyanate 80/20, 60/40, 50/50, 40/60 and 20/80 in an arc commencing on the left-hand side of the diagram. The pie charts give the results of an independent quantitative calculation in which lanthanum strontium copper oxide and caesium thiocyanate have been included as pure phases in a reference database.

For further details of this method with organic samples, see Dong et al. (2008[link]).

3.8.8. Using spectroscopic data


There is no reason why the methodology described in this chapter cannot be used for other 1D data sets, e.g. Raman, IR, NMR and near-IR spectroscopies, although different data pre-processing is usually required. Raman spectroscopy is well suited to high-throughput screening: good-quality spectra can be collected in a few minutes, and sample preparation is straightforward and flexible, although the resulting spectra are not always as distinct as the PXRD equivalents (Mehrens et al., 2005[link]; Boccaleri et al., 2007[link]).

As an example we show the results of cluster analysis carried out on samples of carbamazepine, cimetidine, furosemide, mefenamic acid, phenylbutazone and sulfamerazine using Raman spectroscopy. A total of 74 samples were measured at room temperature on a LabRam HR-800/HTS-Multiwell spectrometer, equipped with a backscattering light-path system, a diode laser (785 nm, 300 mW) as the excitation source and an air-cooled charge-coupled device detector. A 20-fold super-long-working-distance objective lens was used to collect the backscattered light. The spectra were acquired with 5.84 cm−1 spectral width and at least 30 s exposure (Kojima et al., 2006[link]). The spectra had backgrounds subtracted but no other corrections were applied.

The initial clustering is shown in Fig. 3.8.13[link](a) with the default cut level in the dendrogram. There are six clusters: labelling from the left-hand side, the red are three polymorphs of carbamazepine; the orange are cimetidine; the green cluster contains three polymorphs of furosemide; the light blue contains three polymorphs of mefenamic acid; the dark blue contains phenylbutazone; and finally the purple cluster contains sulfamerazine. The MMDS plot gives a complementary visualization of the data that supports the clustering.

[Figure 3.8.13]

Figure 3.8.13

(a) The dendrogram generated from 74 Raman spectra without background corrections applied. Labelling from the left-hand side, the red samples are carbamazepine, the orange are cimetidine, the green are two forms of furosemide, the light blue is mefenamic acid, the dark blue is phenylbutazone and the purple at the right-hand side is sulfamerazine. (b) The MMDS plot. The sphere colours are taken from the dendrogram. This representation shows clearly discrete clusters in correspondence with those generated by the dendrogram.

It is also possible to use derivative data in place of the original spectra for clustering. The results for the 74 Raman spectra without initial background subtraction, followed by the generation of first-derivative data, are shown in Fig. 3.8.14[link]. The clusters are well defined, but now the carbamazepine data have split into two clusters. These correspond to forms I and III of carbamazepine, although the differences in the Raman spectra of these polymorphs are small (O'Brien et al., 2004[link]). At the same time, both furosemide and mefenamic acid are each split into two groups. This is probably the best description of the data in terms of clustering, with cluster membership corresponding to the chemical differences between the samples. The dendrogram also has the feature that the tie bars between samples are higher, i.e. the similarities are lower, reflecting the fact that the use of first derivatives accentuates small differences in the data.
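The effect of first derivatives on pattern similarity can be illustrated with two slightly shifted synthetic bands; the Gaussian band shapes and the 0.3-unit shift are assumptions for demonstration only:

```python
import numpy as np

t = np.linspace(-10, 10, 2001)

def band(center, width=1.0):
    """A synthetic Gaussian band standing in for a spectral peak."""
    return np.exp(-0.5 * ((t - center) / width) ** 2)

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

s1 = band(0.0)   # reference spectrum
s2 = band(0.3)   # the same band, slightly shifted (e.g. a close polymorph)

r_raw = pearson(s1, s2)                                  # raw spectra
r_der = pearson(np.gradient(s1, t), np.gradient(s2, t))  # first derivatives
# The derivative correlation is lower: the small shift is accentuated,
# which is why tie bars sit higher in the derivative dendrogram
```

This mirrors the behaviour described above: the raw spectra remain highly correlated, while the derivatives pull the similarity down and so separate closely related forms.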

[Figure 3.8.14]

Figure 3.8.14

Clustering the 74 Raman spectra without background corrections applied using first-derivative data. (a) The dendrogram. Labelling from the left-hand side, the red and orange entries are carbamazepine; the green are cimetidine; the light blue and dark blue are two forms of furosemide; the purple are sulfamerazine; the brown are phenylbutazone and the right-hand light and dark green are two forms of mefenamic acid. (b) The MMDS plot. The clusters are well defined but the orange and red (both carbamazepine) are very close to each other.

It is interesting to note that, in general, PXRD works less well with derivative data. The reason for this is not clear, but possibly the presence of partial overlapping peaks and the associated issues of peak shape are partly responsible.

3.8.9. Combining data types: the INDSCAL method


It is now common to collect more than one data type, and some instruments now exist for collecting spectroscopic and PXRD data on the same samples, for example the Bruker D8 Screenlab, which combines PXRD and Raman measurement for high-throughput screening (Boccaleri et al., 2007[link]).

A technique for combining the results of more than one data type is needed. One method would be to take individual distance matrices from each data type and generate an average distance matrix using equation (3.8.3)[link], but this leaves open the question of how best to define the associated weights in an optimal, objective way. Should, for example, PXRD be given a higher weight than Raman data? The individual differences scaling method (INDSCAL) of Carroll & Chang (1970[link]) provides an unbiased solution to this problem by, as the name suggests, scaling the differences between individual distance matrices.

In this method, let Dk be the squared distance matrix of dimension (n × n) for data type k with a total of K data types. For example, if we have PXRD, Raman and differential scanning calorimetry (DSC) data for each of n samples, then K = 3. A group-average matrix G (which we will specify in two dimensions) is required that best represents the combination of the K data types. To do this, the D matrices are first put into inner-product form by the double-centring operation to give[{\bf B}_k = - \textstyle{1\over 2}\left({\bf I}-{\bf N} \right){\bf D}_k\left({\bf I} - {\bf N} \right), \eqno(3.8.33)]where I is the identity matrix and N is the centring matrix 11′/n, in which 1 is a column vector of ones. The inner-product matrices thus generated are matched to the weighted form of the group average, G, which is unknown. To do this the function[S = \textstyle\sum\limits_1^K {\left\| {{{\bf{B}}_k} - {\bf{GW}}_k^2{{\bf{G}}^\prime}} \right\|} \eqno(3.8.34)]is minimized. The weight matrices, Wk, are scaled such that[\textstyle\sum\limits_{k = 1}^K {{\bf{W}}_k^2 = K{\bf{I}}}. \eqno(3.8.35)]The INDSCAL method employs an iterative technique to solve equation (3.8.34)[link] in which one parameter is kept fixed whilst the other is determined by least-squares refinement. An initial estimate for G is taken either from the average of the D matrices for each sample or as a random matrix. This is then used to estimate the weight matrices, and the whole process is repeated until a minimum value of S is obtained. The algorithm derived by Carroll and Chang was used in the example below. When random matrices are used to generate the initial G matrix, the INDSCAL procedure is repeated 100 times and the solution with the minimum value of S is kept. In practice, there is very little difference in the results of these two procedures. The resulting G matrix is treated as a standard distance matrix and used in the usual way to generate dendrograms, MMDS plots etc. 
The method has the property that where data types show samples to be very similar this is reinforced, whereas where there are considerable variations the differences are accentuated in the final G matrix. For a fuller description of the INDSCAL method with examples see Gower & Dijksterhuis (2004[link]), Section 13.2, and for a useful geometric interpretation see Husson & Pagès (2006[link]).
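The double-centring step of equation (3.8.33) can be sketched as follows; the centred test coordinates are illustrative, and the full INDSCAL iteration over G and the Wk is not shown:

```python
import numpy as np

def double_centre(d2):
    """Equation (3.8.33): convert an (n x n) squared-distance matrix into
    inner-product form, B = -1/2 (I - N) D (I - N) with N = 11'/n."""
    n = d2.shape[0]
    cen = np.eye(n) - np.ones((n, n)) / n   # I - N
    return -0.5 * cen @ d2 @ cen

# Squared distances computed from known centred coordinates x;
# for such data B recovers the inner products exactly, B = x x'
x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
d2 = ((x[:, None] - x[None, :]) ** 2).sum(axis=-1)
b = double_centre(d2)
```

Each Bk produced this way is then matched against the weighted group average G Wk² G′ in the iterative fit of equation (3.8.34).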

3.8.9.1. An example combining PXRD and Raman data


We now present an example of the INDSCAL method applied to data collected on sulfathiazole using PXRD and Raman spectroscopy (Barr, Cunningham et al., 2009[link]). A flowchart is shown in Fig. 3.8.15[link]. Three polymorphs of sulfathiazole were prepared and PXRD data were collected on a Bruker C2 GADDS system. Each sample was run for 2 min over a 3–30° range in 2θ using Cu Kα radiation. Raman data were collected on a Bruker SENTINEL. The Raman probe was integrated into the PXRD instrument.

[Figure 3.8.15]

Figure 3.8.15

A flowchart for the INDSCAL method using Raman and PXRD data. Note that any combination of any 1D data can be used here.

The only data pre-processing performed was background removal. Fig. 3.8.16[link](a) shows the resulting dendrogram (with the default cut level) and Fig. 3.8.16[link](b) shows the corresponding MMDS plot. Each sample is identified by a four-digit code: the first two digits are the well number, and the last digit defines whether the sample is form 2, 3 or 4 of sulfathiazole. It can be seen that the clustering is only partly successful: form 4 (red) is correctly clustered, but form 3 (orange) is split into five clusters and form 2 into three clusters.

[Figure 3.8.16]

Figure 3.8.16 | top | pdf |

Clustering 48 PXRD patterns, with background corrections applied, for three polymorphs of sulfathiazole. (a) The dendrogram. Each sample is identified by a four-digit code: the first two digits are the well number, and the last digit defines whether the sample is form 2, 3 or 4 of sulfathiazole. (b) The MMDS plot: the red cluster is well defined, but the rest of the spheres are diffuse and intermingled. (c) The dendrogram derived from clustering 48 Raman spectra of sulfathiazole with background corrections applied. (d) The corresponding MMDS plot. The clusters are poorly defined. (e) The results of the INDSCAL method. The dendrogram is shown with the default cut level. The clustering is almost correct: all the samples are placed in the correct group except for patterns 35-2 and 45-2. (f) The MMDS plot validates the dendrogram. (g) The Raman patterns for 35-2 and 45-2 superimposed. They are primarily background noise.

Fig. 3.8.16[link](c) shows the clustering from the Raman spectra. The results are poor: most of form 2 is correctly clustered, but forms 4 and 3 are intermixed, and the MMDS plot in Fig. 3.8.16[link](d) is diffuse with little structure.

The INDSCAL method is now applied starting from random G matrices and the results are shown in Fig. 3.8.16[link](e) and (f) with the dendrogram cut level at its default value. The clustering is almost correct; all the samples are placed in the correct groups except that there are two outliers coloured in blue. Fig. 3.8.16[link](g) shows the Raman patterns for these samples: they are primarily background with very little usable signal.

3.8.10. Quality control

| top | pdf |

Quality control (Gilmore, Barr & Paisley, 2009[link]) is designed for situations where the stability of a material is being monitored over time, for example as part of a production-line system, or for periodic equipment alignment. A set of reference patterns is collected that represents acceptable measurements – any measurement sufficiently close to these references represents a good measurement. Various sample patterns are then imported and compared with those reference patterns, and any that vary significantly from the ideal are noted and highlighted.

The results are best displayed graphically using a variant of the MMDS method, of which an example is shown in Fig. 3.8.17[link]. The reference patterns define a green shaded surface with acceptable sample patterns, coloured red, shown within it, and potentially problematic sample patterns appearing outside it. The volume of the green shape is defined by intersecting spheres around each reference sample and these can be altered to allow more- or less-stringent quality control.
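The acceptance test described above can be sketched as follows. The distance measure (1 minus the Pearson correlation between full profiles) and the single sphere radius are illustrative assumptions for the sketch, not the exact metric or geometry used by the commercial software.

```python
import numpy as np

def qc_flag(references, samples, radius):
    """Flag samples lying outside every reference sphere: a sample passes if
    it is within `radius` of at least one reference pattern. Smaller radii
    give more stringent quality control. Illustrative sketch only."""
    flags = []
    for s in samples:
        d = [1.0 - np.corrcoef(s, r)[0, 1] for r in references]
        flags.append(bool(min(d) > radius))   # outside all spheres -> flagged
    return flags
```

In use, a flagged sample corresponds to a red sphere falling outside the green reference volume in the modified MMDS plot.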

[Figure 3.8.17]

Figure 3.8.17 | top | pdf |

Visualization tools for quality-control procedures using a modified MMDS plot. The red outlier is a sample unacceptably far from the cluster of reference measurements.

3.8.11. Computer software

| top | pdf |

These calculations can be carried out using MATLAB (http://www.mathworks.co.uk/products/matlab/) or the open-source R software (http://www.r-project.org/; Crawley, 2007[link]) with graphics using the GGobi software (Cook & Swayne, 2007[link]). There are four commercial packages for handling powder data: DIFFRAC.EVA and PolySNAP 3, both from Bruker (http://www.bruker.com/; Gilmore, Barr & Paisley, 2004[link]; Barr, Dong & Gilmore, 2009[link]), Jade from Materials Data Inc. (http://www.materialsdata.com) and HighScore from Malvern PANalytical (http://www.panalytical.com/).

Acknowledgements

Data were kindly provided by Gordon Cunningham at Glasgow University, Arnt Kern of Bruker AXS in Karlsruhe and Michael Herrmann at the Fraunhofer Institute, Germany.

References

Alvarez, A. J., Singh, A. & Myerson, A. S. (2009). Polymorph screening: comparing a semi-automated approach with a high throughput method. Cryst. Growth Des. 9, 4181–4188.
Asimov, D. (1985). The grand tour: a tool for viewing multidimensional data. SIAM J. Sci. Stat. Comput. 6, 128–143.
Barr, G., Cunningham, G., Dong, W., Gilmore, C. J. & Kojima, T. (2009). High-throughput powder diffraction V: the use of Raman spectroscopy with and without X-ray powder diffraction data. J. Appl. Cryst. 42, 706–714.
Barr, G., Dong, W. & Gilmore, C. J. (2004a). High-throughput powder diffraction. II. Applications of clustering methods and multivariate data analysis. J. Appl. Cryst. 37, 243–252.
Barr, G., Dong, W. & Gilmore, C. J. (2004b). High-throughput powder diffraction. IV. Cluster validation using silhouettes and fuzzy clustering. J. Appl. Cryst. 37, 874–882.
Barr, G., Dong, W. & Gilmore, C. J. (2004c). PolySNAP: a computer program for analysing high-throughput powder diffraction data. J. Appl. Cryst. 37, 658–664.
Barr, G., Dong, W. & Gilmore, C. J. (2009). PolySNAP3: a computer program for analysing and visualizing high-throughput data from diffraction and spectroscopic sources. J. Appl. Cryst. 42, 965–974.
Barr, G., Dong, W., Gilmore, C. & Faber, J. (2004). High-throughput powder diffraction. III. The application of full-profile pattern matching and multivariate statistical analysis to round-robin-type data sets. J. Appl. Cryst. 37, 635–642.
Barr, G., Gilmore, C. J. & Paisley, J. (2004). SNAP-1D: a computer program for qualitative and quantitative powder diffraction pattern analysis using the full pattern profile. J. Appl. Cryst. 37, 665–668.
Boccaleri, E., Carniato, F., Croce, G., Viterbo, D., van Beek, W., Emerich, H. & Milanesio, M. (2007). In situ simultaneous Raman/high-resolution X-ray powder diffraction study of transformations occurring in materials at non-ambient conditions. J. Appl. Cryst. 40, 684–693.
Bruker (2018). DIFFRAC.EVA: software to evaluate X-ray diffraction data. Version 4.3. https://www.bruker.com/eva.
Butler, B. M., Sila, A., Nyambura, K. D., Gilmore, C. J., Kourkoumelis, N. & Hillier, S. (2019). Pre-treatment of soil X-ray powder diffraction data for cluster analysis. Geoderma, 337, 413–424.
Caliński, T. & Harabasz, J. (1974). A dendrite method for cluster analysis. Commun. Stat. 3, 1–27.
Carroll, J. D. & Chang, J. J. (1970). Analysis of individual differences in multidimensional scaling via an N-way generalization of `Eckart–Young' decomposition. Psychometrika, 35, 283–319.
Cook, D. & Swayne, D. F. (2007). Interactive and Dynamic Graphics for Data Analysis with R and GGobi. New York: Springer.
Cox, T. F. & Cox, M. A. A. (2000). Multidimensional Scaling, 2nd ed. Boca Raton: Chapman & Hall/CRC.
Crawley, M. J. (2007). The R Book. Chichester: Wiley.
Cressey, G. & Schofield, P. F. (1996). Rapid whole-pattern profile-stripping method for the quantification of multiphase samples. Powder Diffr. 11, 35–39.
Dong, W., Gilmore, C., Barr, G., Dallman, C., Feeder, N. & Terry, S. (2008). A quick method for the quantitative analysis of mixtures. 1. Powder X-ray diffraction. J. Pharm. Sci. 97, 2260–2276.
Donoho, D. L. & Johnstone, I. M. (1995). Adapting to unknown smoothness via wavelet shrinkage. J. Am. Stat. Assoc. 90, 1200–1224.
Everitt, B. S., Landau, S. & Leese, M. (2001). Cluster Analysis, 4th ed. London: Arnold.
Gilmore, C. J., Barr, G. & Paisley, J. (2004). High-throughput powder diffraction. I. A new approach to qualitative and quantitative powder diffraction pattern analysis using full pattern profiles. J. Appl. Cryst. 37, 231–242.
Goodman, L. A. & Kruskal, W. H. (1954). Measures of association for cross-classifications. J. Am. Stat. Assoc. 49, 732–764.
Gordon, A. D. (1981). Classification, 1st ed., pp. 46–49. London: Chapman and Hall.
Gordon, A. D. (1999). Classification, 2nd ed. Boca Raton: Chapman and Hall/CRC.
Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53, 325–328.
Gower, J. C. & Dijksterhuis, G. B. (2004). Procrustes Problems. Oxford University Press.
Herrmann, M. J. & Engel, W. (1997). Phase transitions and lattice dynamics of ammonium nitrate. Propellants Explos. Pyrotech. 22, 143–147.
Husson, F. & Pagès, J. (2006). INDSCAL model: geometrical interpretation and methodology. Comput. Stat. Data Anal. 50, 358–378.
ICDD (2018). The Powder Diffraction File. International Centre for Diffraction Data, 12 Campus Boulevard, Newtown Square, Pennsylvania 19073-3273, USA.
Inselberg, A. (1985). The plane with parallel coordinates. Vis. Comput. 1, 69–91.
Inselberg, A. (2009). Parallel Coordinates: Visual Multidimensional Geometry and its Applications. New York: Springer.
Kaufman, L. & Rousseeuw, P. J. (1990). Finding Groups in Data. New York: Wiley.
Kojima, T., Onoue, S., Murase, N., Katoh, F., Mano, T. & Matsuda, Y. (2006). Crystalline form information from multiwell plate salt screening by use of Raman microscopy. Pharm. Res. 23, 806–812.
Kruskal, J. B. (1956). On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Am. Math. Soc. 7, 48–50.
Lance, G. N. & Williams, W. T. (1967). A general theory of classificatory sorting strategies: 1. Hierarchical systems. Comput. J. 9, 373–380.
Leroux, J., Lennox, D. H. & Kay, K. (1953). Direct quantitative X-ray analysis by diffraction absorption technique. Anal. Chem. 25, 740–743.
Mehrens, S. M., Kale, U. J. & Qu, X. (2005). Statistical analysis of differences in the Raman spectra of polymorphs. J. Pharm. Sci. 94, 1354–1367.
Milligan, G. W. & Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50, 159–179.
Nelder, J. A. & Mead, R. (1965). A simplex method for function minimization. Comput. J. 7, 308–313.
O'Brien, L. E., Timmins, P., Williams, A. C. & York, P. (2004). Use of in situ FT-Raman spectroscopy to study the kinetics of the transformation of carbamazepine polymorphs. J. Pharm. Biomed. Anal. 36, 335–340.
Ogden, R. T. (1997). Essential Wavelets for Statistical Applications and Data Analysis, pp. 144–148. Boston: Birkhäuser.
Press, W. H., Teukolsky, S. A., Vetterling, W. T. & Flannery, B. P. (2007). Numerical Recipes, 3rd ed. Cambridge University Press.
Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65.
Sato, M., Sato, Y. & Jain, L. C. (1997). Fuzzy Clustering Models and Applications. New York: Physica-Verlag.
Smith, D. K., Johnson, G. G. & Wims, A. M. (1988). Use of full diffraction spectra, both experimental and calculated, in quantitative powder diffraction analysis. Aust. J. Phys. 41, 311–321.
Smrčok, Ĺ., Ďurík, M. & Jorík, V. (1999). Wavelet denoising of powder diffraction patterns. Powder Diffr. 14, 300–304.
Spearman, C. (1904). The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101.
Storey, R., Docherty, R., Higginson, P., Dallman, C., Gilmore, C., Barr, G. & Dong, W. (2004). Automation of solid form screening procedures in the pharmaceutical industry – how to avoid the bottlenecks. Crystallogr. Rev. 10, 45–56.
Wegman, E. J. (1990). Hyperdimensional data analysis using parallel coordinates. J. Am. Stat. Assoc. 85, 664–675.







































