International Tables for Crystallography, Volume H, Powder diffraction. Edited by C. J. Gilmore, J. A. Kaduk and H. Schenk. © International Union of Crystallography 2018
International Tables for Crystallography (2018). Vol. H, ch. 3.8, pp. 331-333
Section 3.8.5. Further validating and visualizing clusters: silhouettes and fuzzy clustering
^{a}Department of Chemistry, University of Glasgow, University Avenue, Glasgow, G12 8QQ, UK
Other techniques exist to validate the clusters, and these are discussed here.
Silhouettes (Rousseeuw, 1987; Kaufman & Rousseeuw, 1990) are a property of every member of a cluster and define a coefficient of cluster membership. They are computed from the dissimilarity matrix, δ. If pattern i belongs to cluster C_{r}, which contains n_{r} patterns, we define

a_{i} = \frac{1}{n_{r} - 1} \sum_{j \in C_{r},\, j \neq i} \delta_{ij}.

This is the average dissimilarity of pattern i to all the other patterns in cluster C_{r}. Further define

b_{i} = \min_{s \neq r} \left\{ \frac{1}{n_{s}} \sum_{j \in C_{s}} \delta_{ij} \right\}.

The silhouette for pattern i is then

h_{i} = \frac{b_{i} - a_{i}}{\max(a_{i}, b_{i})}.

Clearly −1 ≤ h_{i} ≤ 1. It is not possible to define silhouettes for clusters with only one member (singleton clusters). Silhouettes are displayed such that each cluster is represented as a histogram of frequency plotted against silhouette value, so that one can look for outliers or poorly connected patterns.
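The silhouette calculation can be sketched in a few lines of Python. This is a minimal illustration of the Rousseeuw (1987) definition, not the PolySNAP implementation. For each pattern i it forms the mean dissimilarity to the other members of its own cluster (called `a_i` here), the smallest mean dissimilarity to the members of any other cluster (`b_i`), and the ratio (b_i − a_i)/max(a_i, b_i). The function name `silhouettes`, the list-of-lists layout of the dissimilarity matrix, the example values, and the choice of returning `None` for singleton clusters are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Sequence


def silhouettes(delta: Sequence[Sequence[float]],
                labels: Sequence[int]) -> List[Optional[float]]:
    """Silhouette of every pattern; None where it is undefined.

    delta  : symmetric dissimilarity matrix (n x n), delta[i][i] == 0
    labels : cluster label of each of the n patterns
    """
    # Collect the pattern indices belonging to each cluster.
    clusters: Dict[int, List[int]] = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    result: List[Optional[float]] = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        # Undefined for singleton clusters, and when there is no other
        # cluster to compare against.
        if len(own) == 1 or len(clusters) < 2:
            result.append(None)
            continue
        # Mean dissimilarity to the other members of the same cluster.
        a_i = sum(delta[i][j] for j in own if j != i) / (len(own) - 1)
        # Smallest mean dissimilarity to the members of any other cluster.
        b_i = min(
            sum(delta[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a_i, b_i)
        result.append((b_i - a_i) / denom if denom > 0 else 0.0)
    return result


# Hypothetical dissimilarities for four patterns in two clusters.
delta = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
labels = [0, 0, 1, 1]
h = silhouettes(delta, labels)
```

With these made-up values every pattern is much closer to its own cluster than to the other one, so all four silhouettes lie well above zero. Binning the values of `h` cluster by cluster would give the kind of histogram described above.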
From our experience with powder data collected in reflection mode on both organic and inorganic samples (Barr et al., 2004b), we conclude that for any given pattern
The use of silhouettes in defining the details of the clustering is shown for the aspirin data in Fig. 3.8.8. The silhouettes for the red cluster corresponding to the dendrogram in Fig. 3.8.6(a) are shown in Fig. 3.8.8(a) and those for the corresponding orange cluster are shown in Fig. 3.8.8(b). Both sets of silhouettes have values < 0.5, which indicates that the clustering is not optimally defined. When the cut line is moved to give the dendrogram in Fig. 3.8.6(c), the silhouettes for the red cluster are shown in Fig. 3.8.8(c). The entry centred on a silhouette value of 0.15 is pattern 3. This implies that pattern 3 is only loosely connected to the cluster and this is demonstrated in Fig. 3.8.8(d) where pattern 3 and the most representative pattern for the cluster (No. 9) are superimposed. Although there is a general sense of similarity there are significant differences and the combined correlation coefficient is only 0.62. In Fig. 3.8.8(e), the silhouettes for the orange cluster are shown. They imply that this is a single cluster without outliers. The silhouettes for the green cluster corresponding to the dendrogram in Fig. 3.8.6(c) are shown in Fig. 3.8.8(f). The clustering is poorly defined here.
In standard clustering methods a set of n diffraction patterns is partitioned into c disjoint clusters. Cluster membership is defined via a membership matrix U(n × c), where the individual coefficients, u_{ik}, represent the membership of pattern i in cluster k. The coefficients are equal to unity if i belongs to cluster k and zero otherwise, i.e.

u_{ik} \in \{0, 1\} \quad (i = 1, \ldots, n;\ k = 1, \ldots, c).

If these constraints are relaxed, such that

0 \leq u_{ik} \leq 1 \quad (i = 1, \ldots, n;\ k = 1, \ldots, c), \qquad 0 < \sum_{i=1}^{n} u_{ik} < n \quad (k = 1, \ldots, c)

and

\sum_{k=1}^{c} u_{ik} = 1 \quad (i = 1, \ldots, n),

then fuzzy clusters are generated, in which a pattern can belong to more than one cluster (see, for example, Everitt et al., 2001; Sato et al., 1997). Such a situation is quite feasible in powder diffraction, for example when mixtures are involved. It is described in detail by Barr et al. (2004b).
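To make the idea of a fuzzy membership matrix concrete, the sketch below builds one and checks it against the usual fuzzy-partition conditions: every coefficient lies between 0 and 1, each pattern's memberships sum to 1, and no cluster is empty or holds all the membership. The memberships are generated with the standard fuzzy c-means update, u_{ik} = d_{ik}^{-p} / Σ_{l} d_{il}^{-p} with p = 2/(m − 1). That choice is an assumption made for illustration, and so are the function names, the fuzzifier m = 2 and the distance values; this is not necessarily the scheme used by Barr et al. (2004b).

```python
from typing import List, Sequence


def fuzzy_memberships(dist: Sequence[Sequence[float]],
                      m: float = 2.0) -> List[List[float]]:
    """Fuzzy c-means style memberships from pattern-to-cluster distances.

    dist[i][k] is the distance of pattern i from a representative of
    cluster k; m > 1 is the fuzzifier (m = 2 is a common default).
    """
    if m <= 1.0:
        raise ValueError("fuzzifier m must exceed 1")
    power = 2.0 / (m - 1.0)
    u: List[List[float]] = []
    for row in dist:
        zeros = [k for k, d in enumerate(row) if d == 0.0]
        if zeros:
            # Pattern coincides with one or more representatives:
            # share full membership among them.
            u.append([1.0 / len(zeros) if k in zeros else 0.0
                      for k in range(len(row))])
            continue
        weights = [d ** -power for d in row]
        total = sum(weights)
        u.append([w / total for w in weights])
    return u


def is_fuzzy_partition(u: Sequence[Sequence[float]],
                       tol: float = 1e-9) -> bool:
    """Check the relaxed (fuzzy) membership constraints on U (n x c)."""
    n, c = len(u), len(u[0])
    in_range = all(-tol <= x <= 1.0 + tol for row in u for x in row)
    rows_sum_to_one = all(abs(sum(row) - 1.0) <= tol for row in u)
    no_empty_or_full = all(
        0.0 < sum(row[k] for row in u) < n for k in range(c)
    )
    return in_range and rows_sum_to_one and no_empty_or_full


# Hypothetical distances of three patterns from two cluster representatives.
# The third pattern is equidistant from both, as a 50:50 mixture might be.
dist = [
    [0.1, 0.9],
    [0.8, 0.2],
    [0.5, 0.5],
]
U = fuzzy_memberships(dist)
valid = is_fuzzy_partition(U)
```

A hard partition such as `[[1, 0], [0, 1], [1, 0]]` also passes `is_fuzzy_partition`, since the crisp case is a special case of the relaxed constraints. A matrix that leaves a cluster empty, such as `[[1, 0], [1, 0]]`, fails.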
All these techniques have been incorporated into the PolySNAP computer program (Barr et al., 2004a,b,c; Barr, Dong, Gilmore & Faber, 2004; Barr, Dong & Gilmore, 2009), which was developed from the SNAP-1D software (Barr, Gilmore & Paisley, 2004). PolySNAP has subsequently been incorporated into the Bruker DIFFRAC.EVA program (Bruker, 2018), and the following sections are based on its use.
References
Barr, G., Dong, W. & Gilmore, C. J. (2004a). High-throughput powder diffraction. II. Applications of clustering methods and multivariate data analysis. J. Appl. Cryst. 37, 243–252.
Barr, G., Dong, W. & Gilmore, C. J. (2004b). High-throughput powder diffraction. IV. Cluster validation using silhouettes and fuzzy clustering. J. Appl. Cryst. 37, 874–882.
Barr, G., Dong, W. & Gilmore, C. J. (2009). PolySNAP3: a computer program for analysing and visualizing high-throughput data from diffraction and spectroscopic sources. J. Appl. Cryst. 42, 965–974.
Barr, G., Dong, W., Gilmore, C. & Faber, J. (2004). High-throughput powder diffraction. III. The application of full-profile pattern matching and multivariate statistical analysis to round-robin-type data sets. J. Appl. Cryst. 37, 635–642.
Barr, G., Gilmore, C. J. & Paisley, J. (2004). SNAP-1D: a computer program for qualitative and quantitative powder diffraction pattern analysis using the full pattern profile. J. Appl. Cryst. 37, 665–668.
Bruker (2018). DIFFRAC.EVA: software to evaluate X-ray diffraction data. Version 4.3. https://www.bruker.com/eva.
Everitt, B. S., Landau, S. & Leese, M. (2001). Cluster Analysis, 4th ed. London: Arnold.
Kaufman, L. & Rousseeuw, P. J. (1990). Finding Groups in Data. New York: Wiley.
Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65.
Sato, M., Sato, Y. & Jain, L. C. (1997). Fuzzy Clustering Models and Applications. New York: Physica-Verlag.