International Tables for Crystallography
Volume H: Powder diffraction
Edited by C. J. Gilmore, J. A. Kaduk and H. Schenk

International Tables for Crystallography (2018). Vol. H, ch. 3.8, pp. 331-333

Section 3.8.5. Further validating and visualizing clusters: silhouettes and fuzzy clustering

C. J. Gilmore,^a G. Barr^a and W. Dong^a*

^a Department of Chemistry, University of Glasgow, University Avenue, Glasgow, G12 8QQ, UK
Correspondence e-mail: chris@chem.gla.ac.uk

Other techniques exist to validate the clusters, and these are discussed here.

3.8.5.1. Silhouettes

Silhouettes (Rousseeuw, 1987; Kaufman & Rousseeuw, 1990) are a property of every member of a cluster and define a coefficient of cluster membership. To compute them, the dissimilarity matrix, δ, is used. If pattern i belongs to cluster [C_r], which contains [n_r] patterns, we define

[a_i = \frac{1}{n_r - 1} \sum_{j \in C_r,\; j \ne i} \delta_{ij}. \eqno(3.8.19)]

This is the average dissimilarity of pattern i to all the other patterns in cluster [C_r]. Further define

[b_i = \min_{s \ne r} \left\{ \frac{1}{n_s} \sum_{j \in C_s} \delta_{ij} \right\}, \eqno(3.8.20)]

which is the smallest average dissimilarity of pattern i to the patterns of any other cluster. The silhouette for pattern i is then

[h_i = \frac{b_i - a_i}{\max(a_i, b_i)}. \eqno(3.8.21)]

Clearly [-1 \le h_i \le 1]. It is not possible to define silhouettes for clusters with only one member (singleton clusters). Silhouettes are displayed such that each cluster is represented as a histogram of frequency plotted against silhouette value, so that one can look for outliers or poorly connected patterns.
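
The definitions in equations (3.8.19) to (3.8.21) can be sketched in a few lines of Python. This is a minimal illustration, not the PolySNAP implementation: the function name `silhouettes`, the list-of-lists layout of the dissimilarity matrix and the integer cluster labels are all assumptions made for the example. Patterns in singleton clusters are returned as `None`, since no silhouette is defined for them; returning `None` when there is only one cluster, and 0.0 when [a_i] and [b_i] are both zero, are further assumptions.

```python
from typing import Dict, List, Optional, Sequence


def silhouettes(delta: Sequence[Sequence[float]],
                labels: Sequence[int]) -> List[Optional[float]]:
    """Return h_i for every pattern, or None where h_i is undefined."""
    # Group pattern indices by their (hypothetical) integer cluster label.
    clusters: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    result: List[Optional[float]] = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        # Undefined for singleton clusters; also skipped (assumption)
        # when there is no second cluster from which to compute b_i.
        if len(own) == 1 or len(clusters) < 2:
            result.append(None)
            continue
        # a_i: mean dissimilarity to the other members of the same
        # cluster, equation (3.8.19).
        a_i = sum(delta[i][j] for j in own if j != i) / (len(own) - 1)
        # b_i: smallest mean dissimilarity to any other cluster,
        # equation (3.8.20).
        b_i = min(sum(delta[i][j] for j in members) / len(members)
                  for other, members in clusters.items() if other != lab)
        # h_i, equation (3.8.21); 0.0 if a_i = b_i = 0 (assumption).
        denom = max(a_i, b_i)
        result.append((b_i - a_i) / denom if denom > 0 else 0.0)
    return result


# Invented 4 x 4 dissimilarity matrix: patterns 0 and 1 are alike,
# as are patterns 2 and 3.
delta = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
h = silhouettes(delta, [0, 0, 1, 1])
```

For this invented matrix all four silhouettes exceed 0.5, as expected for two well separated pairs.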

From our experience with powder data collected in reflection mode on both organic and inorganic samples (Barr et al., 2004b), we conclude that for any given pattern:

  • (1) [h_i > 0.5] implies that pattern i is probably correctly classified;

  • (2) [0.2 < h_i < 0.5] implies that pattern i should be inspected, since it may belong to a different or new cluster;

  • (3) [h_i < 0.2] implies that pattern i belongs to a different or new cluster.
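
The three rules above amount to a simple threshold test, sketched below. The function name `classify_silhouette` and the returned strings are hypothetical. The rules use strict inequalities and so leave the values 0.5 and 0.2 unassigned; placing both in the "inspect" band is an assumption of this sketch, not part of the published guidance.

```python
def classify_silhouette(h: float) -> str:
    """Map a silhouette value h_i to one of the three empirical categories."""
    if not -1.0 <= h <= 1.0:
        raise ValueError("a silhouette must lie between -1 and 1")
    if h > 0.5:
        return "probably correctly classified"
    if h >= 0.2:
        # Includes the boundary values 0.2 and 0.5 (assumption).
        return "inspect: may belong to a different or new cluster"
    return "belongs to a different or new cluster"


# Invented silhouette values, one per category.
for value in (0.8, 0.35, 0.1):
    print(value, "->", classify_silhouette(value))
```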

The use of silhouettes in defining the details of the clustering is shown for the aspirin data in Fig. 3.8.8. The silhouettes for the red cluster corresponding to the dendrogram in Fig. 3.8.6(a) are shown in Fig. 3.8.8(a) and those for the corresponding orange cluster are shown in Fig. 3.8.8(b). Both sets of silhouettes have values < 0.5, which indicates that the clustering is not optimally defined. When the cut line is moved to give the dendrogram in Fig. 3.8.6(c), the silhouettes for the red cluster become those shown in Fig. 3.8.8(c). The entry centred on a silhouette value of 0.15 is pattern 3. This implies that pattern 3 is only loosely connected to the cluster, and this is demonstrated in Fig. 3.8.8(d), where pattern 3 and the most representative pattern for the cluster (No. 9) are superimposed. Although there is a general sense of similarity, there are significant differences and the combined correlation coefficient is only 0.62. In Fig. 3.8.8(e), the silhouettes for the orange cluster are shown. They imply that this is a single cluster without outliers. The silhouettes for the green cluster corresponding to the dendrogram in Fig. 3.8.6(c) are shown in Fig. 3.8.8(f). The clustering is poorly defined here.

3.8.5.2. Fuzzy clustering

In standard clustering methods a set of n diffraction patterns is partitioned into c disjoint clusters. Cluster membership is defined via a membership matrix U (n × c), where the individual coefficients, [u_{ik}], represent the membership of pattern i in cluster k. The coefficients are equal to unity if pattern i belongs to cluster k and zero otherwise, i.e.

[u_{ik} \in \{0, 1\} \quad (i = 1, \ldots, n;\; k = 1, \ldots, c). \eqno(3.8.22)]

If these constraints are relaxed, such that

[0 \le u_{ik} \le 1 \quad (i = 1, \ldots, n;\; k = 1, \ldots, c), \eqno(3.8.23)]

[0 < \sum_{i=1}^{n} u_{ik} < n \quad (k = 1, \ldots, c) \eqno(3.8.24)]

and

[\sum_{k=1}^{c} u_{ik} = 1, \eqno(3.8.25)]

then fuzzy clusters are generated, in which a pattern can belong to more than one cluster (see, for example, Everitt et al., 2001; Sato et al., 1966). Such a situation is quite feasible in powder diffraction, for example when mixtures are involved. It is described in detail by Barr et al. (2004b).
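
The constraints (3.8.23) to (3.8.25) can be checked mechanically for any candidate membership matrix. The sketch below is illustrative only and does not show how fuzzy memberships are computed, which is described by Barr et al. (2004b). The function names `is_fuzzy_partition` and `harden`, the row-per-pattern list layout and the numerical tolerance `tol` are assumptions of the example.

```python
from typing import List, Sequence


def is_fuzzy_partition(u: Sequence[Sequence[float]], tol: float = 1e-9) -> bool:
    """Check an n x c membership matrix against (3.8.23) to (3.8.25)."""
    n = len(u)
    if n == 0:
        return False
    c = len(u[0])
    if c == 0 or any(len(row) != c for row in u):
        return False
    # (3.8.23): every coefficient lies between 0 and 1.
    if any(x < -tol or x > 1.0 + tol for row in u for x in row):
        return False
    # (3.8.25): the memberships of each pattern sum to unity.
    if any(abs(sum(row) - 1.0) > tol for row in u):
        return False
    # (3.8.24): no cluster is empty and none holds all n patterns outright.
    for k in range(c):
        column_sum = sum(row[k] for row in u)
        if not tol < column_sum < n - tol:
            return False
    return True


def harden(u: Sequence[Sequence[float]]) -> List[int]:
    """Assign each pattern to its cluster of largest membership.

    Ties go to the lowest cluster index (an arbitrary choice).
    """
    return [max(range(len(row)), key=lambda k: row[k]) for row in u]


# Invented memberships for four patterns and two clusters; the third
# row mimics a mixture that belongs partly to both clusters.
U = [
    [1.00, 0.00],
    [0.90, 0.10],
    [0.55, 0.45],
    [0.00, 1.00],
]
print(is_fuzzy_partition(U), harden(U))
```

A crisp matrix obeying (3.8.22) with at least two non-empty clusters also passes the check, since the fuzzy constraints are a relaxation of the crisp ones.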

3.8.5.3. The PolySNAP program and DIFFRAC.EVA

All these techniques have been incorporated into the PolySNAP computer program (Barr et al., 2004a,b,c; Barr, Dong, Gilmore & Faber, 2004; Barr, Dong & Gilmore, 2009), which was developed from the SNAP-1D software (Barr, Gilmore & Paisley, 2004). PolySNAP has subsequently been incorporated into the Bruker DIFFRAC.EVA program (Bruker, 2018), and the following sections are based on its use.

References

Barr, G., Dong, W. & Gilmore, C. J. (2004a). High-throughput powder diffraction. II. Applications of clustering methods and multivariate data analysis. J. Appl. Cryst. 37, 243–252.
Barr, G., Dong, W. & Gilmore, C. J. (2004b). High-throughput powder diffraction. IV. Cluster validation using silhouettes and fuzzy clustering. J. Appl. Cryst. 37, 874–882.
Barr, G., Dong, W. & Gilmore, C. J. (2009). PolySNAP3: a computer program for analysing and visualizing high-throughput data from diffraction and spectroscopic sources. J. Appl. Cryst. 42, 965–974.
Barr, G., Dong, W., Gilmore, C. & Faber, J. (2004). High-throughput powder diffraction. III. The application of full-profile pattern matching and multivariate statistical analysis to round-robin-type data sets. J. Appl. Cryst. 37, 635–642.
Barr, G., Gilmore, C. J. & Paisley, J. (2004). SNAP-1D: a computer program for qualitative and quantitative powder diffraction pattern analysis using the full pattern profile. J. Appl. Cryst. 37, 665–668.
Bruker (2018). DIFFRAC.EVA: software to evaluate X-ray diffraction data. Version 4.3. https://www.bruker.com/eva
Everitt, B. S., Landau, S. & Leese, M. (2001). Cluster Analysis, 4th ed. London: Arnold.
Kaufman, L. & Rousseeuw, P. J. (1990). Finding Groups in Data. New York: Wiley.
Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65.
Sato, M., Sato, Y. & Jain, L. C. (1966). Fuzzy Clustering Models and Applications. New York: Physica-Verlag.