cluster.stats {fpc} | R Documentation |
Computes a number of distance-based statistics that can be used for cluster validation, for comparing clusterings, and for deciding on the number of clusters: cluster sizes, cluster diameters, average distances within and between clusters, cluster separation, average silhouette widths, the best distance-based statistics for deciding on the number of clusters in the study of Milligan and Cooper (1985), Hubert's gamma coefficient, the Dunn index, and the corrected Rand index for assessing the similarity of two clusterings.
cluster.stats(d, clustering, alt.clustering=NULL, silhouette=TRUE, G2=FALSE, G3=FALSE)
d |
a distance object (as generated by dist) or a distance matrix between cases. |
clustering |
an integer vector whose length equals the number of cases, indicating a clustering. The clusters must be numbered from 1 to the number of clusters. |
alt.clustering |
an integer vector of the same form as clustering, indicating an alternative clustering. If provided, the corrected Rand index for clustering vs. alt.clustering is computed. |
silhouette |
logical. If TRUE, the silhouette statistics are computed, which requires package cluster. |
G2 |
logical. If TRUE, Goodman and Kruskal's index G2 (cf. Gordon (1999), p. 62) is computed. This performs a large number of sort operations and can be very slow (the implementation has been improved by R. Francois - thanks!). |
G3 |
logical. If TRUE, the index G3 (cf. Gordon (1999), p. 62) is computed. This executes sort on all distances and can be extremely slow. |
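The requirements on d and clustering can be illustrated with a minimal base-R sketch. The data and labels below are made up for illustration, and recoding with factor() is only one possible way to obtain clusters numbered from 1 to the number of clusters; it is not something cluster.stats is documented to do for you.

```r
# Hypothetical data: six points on a line with arbitrary cluster labels.
x <- matrix(c(0, 1, 2, 10, 11, 12), ncol = 1)
labels <- c("a", "a", "a", "c", "c", "c")

# d may be a dist object or a full distance matrix between cases.
d_obj <- dist(x)
d_mat <- as.matrix(d_obj)

# Recode arbitrary labels to integers 1..k, as required for `clustering`.
clustering <- as.integer(factor(labels))

# Consistency checks worth running before passing the arguments on.
stopifnot(length(clustering) == nrow(d_mat))
stopifnot(all(sort(unique(clustering)) == seq_len(max(clustering))))
```

Either d_obj or d_mat could then be passed as d, together with clustering.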
cluster.stats returns a list containing the components n, cluster.number, cluster.size, diameter, average.distance, median.distance, separation, average.toother, separation.matrix, average.between, average.within, n.between, n.within, clus.avg.silwidths, avg.silwidth, g2, g3, hubertgamma, dunn, wb.ratio, corrected.rand.
n |
number of cases. |
cluster.number |
number of clusters. |
cluster.size |
vector of cluster sizes (number of points). |
diameter |
vector of cluster diameters (maximum within cluster distances). |
average.distance |
vector of clusterwise within cluster average distances. |
median.distance |
vector of clusterwise within cluster distance medians. |
separation |
vector of clusterwise minimum distances of a point in the cluster to a point of another cluster. |
average.toother |
vector of clusterwise average distances of a point in the cluster to the points of other clusters. |
separation.matrix |
matrix of separation values between all pairs of clusters. |
average.between |
average distance between clusters. |
average.within |
average distance within clusters. |
n.between |
number of distances between clusters. |
n.within |
number of distances within clusters. |
clus.avg.silwidths |
vector of cluster average silhouette widths. See silhouette. |
avg.silwidth |
average silhouette width. See silhouette. |
g2 |
Goodman and Kruskal's Gamma coefficient. See Milligan and Cooper (1985), Gordon (1999, p. 62). |
g3 |
G3 coefficient. See Gordon (1999, p. 62). |
hubertgamma |
correlation between distances and a 0-1-vector where 0 means same cluster, 1 means different clusters. See Halkidi et al. (2002). |
dunn |
minimum separation / maximum diameter. Dunn index, see Halkidi et al. (2002). |
wb.ratio |
average.within/average.between. |
corrected.rand |
corrected Rand index (if alt.clustering has been specified), see Gordon (1999, p. 198). |
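Several of the components above follow directly from their one-line definitions. The following base-R sketch is one plausible reading of those definitions on made-up toy data; it is not the package's source code, and the actual implementation may aggregate differently (for example, in how average.within weights clusters of unequal size).

```r
# Toy data (hypothetical): two well-separated groups on a line.
x <- c(0, 1, 2, 10, 11, 13)
clustering <- c(1L, 1L, 1L, 2L, 2L, 2L)
dm <- as.matrix(dist(x))

# TRUE where two cases share a cluster; upper triangle counts each pair once.
same  <- outer(clustering, clustering, "==")
upper <- upper.tri(dm)

within  <- dm[upper & same]     # within-cluster distances
between <- dm[upper & !same]    # between-cluster distances

n.within  <- length(within)
n.between <- length(between)
average.within  <- mean(within)   # assumption: plain mean over all within pairs
average.between <- mean(between)
wb.ratio <- average.within / average.between

# Clusterwise diameter (maximum within-cluster distance) and
# separation (minimum distance to a point of another cluster).
k <- max(clustering)
diameter <- sapply(seq_len(k), function(i)
  max(dm[clustering == i, clustering == i]))
separation <- sapply(seq_len(k), function(i)
  min(dm[clustering == i, clustering != i]))

# Dunn index: minimum separation / maximum diameter.
dunn <- min(separation) / max(diameter)

# Hubert's gamma as described: correlation between the distances and a
# 0-1 vector (0 = same cluster, 1 = different clusters).
hubertgamma <- cor(dm[upper], as.numeric(!same[upper]))
```

A small wb.ratio together with large dunn and hubertgamma values points to compact, well-separated clusters, which is what this toy example is constructed to show.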
Christian Hennig chrish@stats.ucl.ac.uk http://www.homepages.ucl.ac.uk/~ucakche/
Gordon, A. D. (1999) Classification, 2nd ed. Chapman and Hall.
Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2002) Cluster validity methods: Part I, SIGMOD Record, 31, 40-45.
Milligan, G. W. and Cooper, M. C. (1985) An examination of procedures for determining the number of clusters. Psychometrika, 50, 159-179.
silhouette, dist
clusterboot computes clusterwise stability statistics by resampling.
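library(fpc)  # provides rFace and cluster.stats
set.seed(20000)
face <- rFace(200, dMoNo=2, dNoEy=0, p=2)   # artificial "face" data set
dface <- dist(face)
complete3 <- cutree(hclust(dface), 3)       # complete linkage, cut into 3 clusters
cluster.stats(dface, complete3,
              alt.clustering=as.integer(attr(face, "grouping")))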
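The comparison of clustering with alt.clustering can also be reproduced by hand. The sketch below assumes that the corrected Rand index is the chance-adjusted Rand index of Hubert and Arabie, computed from the contingency table of the two clusterings; the helper name corrected_rand is hypothetical and not part of fpc.

```r
# Hypothetical helper: chance-corrected Rand index of two label vectors.
corrected_rand <- function(a, b) {
  tab <- table(a, b)                           # contingency table of the two clusterings
  n <- sum(tab)
  sum_ij <- sum(choose(as.vector(tab), 2))     # pairs placed together in both
  sum_a  <- sum(choose(rowSums(tab), 2))       # pairs placed together in a
  sum_b  <- sum(choose(colSums(tab), 2))       # pairs placed together in b
  expected <- sum_a * sum_b / choose(n, 2)     # value expected by chance
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

# Identical partitions up to relabelling.
same_partition <- corrected_rand(c(1, 1, 2, 2), c(2, 2, 1, 1))

# Partitions that cross each other completely.
crossed <- corrected_rand(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

Under this definition, values near 1 indicate nearly identical partitions and values near 0 indicate agreement no better than chance. The result is undefined (0/0) when both vectors place all cases in a single cluster.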