cluster.stats {fpc} | R Documentation |
Computes a number of distance-based statistics that can be used for cluster validation, for comparing clusterings, and for deciding on the number of clusters: cluster sizes, cluster diameters, average distances within and between clusters, cluster separation, average silhouette widths, the distance-based statistics that performed best at deciding on the number of clusters in the study of Milligan and Cooper (1985), Hubert's gamma coefficient, the Dunn index, and two indexes that assess the similarity of two clusterings, namely the corrected Rand index and Meila's VI.
cluster.stats(d, clustering, alt.clustering=NULL, silhouette=TRUE, G2=FALSE, G3=FALSE, compareonly=FALSE)
d |
a distance object (as generated by dist) or a distance matrix between cases. |
clustering |
an integer vector, with one entry per case, that indicates a clustering. The clusters have to be numbered from 1 to the number of clusters. |
alt.clustering |
an integer vector of the same form as clustering, indicating an alternative clustering. If provided, the corrected Rand index and Meila's VI for clustering vs. alt.clustering are computed. |
silhouette |
logical. If TRUE, the silhouette statistics are computed, which requires package cluster. |
G2 |
logical. If TRUE, Goodman and Kruskal's index G2 (cf. Gordon (1999), p. 62) is computed. This performs a large number of sorts and can be very slow (the implementation has been improved by R. Francois - thanks!). |
G3 |
logical. If TRUE, the index G3 (cf. Gordon (1999), p. 62) is computed. This executes sort on all distances and can be extremely slow. |
compareonly |
logical. If TRUE, only the corrected Rand index and Meila's VI are computed and returned (this requires alt.clustering to be specified). |
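Because clustering and alt.clustering have to be integer vectors numbered from 1 to the number of clusters, labels from other sources (character labels, or integers with gaps) need to be recoded first. A minimal base R sketch of one way to do this; the vector raw_labels is made up for illustration and is not part of fpc:

```r
# Hypothetical labels with gaps and no particular order (illustration only).
raw_labels <- c(10, 10, 3, 7, 3, 7, 7)

# factor() maps the distinct values to levels (sorted: 3, 7, 10);
# as.integer() returns the level codes, which run from 1 to the
# number of distinct labels.
clustering <- as.integer(factor(raw_labels))

print(clustering)
```

The recoding keeps the grouping intact and only renames the clusters, so statistics such as cluster sizes are reported in the order of the sorted original labels.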
cluster.stats returns a list containing the components
n, cluster.number, cluster.size, diameter,
average.distance, median.distance, separation, average.toother,
separation.matrix, average.between, average.within,
n.between, n.within, within.cluster.ss, clus.avg.silwidths, avg.silwidth,
g2, g3, hubertgamma, dunn, entropy, wb.ratio,
corrected.rand, vi,
except if compareonly=TRUE, in which case only the last two components are computed.
n |
number of cases. |
cluster.number |
number of clusters. |
cluster.size |
vector of cluster sizes (number of points). |
diameter |
vector of cluster diameters (maximum within cluster distances). |
average.distance |
vector of clusterwise within cluster average distances. |
median.distance |
vector of clusterwise within cluster distance medians. |
separation |
vector of clusterwise minimum distances of a point in the cluster to a point of another cluster. |
average.toother |
vector of clusterwise average distances of a point in the cluster to the points of other clusters. |
separation.matrix |
matrix of separation values between all pairs of clusters. |
average.between |
average distance between clusters. |
average.within |
average distance within clusters. |
n.between |
number of distances between clusters. |
n.within |
number of distances within clusters. |
within.cluster.ss |
a generalisation of the within clusters sum of squares (k-means objective function), which is obtained if d is a Euclidean distance matrix. For general distance measures, this is half the sum of the within cluster squared dissimilarities divided by the cluster size. |
clus.avg.silwidths |
vector of cluster average silhouette widths. See silhouette. |
avg.silwidth |
average silhouette width. See silhouette. |
g2 |
Goodman and Kruskal's Gamma coefficient. See Milligan and Cooper (1985), Gordon (1999, p. 62). |
g3 |
G3 coefficient. See Gordon (1999, p. 62). |
hubertgamma |
correlation between distances and a 0-1-vector where 0 means same cluster, 1 means different clusters. See Halkidi et al. (2002). |
dunn |
Dunn index: minimum separation / maximum diameter. See Halkidi et al. (2002). |
entropy |
entropy of the distribution of cluster memberships, see Meila (2007). |
wb.ratio |
average.within/average.between. |
corrected.rand |
corrected Rand index (if alt.clustering has been specified), see Gordon (1999, p. 198). |
vi |
variation of information (VI) index (if alt.clustering has been specified), see Meila (2007). |
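Several of the components above follow directly from the distance matrix and the cluster labels. The sketch below recomputes a few of them by hand in base R on a made-up toy data set, as a reading of the descriptions rather than the package's actual code. All variable names are illustrative assumptions. In particular, average_within is taken here over all within-cluster pairs, and the weighting used by the package may differ between versions.

```r
# Toy data (illustration only): six points on a line, two clusters.
x <- c(0, 1, 2, 10, 11, 13)
clustering <- c(1, 1, 1, 2, 2, 2)

dmat <- as.matrix(dist(x))                   # full symmetric distance matrix
same <- outer(clustering, clustering, "==")  # TRUE where both cases share a cluster
up   <- upper.tri(dmat)                      # select each pair of cases once

within_d  <- dmat[up & same]    # distances within clusters
between_d <- dmat[up & !same]   # distances between clusters

# Average distances and their ratio (cf. average.within, average.between, wb.ratio).
average_within  <- mean(within_d)
average_between <- mean(between_d)
wb_ratio <- average_within / average_between

# Dunn index: minimum separation / maximum diameter.
dunn <- min(between_d) / max(within_d)

# Hubert's gamma: correlation between the distances and a 0-1 vector
# (0 = same cluster, 1 = different clusters).
hubert_gamma <- cor(dmat[up], as.numeric(!same[up]))

# Generalised within-cluster sum of squares: for each cluster, half the sum
# of squared dissimilarities over the full within-cluster block of the
# matrix, divided by the cluster size.
within_ss <- sum(sapply(unique(clustering), function(k) {
  idx <- clustering == k
  sum(dmat[idx, idx]^2) / (2 * sum(idx))
}))

print(c(wb_ratio = wb_ratio, dunn = dunn,
        hubert_gamma = hubert_gamma, within_ss = within_ss))
```

Since the toy distances are Euclidean, within_ss agrees with the usual sum of squared deviations from the cluster means (2 for the first cluster plus 14/3 for the second).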
Christian Hennig chrish@stats.ucl.ac.uk http://www.homepages.ucl.ac.uk/~ucakche/
Gordon, A. D. (1999) Classification, 2nd ed. Chapman and Hall.
Halkidi, M., Batistakis, Y., Vazirgiannis, M. (2002) Cluster validity methods, SIGMOD Record, 31, 40-45.
Meila, M. (2007) Comparing clusterings - an information based distance, Journal of Multivariate Analysis, 98, 873-895.
Milligan, G. W. and Cooper, M. C. (1985) An examination of procedures for determining the number of clusters. Psychometrika, 50, 159-179.
silhouette, dist
clusterboot computes clusterwise stability statistics by resampling.
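library(fpc)  # provides rFace and cluster.stats
set.seed(20000)
# simulate 200 two-dimensional "face" points; the true grouping is stored as an attribute
face <- rFace(200,dMoNo=2,dNoEy=0,p=2)
dface <- dist(face)
# complete linkage hierarchical clustering, cut into 3 clusters
complete3 <- cutree(hclust(dface),3)
# validation statistics, plus a comparison with the true grouping
cluster.stats(dface,complete3,
              alt.clustering=as.integer(attr(face,"grouping")))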
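The entropy and vi components can be illustrated without the package. The following base R sketch assumes the definitions in Meila (2007) with natural logarithms: the entropy of the cluster membership distribution, and VI as the sum of the two entropies minus twice the mutual information. The helper names entropy_of and vi_of and the two toy clusterings are illustrative assumptions, not fpc functions.

```r
# Two hypothetical clusterings of the same eight cases (illustration only).
a <- c(1, 1, 1, 1, 2, 2, 2, 2)
b <- c(1, 1, 2, 2, 1, 1, 2, 2)

# Entropy of the distribution of cluster memberships (natural log).
entropy_of <- function(cl) {
  p <- as.numeric(table(cl)) / length(cl)
  -sum(p * log(p))
}

# Variation of information: H(cl1) + H(cl2) - 2 * I(cl1, cl2).
vi_of <- function(cl1, cl2) {
  joint <- unclass(table(cl1, cl2)) / length(cl1)   # joint membership proportions
  indep <- outer(rowSums(joint), colSums(joint))    # product of the marginals
  nz <- joint > 0                                   # skip empty cells (0 * log 0 = 0)
  mutual_info <- sum(joint[nz] * log(joint[nz] / indep[nz]))
  entropy_of(cl1) + entropy_of(cl2) - 2 * mutual_info
}

print(entropy_of(a))
print(vi_of(a, b))
print(vi_of(a, a))
```

In this toy case the two clusterings are statistically independent, so their mutual information is zero and the VI equals the sum of the two entropies; comparing a clustering with itself gives a VI of zero.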