clusterboot {fpc} | R Documentation |
Assessment of the clusterwise stability of a clustering of data, which can be cases*variables or dissimilarity data. The data is resampled using several schemes (bootstrap, subsetting, jittering, replacement of points by noise) and the Jaccard similarities of the original clusters to the most similar clusters in the resampled data are computed. The mean over these similarities is used as an index of the stability of a cluster (other statistics can be computed as well). The methods are described in Hennig (2007).
clusterboot is an integrated function that computes the clustering as well, using interface functions for various clustering methods implemented in R (several interface functions are provided, but you can implement further ones for your favourite clustering method). See the documentation of the input parameter clustermethod below.
Quite general clustering methods are possible, i.e. methods estimating or fixing the number of clusters, methods producing overlapping clusters or not assigning all cases to clusters (but declaring them as "noise"). Fuzzy clusterings cannot be processed and have to be transformed to crisp clusterings by the interface function.
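The resampling idea described above can be sketched in base R. This is a simplified illustration under stated assumptions, not the package's implementation: the simulated data, the use of kmeans, and the helper name jaccard are all hypothetical choices, and duplicate draws are simply dropped (similar in spirit to multipleboot=FALSE).

```r
# Jaccard similarity between two sets of point indices (hypothetical helper).
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

set.seed(1)
# Two well-separated Gaussian groups (illustrative data only).
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 6), ncol = 2))
n <- nrow(x)
k <- 2   # number of clusters
B <- 20  # number of bootstrap runs

# Clustering of the original data.
orig <- kmeans(x, centers = k, nstart = 5)$cluster

# sims[i, b]: Jaccard similarity of original cluster i to the most
# similar cluster found in bootstrap run b.
sims <- matrix(NA_real_, nrow = k, ncol = B)
for (b in seq_len(B)) {
  # Bootstrap draw; points drawn more than once are used only once.
  idx <- unique(sample(n, n, replace = TRUE))
  boot <- kmeans(x[idx, , drop = FALSE], centers = k, nstart = 5)$cluster
  for (i in seq_len(k)) {
    # "Bootstrap version" of original cluster i: its points that were drawn.
    orig_i <- idx[orig[idx] == i]
    sims[i, b] <- max(sapply(seq_len(k), function(j) {
      jaccard(orig_i, idx[boot == j])
    }))
  }
}

# Clusterwise mean Jaccard similarity as a stability index.
stability <- rowMeans(sims)
print(stability)
```

The rows of sims correspond to original clusters and the columns to resampling runs, so the row means give one stability value per cluster.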
clusterboot(data, B=100, distances=(class(data)=="dist"),
            bootmethod=if(distances) "boot" else c("boot","noise"),
            bscompare=FALSE, multipleboot=TRUE,
            jittertuning=0.05, noisetuning=c(0.05,4),
            subtuning=floor(nrow(data)/2),
            clustermethod, noisemethod=FALSE, count=TRUE,
            showplots=FALSE, dissolution=0.5,
            recover=0.75, ...)

## S3 method for class 'clboot':
print(x, statistics=c("mean","dissolution","recovery"), ...)

## S3 method for class 'clboot':
plot(x, xlim=c(0,1), breaks=seq(0,1,by=0.05), ...)
data |
something that can be coerced into a matrix. The data
matrix: either an n*p data matrix (or data frame) or an
n*n dissimilarity matrix (or dist object). |
B |
integer. Number of resampling runs for each scheme, see
bootmethod. |
distances |
logical. If TRUE, the data is interpreted as a
dissimilarity matrix. If data is a dist object,
distances=TRUE automatically; otherwise
distances=FALSE by default. This means that you have to set
it to TRUE manually if data is a dissimilarity matrix that is not a dist object. |
bootmethod |
vector of strings, defining the methods used for
resampling. Possible methods:
"boot": nonparametric bootstrap (precise behaviour is
controlled by parameters bscompare and multipleboot).
"subset": selecting random subsets from the dataset. Size
determined by subtuning.
"noise": replacing a certain percentage of the points by
random noise, see noisetuning.
"jitter": adding random noise to all points, see
jittertuning. (This didn't perform well in Hennig (2007),
but you may want to try it yourself.)
"bojit": nonparametric bootstrap first, and then adding
noise to the points, see jittertuning.
Important: only the methods "boot" and
"subset" work with dissimilarity data!
The results in Hennig (2007) indicate that "boot" is
generally informative and often quite similar to "subset" and
"bojit", while "noise" sometimes provides different
information. Therefore the default (for distances=FALSE) is
to use "boot" and "noise". However, some clustering
methods may have problems with multiple points (points drawn
more than once), which can be solved
by using "bojit" or "subset" instead of "boot" or by
setting multipleboot=FALSE, see below. |
bscompare |
logical. If TRUE, multiple points in the
bootstrap sample are taken into account to compute the Jaccard
similarity to the original clusters (which are represented by their
"bootstrap versions", i.e., the
points of the original cluster which also occur in the bootstrap
sample). If a point was drawn more than once, it is in the "bootstrap
version" of the original cluster more than once, too, if
bscompare=TRUE. Otherwise (default) multiple points are
ignored for the computation of the Jaccard similarities. If
multipleboot=FALSE, it doesn't make a difference. |
multipleboot |
logical. If FALSE, all points drawn more
than once in the bootstrap draw are only used once in the bootstrap
samples. |
jittertuning |
positive numeric. Tuning for the
"jitter" method. The noise distribution for
jittering is a normal distribution with zero mean. The covariance
matrix has the same eigenvectors as that of the original
data set, but the standard deviation along the principal directions is
determined by the jittertuning-quantile of the distances
between neighboring points projected along these directions. |
noisetuning |
a vector of two positive numerics. Tuning for the
"noise" method. The first component determines the
probability that a point is replaced by noise. Noise is generated by
a uniform distribution on a hyperrectangle along the principal
directions of the original data set, ranging from
-noisetuning[2] to noisetuning[2] times the standard
deviation of the data set along the respective direction. Note that
only points not replaced by noise are considered for the computation
of Jaccard similarities. |
subtuning |
integer. Size of subsets for "subset". |
clustermethod |
an interface function (the function name, not a string containing the name, has to be provided!). This defines the clustering method. See the "Details" section for a list of available interface functions and guidelines on how to write your own. |
noisemethod |
logical. If TRUE, the last cluster is
regarded as "noise component", which means that for computing the Jaccard
similarity, it is not treated as a cluster. The noise component of
the original clustering is only compared with the noise component of
the clustering of the resampled data. (Some clustering methods such as
trimmed k-means and EMclustN produce such noise
components.) |
count |
logical. If TRUE, the resampling runs are counted
on the screen. |
showplots |
logical. If TRUE, a plot of the first two
dimensions of the resampled data set (or the classical MDS solution
for dissimilarity data) is shown for every resampling run. The last
plot shows the original data set. |
dissolution |
numeric between 0 and 1. If the Jaccard similarity between the resampling version of the original cluster and the most similar cluster on the resampled data is smaller than or equal to this value, the cluster is considered as "dissolved". Numbers of dissolved clusters are recorded. |
recover |
numeric between 0 and 1. If the Jaccard similarity between the resampling version of the original cluster and the most similar cluster on the resampled data is larger than this value, the cluster is considered as "successfully recovered". Numbers of recovered clusters are recorded. |
... |
additional parameters for the clustermethods called by
clusterboot. No effect in print.clboot and
plot.clboot. |
x |
object of class clboot. |
statistics |
specifies in print.clboot
which of the three clusterwise Jaccard
similarity statistics "mean", "dissolution" (number of
times the cluster has been dissolved) and "recovery" (number
of times a cluster has been successfully recovered) are printed. |
xlim |
passed on to hist. |
breaks |
passed on to hist. |
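The "noise" scheme described under noisetuning can be illustrated with a short base R sketch. It is only a rough reading of that description: the simulated data, the use of prcomp for the principal directions, and centering the hyperrectangle at the data mean are assumptions, and the package's own implementation may differ in detail.

```r
set.seed(42)
x <- matrix(rnorm(200), ncol = 2)  # illustrative data only
noisetuning <- c(0.05, 4)          # default values from the usage above

n <- nrow(x)
p <- ncol(x)
pc <- prcomp(x)  # principal directions and standard deviations along them

# Each point is replaced by noise with probability noisetuning[1].
replace_idx <- which(runif(n) < noisetuning[1])
m <- length(replace_idx)

# Uniform noise on a hyperrectangle in principal-component coordinates:
# each coordinate ranges over +/- noisetuning[2] * sdev of its direction.
u <- matrix(runif(m * p, -1, 1), nrow = m, ncol = p)
scores <- sweep(u, 2, noisetuning[2] * pc$sdev, `*`)

# Rotate back to the original coordinates (centered at the data mean,
# which is an assumption of this sketch).
noise <- sweep(scores %*% t(pc$rotation), 2, pc$center, `+`)

x_noisy <- x
x_noisy[replace_idx, ] <- noise

# Only these points would enter the Jaccard computation.
kept <- setdiff(seq_len(n), replace_idx)
```

The index vector kept reflects the note that only points not replaced by noise are considered when Jaccard similarities are computed.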
Here are some guidelines for interpretation. There is some theoretical justification to consider a Jaccard similarity value smaller than or equal to 0.5 as an indication of a "dissolved cluster", see Hennig (2004). Generally, a valid, stable cluster should yield a mean Jaccard similarity value of 0.75 or more. Between 0.6 and 0.75, clusters may be considered as indicating patterns in the data, but which points exactly should belong to these clusters is highly doubtful. Clusters with average Jaccard values below 0.6 should not be trusted. "Highly stable" clusters should yield average Jaccard similarities of 0.85 and above. All of this refers to bootstrap; for the other resampling schemes it depends on the tuning constants, though their default values should allow similar interpretations in most cases.
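As a reading aid, the thresholds above can be written as a small base R helper. The function name interpret_jaccard and the label strings are hypothetical, and the treatment of boundary values is one possible reading of the guidelines, not part of the package.

```r
# Map mean Jaccard similarities to the rough categories described above.
interpret_jaccard <- function(m) {
  stopifnot(is.numeric(m), all(m >= 0 & m <= 1))
  out <- character(length(m))
  out[m <= 0.5] <- "dissolved"
  out[m > 0.5 & m < 0.6] <- "not to be trusted"
  out[m >= 0.6 & m < 0.75] <- "pattern, membership doubtful"
  out[m >= 0.75 & m < 0.85] <- "stable"
  out[m >= 0.85] <- "highly stable"
  out
}

interpret_jaccard(c(0.45, 0.55, 0.70, 0.80, 0.90))
```

Such labels are only a coarse summary; the full distribution of the Jaccard values over the resampling runs is usually more informative than a single category.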
While B=100 is recommended, smaller numbers of runs can still give quite informative results if computation times become too high.
Note that the stability of a cluster is assessed, but
stability is not the only important validity criterion - clusters
obtained by very inflexible clustering methods may be stable but not
valid, as discussed in Hennig (2007).
See plotcluster for graphical cluster validation.
Information about interface functions for clustering methods:
The following interface functions are currently
implemented (in the present package; note that almost all of these
functions require the specification of some control parameters, so
if you use one of them, look up their common help page
kmeansCBI first):
kmeansCBI |
an interface to kmeans for k-means clustering. This assumes a cases*variables matrix as input. |
hclustCBI |
an interface to hclust for agglomerative hierarchical clustering with optional noise component. This function produces a partition and assumes a cases*variables matrix as input. |
hclusttreeCBI |
an interface to hclust for agglomerative hierarchical clustering. This function produces a tree (not only a partition; therefore the number of clusters can be huge!) and assumes a cases*variables matrix as input. |
disthclustCBI |
an interface to hclust for agglomerative hierarchical clustering with optional noise component. This function produces a partition and assumes a dissimilarity matrix as input. |
noisemclustCBI |
an interface to EMclust and EMclustN for normal mixture model based clustering. This assumes a cases*variables matrix as input. Warning: EMclust and EMclustN often have problems with multiple points. It is recommended to use this only together with multipleboot=FALSE. |
distnoisemclustCBI |
an interface to EMclust and EMclustN for normal mixture model based clustering. This assumes a dissimilarity matrix as input and generates a data matrix by multidimensional scaling first. Warning: EMclust and EMclustN often have problems with multiple points. It is recommended to use this only together with multipleboot=FALSE. |
claraCBI |
an interface to pam and clara for partitioning around medoids. This can be used with cases*variables as well as dissimilarity matrices as input. |
pamkCBI |
an interface to pamk for partitioning around medoids. The number of clusters is estimated by the average silhouette width. This can be used with cases*variables as well as dissimilarity matrices as input. |
trimkmeansCBI |
an interface to trimkmeans for trimmed k-means clustering. This assumes a cases*variables matrix as input. |
disttrimkmeansCBI |
an interface to trimkmeans for trimmed k-means clustering. This assumes a dissimilarity matrix as input and generates a data matrix by multidimensional scaling first. |
dbscanCBI |
an interface to dbscan for density based clustering. This can be used with cases*variables as well as dissimilarity matrices as input. |
mahalCBI |
an interface to fixmahal for fixed point clustering. This assumes a cases*variables matrix as input. |
You can write your own interface function. The first argument of an interface function should always be a data matrix (of class "matrix", but it may be a symmetrical dissimilarity matrix). Further arguments can be tuning constants for the clustering method. The output of an interface function should be a list containing (at least) the following components:
nc |
number of clusters. If a noise component is defined, nc includes the noise component, and there should be another component nccl, being the number of clusters not including the noise component (note that it is not mandatory to define a noise component if not all points are assigned to clusters, but if you do it, the stability of the noise component is assessed as well). |
clusterlist |
a list consisting of a logical vector of length n (the number of data points) for each cluster, indicating whether a point is a member of this cluster (TRUE) or not. If a noise component is included, it should always be the last vector in this list. |
partition |
an integer vector of length n, partitioning the data. If the method produces a partition, it should be the clustering. This component is only used for plots, so you could do something like rep(1,n) for non-partitioning methods. |
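A custom interface function along these lines might look as follows. This is a minimal sketch using base R's kmeans; the function name mykmeansCBI is hypothetical, and the component names result, clusterlist and clustermethod are assumptions about the package's conventions that should be checked against the kmeansCBI help page.

```r
# Hypothetical interface function wrapping base R's kmeans().
# First argument: data matrix; further arguments: tuning constants.
mykmeansCBI <- function(data, k, ...) {
  data <- as.matrix(data)
  fit <- kmeans(data, centers = k, ...)
  partition <- fit$cluster
  # One logical membership vector of length n per cluster.
  clusterlist <- lapply(seq_len(k), function(i) partition == i)
  list(result = fit,            # full output of the clustering method (assumed name)
       nc = k,                  # number of clusters
       clusterlist = clusterlist,  # assumed component name
       partition = partition,   # integer vector of length n
       clustermethod = "kmeans (custom sketch)")  # assumed component name
}

set.seed(3)
x <- rbind(matrix(rnorm(40), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))
out <- mykmeansCBI(x, k = 2, nstart = 5)
str(out$clusterlist)
```

Such a function would be passed as clustermethod=mykmeansCBI (the function itself, not a string), with k supplied through the ... argument of clusterboot.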
clusterboot returns an object of class "clboot", which is a list with components result, partition, nc, clustermethod, B, bootmethod, multipleboot, dissolution, recover, bootresult, bootmean, bootbrd, bootrecover, jitterresult, jittermean, jitterbrd, jitterrecover, subsetresult, subsetmean, subsetbrd, subsetrecover, bojitresult, bojitmean, bojitbrd, bojitrecover, noiseresult, noisemean, noisebrd, noiserecover.
result |
clustering result; full output of the selected
clustermethod for the original data set. |
partition |
partition parameter of the selected clustermethod
(note that this is only meaningful for partitioning clustering methods). |
nc |
number of clusters in original data (including noise
component if noisemethod=TRUE). |
clustermethod, B, bootmethod, multipleboot, dissolution, recover |
input parameters, see above. |
bootresult |
matrix of Jaccard similarities for
bootmethod="boot". Rows correspond to clusters in the
original data set. Columns correspond to bootstrap runs. |
bootmean |
clusterwise means of bootresult. |
bootbrd |
clusterwise number of times a cluster has been dissolved. |
bootrecover |
clusterwise number of times a cluster has been successfully recovered. |
subsetresult, subsetmean, etc. |
same as bootresult, bootmean, etc., but for the other resampling methods. |
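The relation between these components can be illustrated on a made-up matrix of Jaccard similarities. The numbers below are hypothetical, and the computation is only a sketch of how the clusterwise summaries follow from the definitions of dissolution and recover given above.

```r
# Hypothetical Jaccard similarities: rows = clusters, columns = runs.
bootresult <- rbind(c(0.90, 0.80, 0.95, 0.40),
                    c(0.50, 0.30, 0.76, 0.60))
dissolution <- 0.5  # default threshold from the usage above
recover <- 0.75     # default threshold from the usage above

# Clusterwise mean similarity.
bootmean <- rowMeans(bootresult)               # 0.7625 0.5400
# Number of runs in which each cluster was dissolved (<= dissolution).
bootbrd <- rowSums(bootresult <= dissolution)  # 1 2
# Number of runs in which each cluster was recovered (> recover).
bootrecover <- rowSums(bootresult > recover)   # 3 1
```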
Christian Hennig chrish@stats.ucl.ac.uk http://www.homepages.ucl.ac.uk/~ucakche/
Hennig, C. (2004) A general robustness and stability theory for cluster analysis, Preprint 2004-07, Fachbereich Mathematik - SPST, Hamburg. http://www.homepages.ucl.ac.uk/~ucakche/papers/classbrd.ps
Hennig, C. (2007) Cluster-wise assessment of cluster stability. Computational Statistics and Data Analysis, tentatively accepted.
dist, interface functions: kmeansCBI, hclustCBI, hclusttreeCBI, disthclustCBI, noisemclustCBI, distnoisemclustCBI, claraCBI, pamkCBI, trimkmeansCBI, disttrimkmeansCBI, dbscanCBI, mahalCBI
set.seed(20000)
face <- rFace(50,dMoNo=2,dNoEy=0,p=2)
cf1 <- clusterboot(face,B=5,bootmethod=
         c("boot","noise","jitter"),clustermethod=kmeansCBI,
         k=5)
print(cf1)
plot(cf1)
cf2 <- clusterboot(dist(face),B=5,bootmethod=
         "subset",clustermethod=disthclustCBI,
         k=5, cut="number", method="average", showplots=TRUE)
print(cf2)