cls.stab.sim.ind {clv} R Documentation

Cluster Stability - Similarity Index and Pattern-wise Stability Approaches

Description

cls.stab.sim.ind and cls.stab.opt.assign report validation measures for clustering results. Both functions return lists of cluster stability results computed according to the similarity index and pattern-wise stability approaches, respectively.

Usage

cls.stab.sim.ind( data, cl.num, rep.num, subset.ratio, clust.method, method.type, sim.ind.type, fast, ... )
cls.stab.opt.assign( data, cl.num, rep.num, subset.ratio, clust.method, method.type, fast, ... )

Arguments

data numeric matrix or data.frame where columns correspond to variables and rows to observations.
cl.num integer vector with the numbers of clusters into which the data will be partitioned. If the vector is not of integer type, it will be coerced with a warning.
rep.num integer that specifies how many pairs of data subsets will be partitioned for each particular number of clusters. The partitioning results for a given pair of subsets are used to compute similarity indices (in the case of cls.stab.sim.ind) or pattern-wise stability (in the case of cls.stab.opt.assign; for more details see references). By default rep.num is 10.
subset.ratio a number from the interval (0,1) that specifies how large the data subsets should be, as a fraction of the data: 0 would mean an empty subset and 1 all of the data. By default subset.ratio is set to 0.75.
clust.method character vector with the names of the clustering algorithms to be used. Available are: "agnes", "diana", "hclust", "kmeans", "pam", "clara". Combinations are also possible. By default the vector c("agnes","pam") is used.
method.type character vector that is relevant only for the "agnes" and "hclust" algorithms. Available are: "single", "average", "complete", "ward" and "weighted" (for more details see agnes, hclust). The last type is applicable only to "agnes". Combinations are also possible. By default the vector c("single","average") is used.
sim.ind.type character vector that is relevant only for the cls.stab.sim.ind function. It selects which similarity indices (external measures) are used to compare two partitionings. Available are: "dot.pr", "sim.ind", "rand", "jaccard" (for more details see similarity.index, dot.product, std.ext). Combinations are also possible. By default the vector c("dot.pr","sim.ind") is used.
fast logical argument that sets the way cluster stability is computed for hierarchical algorithms. By default it is set to TRUE, which means that each result produced by a hierarchical algorithm is partitioned for every number of clusters chosen in the cl.num argument, and the resulting clusterings are passed on for further computation. In this way the computation of cluster stability is faster.
... additional parameters for the clustering algorithms. Note: use with caution! The clustering methods chosen in clust.method have different sets of parameter names, and mixing them often prevents any clustering algorithm from running.
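
As a rough illustration of what rep.num and subset.ratio control, the following base R sketch draws pairs of random subsets of the observations. The helper draw.subset.pair is hypothetical and not part of clv, and rounding the subset size with round() is an assumption; the package may sample differently.

```r
# Hypothetical helper (not part of clv): draw one pair of random subsets
# of the row indices 1..n.obs, each holding a subset.ratio share of them.
draw.subset.pair <- function(n.obs, subset.ratio = 0.75) {
  stopifnot(subset.ratio > 0, subset.ratio < 1)
  size <- round(subset.ratio * n.obs)  # assumption: size is rounded
  first  <- sort(sample.int(n.obs, size))
  second <- sort(sample.int(n.obs, size))
  # Only observations present in both subsets can be compared later.
  list(first = first, second = second, common = intersect(first, second))
}

set.seed(1)
rep.num <- 10  # documented default
pairs <- lapply(seq_len(rep.num),
                function(i) draw.subset.pair(nrow(iris), subset.ratio = 0.75))

# Number of observations shared by the two subsets in each repetition.
sapply(pairs, function(p) length(p$common))
```

Each pair would then be clustered separately, and the two partitionings compared on the shared observations.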

Details

Both functions implement the cluster stability approaches described in Detecting stable clusters using principal component analysis (see references).

The cls.stab.sim.ind function implements the algorithm given in chapter 3.1, where only the cosine similarity index (see dot.product) is introduced as a similarity index between two different partitionings. This function applies the same cluster stability approach to other similarity indices as well, such as similarity.index, clv.Rand and clv.Jaccard. Note that the similarity index (if chosen) produced by this function is not exactly the same as the index produced by the similarity.index function. The value of similarity.index depends on the number of clusters. E.g. if two "n-cluster" partitionings are compared, the value always lies in the interval [1/n, 1]. That means the results produced by this similarity index are not comparable across different numbers of clusters. For this reason each result is rescaled with the linear function f:[1/n, 1] -> [0, 1], where "n" is the number of clusters. The layout of the results is described in the Value section.
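
The linear map f:[1/n, 1] -> [0, 1] mentioned above can be written as f(x) = (x - 1/n) / (1 - 1/n). A minimal sketch in base R follows; rescale.sim.ind is a hypothetical name used only for illustration and is not a clv function.

```r
# Hypothetical helper (not part of clv): linearly map a similarity index
# value from [1/n, 1] onto [0, 1], where n is the number of clusters.
rescale.sim.ind <- function(value, n) {
  stopifnot(n > 1)
  (value - 1 / n) / (1 - 1 / n)
}

# For n = 3 the endpoints 1/3 and 1 map to 0 and 1, and 2/3 maps to 0.5
# (up to floating-point rounding).
rescale.sim.ind(c(1 / 3, 2 / 3, 1), n = 3)
```

After rescaling, values obtained for different numbers of clusters share the same range and can be placed side by side.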

The cls.stab.opt.assign function implements the algorithm given in chapter 3.2, where pattern-wise agreement and pattern-wise stability were introduced. The function returns the lowest pattern-wise stability value for a given number of clusters. The layout of the results is described in the Value section.
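
The idea of pattern-wise agreement and stability can be sketched in base R as follows. This is one possible reading of the approach, not the clv implementation: it compares subsample clusterings against a single reference clustering of the full data, uses kmeans from the stats package, and matches cluster labels by brute force over all permutations instead of an optimal assignment solver. All helper names are hypothetical.

```r
# Hypothetical helpers (not part of clv).

# All permutations of the vector v, one per row (suitable for small k only).
all.perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  do.call(rbind, lapply(seq_along(v),
                        function(i) cbind(v[i], all.perms(v[-i]))))
}

# Relabel `other` so that it matches `ref` as well as possible, then
# return 1 for each pattern whose labels agree and 0 otherwise.
pattern.agreement <- function(ref, other, k) {
  perms <- all.perms(seq_len(k))
  hits <- apply(perms, 1, function(p) sum(p[other] == ref))
  best <- perms[which.max(hits), ]
  as.integer(best[other] == ref)
}

set.seed(42)
x <- as.matrix(iris[, 1:4])
k <- 3
ref <- kmeans(x, centers = k, nstart = 10)$cluster  # reference clustering

rep.num <- 10
agree <- matrix(NA_integer_, nrow = rep.num, ncol = nrow(x))
for (r in seq_len(rep.num)) {
  idx <- sort(sample.int(nrow(x), round(0.75 * nrow(x))))
  sub <- kmeans(x[idx, ], centers = k, nstart = 10)$cluster
  agree[r, idx] <- pattern.agreement(ref[idx], sub, k)
}

# Share of subsamples in which each pattern kept its (matched) label.
pattern.stab <- colMeans(agree, na.rm = TRUE)
# Average over the patterns of each reference cluster, then take the
# least stable cluster as the overall score.
cluster.stab <- tapply(pattern.stab, ref, mean, na.rm = TRUE)
min(cluster.stab)
```

Brute-force matching grows factorially with k, so a real implementation would need an assignment algorithm for larger numbers of clusters.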

It often happens that a clustering algorithm cannot produce the number of clusters that the user asks for. In this situation only a warning is produced and cluster stability is computed for partitionings with unequal numbers of clusters.

Value

cls.stab.sim.ind returns a list of lists of matrices. Each matrix holds the values of one external similarity index (which one is explained below): the number of columns equals the length of the cl.num vector and the number of rows equals rep.num, so each column contains the similarity indices computed for one fixed number of clusters. The order of the matrices depends on three input arguments: clust.method, method.type, and sim.ind.type. Each combination of clust.method and method.type gives the name of an element of the outer list. Each of these elements is itself a list whose element names correspond to the similarity index types chosen with the sim.ind.type argument. The order of the names exactly matches the order given in the descriptions of those arguments. The following example makes this easier to see.
Suppose cls.stab.sim.ind is run with default arguments; then the results will be given in the following order: $agnes.single$dot.pr, $agnes.single$sim.ind, $agnes.average$dot.pr, $agnes.average$sim.ind, $pam$dot.pr, $pam$sim.ind.
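
To make the nesting concrete, the sketch below builds a mock object with the layout described above and summarises it. The values are random placeholders rather than output of cls.stab.sim.ind, and cl.num = 2:5 is assumed purely for illustration.

```r
# Mock object with the documented layout. All values are random
# placeholders, NOT results produced by cls.stab.sim.ind.
set.seed(1)
cl.num  <- 2:5  # assumed for illustration
rep.num <- 10   # documented default

# One rep.num x length(cl.num) matrix per algorithm/index combination.
placeholder <- function() {
  matrix(runif(rep.num * length(cl.num)), nrow = rep.num)
}

res <- list(
  agnes.single  = list(dot.pr = placeholder(), sim.ind = placeholder()),
  agnes.average = list(dot.pr = placeholder(), sim.ind = placeholder()),
  pam           = list(dot.pr = placeholder(), sim.ind = placeholder())
)

# Average every index over the repetitions, one value per cluster number.
avg <- lapply(res, function(by.index) {
  lapply(by.index, function(m) setNames(colMeans(m), cl.num))
})
avg$agnes.average$sim.ind

# Flattened element names, in the order the results are stored.
result.names <- names(unlist(lapply(res, function(by.index) {
  lapply(by.index, function(m) 0)
})))
print(result.names)
```

The same two-level lapply pattern should apply to a real result object, since it relies only on the list-of-lists-of-matrices structure.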


cls.stab.opt.assign returns a list of vectors. Each vector holds the cluster stability indices described in Detecting stable clusters using principal component analysis (see references). The vector length equals the length of the cl.num vector, so each position in the vector corresponds to the number of clusters at the same position in the cl.num argument. The order of the vectors depends on two input arguments: clust.method and method.type. The order of the names exactly matches the order given in the descriptions of those arguments. The following example makes this easier to see.
Suppose cls.stab.opt.assign is run with c("pam", "kmeans", "hclust", "agnes") as clust.method and c("ward","average") as method.type; then the results will be given in the following order: $hclust.average, $hclust.ward, $agnes.average, $agnes.ward, $kmeans, $pam.

Author(s)

Lukasz Nieweglowski

References

A. Ben-Hur and I. Guyon, Detecting stable clusters using principal component analysis, http://citeseer.ist.psu.edu/528061.html

C. D. Giurcaneanu, I. Tabus, I. Shmulevich and W. Zhang, Stability-Based Cluster Analysis Applied To Microarray Data, http://citeseer.ist.psu.edu/577114.html

T. Lange, V. Roth, M. L. Braun and J. M. Buhmann, Stability-Based Validation of Clustering Solutions, ml-pub.inf.ethz.ch/publications/papers/2004/lange.neco_stab.03.pdf

See Also

Functions that compare two different partitionings: clv.Rand, dot.product, similarity.index.

Examples


# load and prepare data
library(clv)
data(iris)
iris.data <- iris[,1:4]

# fix arguments for cls.stab.* function
iter = c(2,3,4,5,6,7,9,12,15)
smp.num = 5
ratio = 0.8

res1 = cls.stab.sim.ind( iris.data, iter, rep.num=smp.num, subset.ratio=ratio, sim.ind.type=c("rand","dot.pr","sim.ind") )
res2 = cls.stab.opt.assign( iris.data, iter, clust.method=c("hclust","kmeans"), method.type=c("single","average") )

print(res1)
boxplot(res1$agnes.average$sim.ind)
plot(res2$hclust.single)


[Package clv version 0.2 Index]