skmeans {skmeans} | R Documentation |
Partition given vectors x_b by minimizing the criterion sum_{b,j} w_b u_{bj}^m d(x_b, p_j) where the w_b are case weights, u_{bj} is the membership of x_b to class j, p_j is the prototype of class j (thus minimizing sum_b w_b u_{bj}^m d(x_b, p) over p), and d is the cosine dissimilarity d(x, p) = 1 - cos(x, p).
skmeans(x, k, method = NULL, m = 1, weights = 1, control = list())
x |
A numeric data matrix, with rows corresponding to the objects to be partitioned (such that row b contains x_b). Can be a dense matrix, a simple triplet matrix (package slam), or a dgTMatrix (package Matrix). |
k |
an integer giving the number of classes to be used in the partition. |
method |
a character string specifying one of the built-in
methods for computing spherical k-means partitions, or a
function to be taken as a user-defined method, or NULL
(default value). If a character string, its lower-cased version is
matched against the lower-cased names of the available built-in
methods using pmatch . See Details for
available built-in methods and defaults. |
m |
a number not less than 1 controlling the softness of the partition (as the “fuzzification parameter” of the fuzzy c-means algorithm). The default value of 1 corresponds to hard partitions obtained from the standard spherical k-means problem; values greater than one give partitions of increasing softness obtained from a generalized fuzzy spherical k-means problem. |
weights |
a numeric vector of non-negative case weights.
Recycled to the number of elements given by x if necessary. |
control |
a list of control parameters. See Details. |
The “simple” spherical k-means problem where all case weights are one and m = 1 is equivalent to maximizing the criterion sum_j sum_{b in C_j} cos(x_b, p_j), where C_j is the j-th class of the partition. This is the formulation used in Dhillon & Modha (2001) and related references, and is criterion function mathcal{I}_2 in the CLUTO documentation.
Obtaining optimal spherical k-means partitions obviously is a computationally hard problem, and several methods are available which attempt to obtain optimal partitions. The built-in methods are as follows.
"genetic"
"pclust"
pclust
in package
clue."CLUTO"
vcluster
partitional
clustering program from CLUTO, the CLUstering TOolkit by George
Karypis."lih"
"lihc"
Method "pclust"
is the only method available for soft spherical
k-means problems. Method "genetic"
can handle case
weights. By default, the genetic algorithm is used for obtained hard
partitions, and the fixed-point algorithm otherwise.
Control parameters for method "genetic"
are as follows.
maxiter
popsize
mutations
start
reltol
verbose
getOption("verbose")
.
See the documentation of pclust
in package
clue for the control parameters for method "pclust"
.
Control parameters for method "CLUTO"
are as follows.
vcluster
vcluster
executable.colmodel
verbose
control
vcluster
executable.Control parameters for the local improvement heuristics are as follows.
maxiter
start
reltol
verbose
The enhanced chain-based heuristic has the following additional control parameter.
maxchains
Method "CLUTO"
requires that the CLUTO vcluster
executable is available. CLUTO binaries for the Linux, Sun, OSX, and
MS Windows platforms can be obtained from
http://www-users.cs.umn.edu/~karypis/cluto/.
User-defined methods must have formals x
, k
and
control
, and optionally may have formals weights
or m
if providing support for case weights or soft spherical
k-means partitions, respectively.
An object of class pclust
(see the information on
pclust objects in package clue for
further details) representing the obtained spherical k-means
partition, which is a list with components including the following:
prototypes |
a dense matrix with k rows giving the
prototypes. |
membership |
cluster membership as a matrix with k
columns. |
cluster |
the class ids of the closest hard partition (the partition itself if m = 1). |
value |
the value of the criterion. |
Kurt Hornik Kurt.Hornik@wu.ac.at,
Ingo Feinerer feinerer@logic.at,
Martin Kober martin.kober@wu.ac.at.
I. S. Dhillon and D. S. Modha (2001). Concept decompositions for large sparse text data using clustering. Machine Learning, 42, 143–175.
I. S. Dhillon and Y. Guan and J. Kogan (2002). Iterative clustering of high dimensional text data augmented by local search. In Proceedings of the Second IEEE International Conference on Data Mining, pages 131–138. http://www.cs.utexas.edu/users/inderjit/public_papers/iterative_icdm02.pdf.
K. Krishna and M. Narasimha Murty (1999). Genetic K-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics — Part B: Cybernetics, 29/3, 433–439. http://eprints.iisc.ernet.in/2937/1/genetic-k.pdf.
G. Karypis (2003). CLUTO: A Clustering Toolkit. Technical Report #02-017, Department of Computer Science, University of Minnesota. http://glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/manual.pdf.
## Use CLUTO dataset 're0': x <- readCM(system.file("cluto", "re0.mat", package = "skmeans"), system.file("cluto", "re0.mat.clabel", package = "skmeans")) ## Which is not really small: dim(x) ## Partition into 5 clusters. party <- skmeans(x, 5, control = list(verbose = TRUE)) ## Criterion value obtained: party$value ## Compare with "true" classifications: class_ids <- readLines(system.file("cluto", "re0.mat.rclass", package = "skmeans")) table(class_ids, party$cluster)