skmeans {skmeans}R Documentation

Compute Spherical k-Means Partitions

Description

Partition given vectors x_b by minimizing the criterion sum_{b,j} w_b u_{bj}^m d(x_b, p_j) where the w_b are case weights, u_{bj} is the membership of x_b to class j, p_j is the prototype of class j (thus minimizing sum_b w_b u_{bj}^m d(x_b, p) over p), and d is the cosine dissimilarity d(x, p) = 1 - cos(x, p).

Usage

skmeans(x, k, method = NULL, m = 1, weights = 1, control = list())

Arguments

x A numeric data matrix, with rows corresponding to the objects to be partitioned (such that row b contains x_b). Can be a dense matrix, a simple triplet matrix (package slam), or a dgTMatrix (package Matrix).
k an integer giving the number of classes to be used in the partition.
method a character string specifying one of the built-in methods for computing spherical k-means partitions, or a function to be taken as a user-defined method, or NULL (default value). If a character string, its lower-cased version is matched against the lower-cased names of the available built-in methods using pmatch. See Details for available built-in methods and defaults.
m a number not less than 1 controlling the softness of the partition (as the “fuzzification parameter” of the fuzzy c-means algorithm). The default value of 1 corresponds to hard partitions obtained from the standard spherical k-means problem; values greater than one give partitions of increasing softness obtained from a generalized fuzzy spherical k-means problem.
weights a numeric vector of non-negative case weights. Recycled to the number of elements given by x if necessary.
control a list of control parameters. See Details.

Details

The “simple” spherical k-means problem where all case weights are one and m = 1 is equivalent to maximizing the criterion sum_j sum_{b in C_j} cos(x_b, p_j), where C_j is the j-th class of the partition. This is the formulation used in Dhillon & Modha (2001) and related references, and is criterion function mathcal{I}_2 in the CLUTO documentation.

Obtaining optimal spherical k-means partitions obviously is a computationally hard problem, and several methods are available which attempt to obtain optimal partitions. The built-in methods are as follows.

"genetic"
a genetic algorithm patterned after the genetic k-means algorithm of Krishna & Narasimha Murty (1999).
"pclust"
a Lloyd-Forgy style fixed-point algorithm which iterates between determining optimal memberships for fixed prototypes, and computing optimal prototypes for fixed memberships, using the general-purpose prototype-based partitioning framework of pclust in package clue.
"CLUTO"
an interface to the vcluster partitional clustering program from CLUTO, the CLUstering TOolkit by George Karypis.
"lih"
the Local Improvement Heuristic of Dhillon, Guan and Kogan (2002). If a fixed-point iteration does not (sufficiently) improve the criterion, improvement by swapping the clusters of two elements is attempted.
"lihc"
the Local Improvement Heuristic with Chains method of Dhillon, Guan and Kogan (2002). This attempts further improvements via Kernighan-Lin object moves chains.

Method "pclust" is the only method available for soft spherical k-means problems. Method "genetic" can handle case weights. By default, the genetic algorithm is used for obtained hard partitions, and the fixed-point algorithm otherwise.

Control parameters for method "genetic" are as follows.

maxiter
an integer giving the maximum number of iterations for the genetic algorithm. Defaults to 12.
popsize
an integer giving the population size for the genetic algorithm. Default: 6.
mutations
a number between 0 and 1 giving the probability of mutation per iteration. Defaults to 0.1.
start
a list with the prototypes for the initial population.
reltol
The minimum relative improvement per iteration. If improvement is less, the algorithm will stop under the assumption that no further significant improvement can be made. Defaults to 1e-8.
verbose
a logical indicating whether to provide some output on minimization progress. Defaults to getOption("verbose").

See the documentation of pclust in package clue for the control parameters for method "pclust".

Control parameters for method "CLUTO" are as follows.

vcluster
the path to the CLUTO vcluster executable.
colmodel
a specification of the CLUTO column model. See the CLUTO documentation for more details.
verbose
as for the genetic algorithm.
control
a character string specifying arguments passed over to the vcluster executable.

Control parameters for the local improvement heuristics are as follows.

maxiter
an integer giving the maximal number of iterations to be performed. Defaults to 100.
start
a single prototype to be used as a starting value.
reltol
as for the genetic algorithm.
verbose
as for the genetic algorithm.

The enhanced chain-based heuristic has the following additional control parameter.

maxchains
an integer giving the maximal length of the Kernighan-Lin chains. Defaults to 10.

Method "CLUTO" requires that the CLUTO vcluster executable is available. CLUTO binaries for the Linux, Sun, OSX, and MS Windows platforms can be obtained from http://www-users.cs.umn.edu/~karypis/cluto/.

User-defined methods must have formals x, k and control, and optionally may have formals weights or m if providing support for case weights or soft spherical k-means partitions, respectively.

Value

An object of class pclust (see the information on pclust objects in package clue for further details) representing the obtained spherical k-means partition, which is a list with components including the following:

prototypes a dense matrix with k rows giving the prototypes.
membership cluster membership as a matrix with k columns.
cluster the class ids of the closest hard partition (the partition itself if m = 1).
value the value of the criterion.

Author(s)

Kurt Hornik Kurt.Hornik@wu.ac.at,
Ingo Feinerer feinerer@logic.at,
Martin Kober martin.kober@wu.ac.at.

References

I. S. Dhillon and D. S. Modha (2001). Concept decompositions for large sparse text data using clustering. Machine Learning, 42, 143–175.

I. S. Dhillon and Y. Guan and J. Kogan (2002). Iterative clustering of high dimensional text data augmented by local search. In Proceedings of the Second IEEE International Conference on Data Mining, pages 131–138. http://www.cs.utexas.edu/users/inderjit/public_papers/iterative_icdm02.pdf.

K. Krishna and M. Narasimha Murty (1999). Genetic K-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics — Part B: Cybernetics, 29/3, 433–439. http://eprints.iisc.ernet.in/2937/1/genetic-k.pdf.

G. Karypis (2003). CLUTO: A Clustering Toolkit. Technical Report #02-017, Department of Computer Science, University of Minnesota. http://glaros.dtc.umn.edu/gkhome/fetch/sw/cluto/manual.pdf.

Examples

## Use CLUTO dataset 're0':
x <- readCM(system.file("cluto", "re0.mat",
                        package = "skmeans"),
            system.file("cluto", "re0.mat.clabel",
                        package = "skmeans"))
## Which is not really small:
dim(x)
## Partition into 5 clusters.
party <- skmeans(x, 5, control = list(verbose = TRUE))
## Criterion value obtained:
party$value
## Compare with "true" classifications:
class_ids <-
    readLines(system.file("cluto", "re0.mat.rclass",
                          package = "skmeans"))
table(class_ids, party$cluster)

[Package skmeans version 0.1-4 Index]