cluster.optimal {bayesclust}          R Documentation

Search for Optimal Clustering of Dataset

Description

cluster.optimal searches for the optimal partition of the dataset into k clusters.

Usage

cluster.optimal(data, nsim = 1000, aR = 0.4, p = 2, k = 2,
        a = 2.01, b = 0.990099, tau2 = 1, keep = 4, mcs = 0.2,
        file = "", label = "data")

Arguments

data data should be a matrix, even when the response is univariate. The number of rows should equal the number of observations, and the number of columns should equal p.
nsim The algorithm is based on a stochastic search through the partition space, and nsim is the number of points in the partition space to inspect. It is recommended that nsim be at least 500,000, which is far above the default of 1000.
aR The Metropolis search algorithm samples points from the space of partitions according to a mixture of a density g and a random walk. aR, which must be a value between 0 and 1, specifies the proportion of the time that the random walk is chosen. Please see the references below for further details on the density g.
p The observations are assumed to come from a multivariate normal distribution of dimension p.
k k must take an integer value strictly greater than 1. It instructs the algorithm to search for the optimal partition of the data into k clusters.
a a is a hyperparameter for the prior on σ^2. Further details can be found in the references below.
b Like a, b is also a hyperparameter for the prior on σ^2. Further details can be found in the references below.
tau2 tau2 is a hyperparameter for the prior on the mean μ for each cluster.
keep This argument instructs the algorithm to store the top keep clusters that it finds during its run. By default, the best 4 clusters that are found will be kept.
mcs mcs stands for Minimum Cluster Size. It should be a value between 0 and 1. It instructs the algorithm to only consider clusters of a certain minimum size.
file This argument is a character string. If specified, the output object will be saved to this (binary) file. It can be loaded, inspected and altered later in subsequent R sessions using load. If left unspecified, the object will not be saved to a file and could be lost on quitting the R session.
label label serves to name the dataset in any given hypothesis test.
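
Because data must be a matrix even for a univariate response, a plain numeric vector needs reshaping before it is passed in. The sketch below uses base R only; the names y, Y1, Y2, p1 and p2 are illustrative and are not part of the package.

```r
# Univariate response: reshape a numeric vector into an n x 1 matrix
set.seed(1)
y <- rnorm(12)
Y1 <- matrix(y, ncol = 1)            # 12 observations, p = 1

# Bivariate response: one row per observation, one column per dimension
Y2 <- matrix(rnorm(24), nrow = 12)   # 12 observations, p = 2

# The number of columns is the value to pass as p
p1 <- ncol(Y1)
p2 <- ncol(Y2)

# Basic checks that mirror the documented constraints on data
stopifnot(is.matrix(Y1), is.matrix(Y2))
stopifnot(nrow(Y1) == length(y))
```

With these objects, a call would take the form cluster.optimal(Y1, p = p1) or cluster.optimal(Y2, p = p2).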

Details

A Metropolis search algorithm is run to maximise the marginal of Y, that is, m(Y | omega_k) where omega_k is a particular partitioning of the data into k clusters. The partitions that yield the highest marginal will be deemed to be optimal.
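
To give a feel for this kind of search, here is a toy sketch of a Metropolis search over partitions. It is not the package's implementation. Two parts are assumptions made purely for illustration: score() uses the negative within-cluster sum of squares as a stand-in for the marginal m(Y | omega_k), and a uniform random relabelling stands in for the density g. The function names score and toy_search are hypothetical.

```r
# ASSUMPTION: stand-in objective (negative within-cluster sum of squares),
# NOT the marginal m(Y | omega_k) used by bayesclust.
score <- function(Y, labels) {
  total <- 0
  for (j in unique(labels)) {
    Yj <- Y[labels == j, , drop = FALSE]
    total <- total + sum(scale(Yj, center = TRUE, scale = FALSE)^2)
  }
  -total
}

toy_search <- function(Y, k = 2, nsim = 1000, aR = 0.4) {
  n <- nrow(Y)
  # Start from a random partition in which every cluster is non-empty
  current <- sample(rep_len(seq_len(k), n))
  cur_score <- score(Y, current)
  best <- current
  best_score <- cur_score
  for (i in seq_len(nsim)) {
    proposal <- current
    if (runif(1) < aR) {
      # Random-walk move: relabel one randomly chosen observation
      idx <- sample.int(n, 1)
      proposal[idx] <- sample.int(k, 1)
    } else {
      # ASSUMPTION: a uniform random relabelling stands in for the density g
      proposal <- sample.int(k, n, replace = TRUE)
    }
    # Skip proposals that leave a cluster empty
    if (length(unique(proposal)) < k) next
    new_score <- score(Y, proposal)
    # Metropolis acceptance step; both proposal types here are symmetric
    if (log(runif(1)) < new_score - cur_score) {
      current <- proposal
      cur_score <- new_score
    }
    # Track the best partition visited so far
    if (cur_score > best_score) {
      best <- current
      best_score <- cur_score
    }
  }
  list(labels = best, score = best_score)
}

# Two well-separated groups of six bivariate observations
set.seed(42)
Y <- rbind(matrix(rnorm(12, mean = -3), ncol = 2),
           matrix(rnorm(12, mean = 3), ncol = 2))
result <- toy_search(Y, k = 2, nsim = 2000)
table(result$labels)
```

The sketch keeps only the single best partition, whereas cluster.optimal retains the top keep partitions together with their marginal values.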

Since the sample space is so large, the algorithm is started at an intelligent starting point, by running kmeans, but with a crude imposition of the minimum cluster size.
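
One way such a starting point could be built is sketched below. It rests on two assumptions that the text above does not confirm: mcs is read as a proportion of the number of observations, and the repair rule (moving observations out of the largest cluster) is illustrative rather than the package's own procedure. The function name start_partition is hypothetical; kmeans is the standard function from the stats package.

```r
# Sketch: k-means start with a crude minimum-cluster-size repair.
# ASSUMPTIONS: mcs is a proportion of the n observations, and the repair
# rule below is illustrative only.
start_partition <- function(Y, k = 2, mcs = 0.2) {
  n <- nrow(Y)
  min_size <- ceiling(mcs * n)
  stopifnot(k * min_size <= n)   # otherwise no valid partition exists
  labels <- kmeans(Y, centers = k)$cluster
  repeat {
    sizes <- tabulate(labels, nbins = k)
    small <- which(sizes < min_size)
    if (length(small) == 0) break
    # Move one observation from the largest cluster into a too-small cluster
    donor <- which.max(sizes)
    idx <- which(labels == donor)
    move <- idx[sample.int(length(idx), 1)]
    labels[move] <- small[1]
  }
  labels
}

set.seed(7)
Y <- matrix(rnorm(24), nrow = 12)
init <- start_partition(Y, k = 2, mcs = 0.2)
tabulate(init, nbins = 2)   # cluster sizes after the repair step
```

The loop always terminates: whenever some cluster is below the minimum size, the largest cluster is strictly above it, so each move reduces the total shortfall by one.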

Value

The object returned is a list with two components.

param The purpose of this component is to store the parameters under which the algorithm was run.
data This component will contain a table of the best keep clusters and the computed value of the marginal corresponding to those clusters. The latter values are kept as a means of assessing the relative merit of the clusters in the table.

Author(s)

Gopal, V.

References

Fuentes, C. and Casella, G. (2008) "Testing for the Existence of Clusters" http://www.stat.ufl.edu/~casella/Papers/paper-v3.pdf

Gopal, V. "BayesClust User Manual" http://www.stat.ufl.edu/~viknesh/bayesclust/clust.html

See Also

plot.cluster.optimal to plot the clustered data.

Examples

# Generate random 2-variate data
Y <- matrix(rnorm(24), nrow=12)

# Search for optimal partitioning of data into 2 clusters
search1 <- cluster.optimal(Y, p=2, keep=5)

# Plot the best cluster found during search
plot(search1)

[Package bayesclust version 2.1 Index]