cluster.optimal {bayesclust} | R Documentation |
cluster.optimal searches for the optimal k-clustering of the dataset.
cluster.optimal(data, nsim = 1000, aR = 0.4, p = 2, k = 2, a = 2.01, b = 0.990099, tau2 = 1, keep = 4, mcs = 0.2, file = "", label = "data")
data |
data should be a matrix, even when the response is
univariate. The number of rows should equal the number of observations,
and the number of columns should equal p . |
nsim |
The algorithm is based on a stochastic search through the
partition space, and nsim is the number of points in the
partition space to inspect. It is recommended that nsim be at least
500,000, which is well above the default of 1000. |
aR |
The Metropolis search algorithm samples points from the space of
partitions according to a mixture of a density g and a
random walk. aR , which must be a value between 0 and 1, specifies the
proportion of iterations in which the random walk is chosen. Please see the references
below for further details on the density g . |
p |
The observations are assumed to come from a multivariate normal
distribution of dimension p . |
k |
k must take an integer value strictly greater than 1. It instructs the algorithm
to search for the optimal partition of the data into k clusters. |
a |
a is a hyperparameter for the prior on σ^2.
Further details can be found in the references below. |
b |
Like a , b is also a hyperparameter for the prior on σ^2.
Further details can be found in the references below. |
tau2 |
tau2 is a hyperparameter for the prior on the mean μ for
each cluster. |
keep |
This argument instructs the algorithm to store the best keep
clusterings that it finds during its run. By default, the best 4 clusterings found will
be kept. |
mcs |
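mcs stands for Minimum Cluster Size. It should be a value between 0
and 1, so the minimum size is presumably expressed as a proportion of the
observations. It instructs the algorithm to consider only partitions whose clusters
all reach this minimum size. |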
file |
This argument is a character string. If specified, the output object will
be saved to this (binary) file. It can be loaded, inspected and altered later in
subsequent R sessions using load . If left unspecified, the object will not
be saved to a file and could be lost on quitting the R session. |
label |
label serves to name the dataset in any given
hypothesis test. |
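The constraints stated for the arguments above can be checked before a long search is started. The following is a minimal sketch in base R; check_args is a hypothetical helper written for illustration and is not part of bayesclust, and the sample values are made up.

```r
# Hypothetical pre-flight check; not part of bayesclust
check_args <- function(data, p, k, aR, mcs) {
  stopifnot(
    is.matrix(data),              # data must be a matrix, even if univariate
    ncol(data) == p,              # one column per dimension
    k == round(k), k > 1,         # integer strictly greater than 1
    aR >= 0, aR <= 1,             # mixing proportion for the random walk
    mcs >= 0, mcs <= 1            # minimum cluster size, between 0 and 1
  )
  invisible(TRUE)
}

y <- c(2.1, 1.9, 2.4, 5.8, 6.1, 5.9, 2.0, 6.3, 2.2, 6.0)  # made-up univariate sample
Y <- matrix(y, ncol = 1)          # 10 observations in rows, p = 1 column

args_ok <- check_args(Y, p = 1, k = 2, aR = 0.4, mcs = 0.2)

# Passing the bare vector instead of a matrix fails the first check
vector_ok <- tryCatch(check_args(y, p = 1, k = 2, aR = 0.4, mcs = 0.2),
                      error = function(e) FALSE)
```

The point of the example is the reshaping step: a univariate response is wrapped with matrix(y, ncol = 1) so that rows are observations and the single column matches p = 1.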
A Metropolis search algorithm is run to maximise the marginal of Y, that is, m(Y | omega_k) where omega_k is a particular partitioning of the data into k clusters. The partitions that yield the highest marginal will be deemed to be optimal.
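To make the idea of a stochastic search over partitions concrete, here is a toy sketch in base R. It is not the package's implementation: the score function is a stand-in (negative within-cluster sum of squares) rather than the marginal m(Y | omega_k), draw_g is a uniform placeholder for the density g described in the references, and all helper names are hypothetical. It only illustrates mixing a random-walk move with a second proposal in proportion aR, applying a Metropolis acceptance step, and remembering the best partition seen.

```r
set.seed(7)
y <- c(rnorm(6, mean = 0), rnorm(6, mean = 5))  # toy univariate data
n <- length(y); k <- 2; aR <- 0.4; nsim <- 2000

# Stand-in score, NOT the marginal m(Y | omega_k):
# negative within-cluster sum of squares
score <- function(z) -sum(tapply(y, z, function(v) sum((v - mean(v))^2)))

# Random-walk move: reassign one randomly chosen observation to another cluster
random_walk <- function(z) {
  j <- sample.int(n, 1)
  others <- setdiff(seq_len(k), z[j])
  z[j] <- others[sample.int(length(others), 1)]
  z
}

# Placeholder for g: a uniform draw over labelings (the real g is in the references)
draw_g <- function() sample.int(k, n, replace = TRUE)

z <- rep(seq_len(k), length.out = n)   # arbitrary starting partition
start_score <- score(z)
best <- z; best_score <- start_score

for (i in seq_len(nsim)) {
  # With probability aR use the random walk, otherwise the placeholder for g
  cand <- if (runif(1) < aR) random_walk(z) else draw_g()
  if (length(unique(cand)) < k) next                       # skip empty clusters
  if (log(runif(1)) < score(cand) - score(z)) z <- cand    # Metropolis acceptance
  if (score(z) > best_score) { best <- z; best_score <- score(z) }
}
```

Both proposals in this sketch are symmetric, so the plain Metropolis ratio is enough; a non-uniform g would need the corresponding correction in the acceptance step.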
Since the sample space is so large, the algorithm is started at an intelligent starting point, by running kmeans, but with a crude imposition of the minimum cluster size.
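The starting point described above can be approximated in base R as follows. This is only a sketch under stated assumptions: the data are simulated, min_size is a hypothetical minimum cluster size expressed as a count of observations, and how the package actually enforces the minimum size is not documented here.

```r
set.seed(42)
# Simulated data: 12 observations, p = 2, two loosely separated groups
Y <- rbind(matrix(rnorm(12, mean = 0), ncol = 2),
           matrix(rnorm(12, mean = 4), ncol = 2))

k <- 2
min_size <- 3   # hypothetical minimum cluster size, in observations

# k-means from the stats package supplies an initial partition
start <- kmeans(Y, centers = k, nstart = 10)
sizes <- table(factor(start$cluster, levels = seq_len(k)))

# Crude check: does every cluster meet the minimum size?
ok <- all(sizes >= min_size)
```

If ok were FALSE, a starting partition would need to be adjusted (for example by moving observations into the undersized cluster) before the search begins.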
The object returned is a list consisting of 2 components.
param |
The purpose of this component is to store the parameters under which the algorithm was run. |
data |
This component will
contain a table of the best keep clusters and the computed value of the
marginal corresponding to those clusters. The latter values are kept as a means of assessing
the relative merit of the clusters in the table. |
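When the file argument is used, the saved object can be restored in a later session with load. The sketch below uses base R save and load on a stand-in list, because the name under which the package stores the object is not stated here; the list contents are illustrative only.

```r
# Stand-in for the object returned by cluster.optimal (contents are illustrative)
search1 <- list(param = list(k = 2, keep = 4), data = NULL)

path <- tempfile(fileext = ".RData")
save(search1, file = path)     # write a binary R data file

rm(search1)                    # simulate a fresh session
restored <- load(path)         # load() returns the names of the restored objects
restored_exists <- exists("search1")
```

Capturing the return value of load is a convenient way to find out which object names a file contains before inspecting them.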
Gopal, V.
Fuentes, C. and Casella, G. (2008) "Testing for the Existence of Clusters" http://www.stat.ufl.edu/~casella/Papers/paper-v3.pdf
Gopal, V. "BayesClust User Manual" http://www.stat.ufl.edu/~viknesh/bayesclust/clust.html
plot.cluster.optimal
to plot the clustered data.
# Generate random 2-variate data
Y <- matrix(rnorm(24), nrow=12)

# Search for optimal partitioning of data into 2 clusters
search1 <- cluster.optimal(Y, p=2, keep=5)

# Plot the best cluster found during search
plot(search1)