cluster.test {bayesclust} | R Documentation |
cluster.test
computes the empirical posterior probability (EPP) of the
null hypothesis in the following test:
H_0 : No clusters |
H_1 : k clusters |
where k takes an integer value strictly greater than 1.
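For orientation, a posterior probability of the null hypothesis and a Bayes factor are linked by a standard identity. The sketch below applies that identity in base R. The helper name `post_prob_null` and the default of equal prior probabilities for H_0 and H_1 are assumptions made for illustration; the source does not state which prior odds bayesclust uses.

```r
# Hypothetical helper (not part of bayesclust): convert a Bayes factor in
# favour of H_0 (bf01) into the posterior probability of H_0.
# ASSUMPTION: prior_null = 0.5, i.e. equal prior weight on H_0 and H_1.
post_prob_null <- function(bf01, prior_null = 0.5) {
  stopifnot(bf01 >= 0, prior_null > 0, prior_null < 1)
  prior_odds <- prior_null / (1 - prior_null)
  post_odds <- bf01 * prior_odds        # posterior odds = BF x prior odds
  1 / (1 + 1 / post_odds)               # odds -> probability
}

post_prob_null(1)    # evidence is balanced
post_prob_null(3)    # data favour H_0
post_prob_null(1/3)  # data favour H_1
```

A Bayes factor of 1 leaves the prior probability unchanged, which is a quick sanity check on the direction of the ratio.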
cluster.test(data, nsim = 1000, aR = 0.4, p = 2, k = 2, a = 2.01, b = 0.990099, tau2 = 1, mcs = 0.2, file = "", label = "data")
data |
data should be a matrix, even when the response is
univariate. The number of rows should equal the number of observations,
and the number of columns should equal p. |
nsim |
As the Bayes Factor (BF) for the hypothesis is computed
with the aid of a Metropolis-Hastings (MH) MCMC algorithm, the number of
simulations has to be specified. It is recommended that nsim
be at least 500,000, and that replications be carried out in order to
monitor convergence. |
aR |
The candidate distribution in the MH algorithm is
a mixture of g and a random walk. aR, which must be a value
between 0 and 1, specifies the proportion of the time that the random walk is chosen.
Please see the references below for further details on g. |
p |
The observations are assumed to come from a multivariate normal
distribution of dimension p. |
k |
k must take an integer value strictly greater than 1. It specifies
the alternative hypothesis in the test. |
a |
a is a hyperparameter for the prior on σ^2.
Further details can be found in the references below. |
b |
Like a, b is a hyperparameter for the prior on σ^2.
Further details can be found in the references below. |
tau2 |
tau2 is a hyperparameter for the prior on the mean μ for
each cluster. |
mcs |
mcs stands for Minimum Cluster Size. It should be a value between 0
and 1. It instructs the test procedure to only consider clusters of a certain minimum
size. |
file |
This argument is a character string. If specified, the output object will
be saved to this (binary) file. It can be loaded, inspected and altered in
subsequent R sessions using load. If left unspecified, the object will not
be saved to a file and could be lost on quitting the R session. |
label |
label serves to name the dataset in any given
hypothesis test. |
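The requirement that data be a matrix with one row per observation and p columns can be illustrated in base R. The sketch below shapes a univariate vector and a data frame into suitable matrices. The helper `check_data` is hypothetical and only mirrors the conditions stated above; it is not a function provided by bayesclust.

```r
# Univariate response: a plain vector must be turned into a one-column matrix.
y <- c(1.2, 0.7, 3.4, 2.9, 1.1)
Y1 <- matrix(y, ncol = 1)
dim(Y1)  # 5 1

# Bivariate response held in a data frame: convert with as.matrix().
df <- data.frame(x1 = c(0.1, 0.4, 0.3), x2 = c(2.0, 1.5, 1.8))
Y2 <- as.matrix(df)

# Hypothetical pre-flight check (an assumption, not the package's own code):
# numeric matrix whose column count matches p.
check_data <- function(data, p) {
  is.matrix(data) && is.numeric(data) && ncol(data) == p
}

check_data(Y1, p = 1)  # TRUE
check_data(Y2, p = 2)  # TRUE
check_data(y, p = 1)   # FALSE: a bare vector is not a matrix
```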
Since the hypothesis test is carried out in a Bayesian framework, the
Bayes Factor has to be calculated. As this involves an integral over a huge space,
it is estimated using MCMC. Certain portions of cluster.test
have been coded in C in order to speed up the simulations.
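The role of aR in a mixture candidate distribution can be sketched with a toy Metropolis-Hastings sampler. This is an illustration only: the target is a standard normal on the real line rather than the space explored by cluster.test, the independence proposal `log_g` is a stand-in for the g described in the references, and the function name `mh_mixture` and its `rw_sd` and `g_sd` arguments are hypothetical.

```r
# Toy MH sampler whose candidate is a mixture of a random walk (chosen with
# probability aR) and an independence proposal (a stand-in for g).
# ASSUMPTIONS: standard normal target, normal proposals; not bayesclust code.
mh_mixture <- function(nsim, aR = 0.4, rw_sd = 0.5, g_sd = 2) {
  stopifnot(nsim >= 1, aR >= 0, aR <= 1)
  log_target <- function(x) dnorm(x, log = TRUE)
  log_g <- function(x) dnorm(x, sd = g_sd, log = TRUE)
  chain <- numeric(nsim)
  x <- 0
  for (i in seq_len(nsim)) {
    if (runif(1) < aR) {
      # Random-walk move: symmetric, so the proposal densities cancel.
      y <- x + rnorm(1, sd = rw_sd)
      log_ratio <- log_target(y) - log_target(x)
    } else {
      # Independence move: the ratio includes the proposal density g.
      y <- rnorm(1, sd = g_sd)
      log_ratio <- log_target(y) + log_g(x) - log_target(x) - log_g(y)
    }
    if (log(runif(1)) < log_ratio) x <- y  # accept, otherwise stay put
    chain[i] <- x
  }
  chain
}

set.seed(1)
draws <- mh_mixture(20000)
c(mean = mean(draws), sd = sd(draws))
```

Each move type uses its own acceptance ratio, so the chain is a mixture of two valid MH kernels. Setting aR closer to 1 makes the sampler rely more on local random-walk steps.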
The output from this function is a list object consisting of three components. It will
be assigned S3 class "cluster.test", and can then serve as input to plot
or emp2pval. The components of this list object are:
param |
This component exists purely for bookkeeping purposes, in
the sense that these parameters, under which the test was run, will be checked against
the parameters used to generate the null distribution of the test statistic. By default,
conversion of the empirical posterior probability to a p-value will only proceed if the
parameters match. The user does have the option of ignoring this check though. See
emp2pval for further details. |
iterations |
This is a vector of indices that will be used to plot the
running posterior probabilities when plot is
called. Since it is superfluous to keep the entire chain, the concept of 'thinning'
is applied: only every 500th iteration is stored, counting backwards from
the most recent iteration. |
ClusterStat |
This is a 'thinned' vector of running posterior probabilities.
The values that are kept correspond
exactly to those in the preceding iterations component. When
summary.cluster.test is called, the final posterior probability is extracted and
printed in a readable format. |
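The thinning rule described for the iterations and ClusterStat components can be sketched in base R. The chain below is simulated and purely hypothetical (a running proportion standing in for the running posterior probability), and the variable names other than iterations and ClusterStat are assumptions; the package's internal storage may differ.

```r
nsim <- 2200   # hypothetical chain length
thin <- 500    # keep every 500th iteration, as described above

# Count backwards from the most recent iteration, then restore ascending order.
iterations <- rev(seq(from = nsim, to = 1, by = -thin))
iterations  # 200 700 1200 1700 2200

# Stand-in for a running posterior probability: a running proportion.
set.seed(42)
full_chain <- cumsum(rbinom(nsim, size = 1, prob = 0.3)) / seq_len(nsim)

# Thinned values line up one-to-one with the stored indices.
ClusterStat <- full_chain[iterations]

# The final value is the last stored element, i.e. the value at iteration nsim.
final_value <- tail(ClusterStat, 1)
```

Because counting starts from the last iteration, the most recent value is always retained even when nsim is not a multiple of 500.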
Gopal, V.
Fuentes, C. and Casella, G. (2008) "Testing for the Existence of Clusters" http://www.stat.ufl.edu/~casella/Papers/paper-v3.pdf
Gopal, V. "BayesClust User Manual" http://www.stat.ufl.edu/~viknesh/bayesclust/clust.html
plot.cluster.test
to monitor convergence of computation of
posterior probability.
summary.cluster.test
to display the computed final posterior probabilities
for each dataset run.
# Generate random 2-variate data
Y <- matrix(rnorm(24), nrow=12)

# Search for optimal partitioning of data into 2 clusters
test1 <- cluster.test(Y, p=2)

# Plot the running posterior probabilities to monitor convergence
plot(test1)

# Generate corresponding null density object.
null1 <- nulldensity(nsim=100, n=12, p=2, k=2)

# Convert EPP to p-value
emp2pval(test1, null1)