cluster.test {bayesclust}    R Documentation

Compute Posterior Probabilities for Dataset

Description

cluster.test computes the empirical posterior probability (EPP) of the null hypothesis in the following test:
H_0 : No clusters
H_1 : k clusters

where k takes an integer value strictly greater than 1.

Usage

cluster.test(data, nsim = 1000, aR = 0.4, p = 2, k = 2,
        a = 2.01, b = 0.990099, tau2 = 1, mcs = 0.2, file = "", label = "data")

Arguments

data data should be a matrix, even when the response is univariate. The number of rows should equal the number of observations, and the number of columns should equal p.
nsim As the Bayes Factor (BF) for the hypothesis is computed with the aid of a Metropolis-Hastings (MH) MCMC algorithm, the number of simulations has to be specified. It is recommended that nsim be at least 500,000, and that replications be carried out in order to monitor convergence.
aR The candidate distribution in the MH algorithm is a mixture of g and a random walk. aR, which must be a value between 0 and 1, specifies the proportion of the time that the random walk is chosen. Please see the references below for further details on g.
p The observations are assumed to come from a p-dimensional multivariate normal distribution.
k k must take an integer value strictly greater than 1. It specifies the alternative hypothesis in the test.
a a is a hyperparameter for the prior on σ^2. Further details can be found in the references below.
b Like a, b is also a hyperparameter for the prior on σ^2. Further details can be found in the references below.
tau2 tau2 is a hyperparameter for the prior on the mean μ for each cluster.
mcs mcs stands for Minimum Cluster Size. It should be a value between 0 and 1. It instructs the test procedure to only consider clusters of a certain minimum size.
file This argument is a character string. If specified, the output object will be saved to this (binary) file. It can be loaded, inspected and altered in subsequent R sessions using load. If left unspecified, the object will not be saved to a file and could be lost on quitting the R session.
label label serves to name the dataset in any given hypothesis test.
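
As a small illustration of the data requirement above, the sketch below shapes a univariate response into a one-column matrix using base R only. The variable names are illustrative, and the commented-out call with p = 1 is inferred from the statement that univariate responses are supported; it is not run here.

```r
set.seed(123)

# Univariate response stored as a plain numeric vector
y <- rnorm(12)

# Reshape to a 12 x 1 matrix: rows are observations, columns equal p
Y <- matrix(y, ncol = 1)

# Basic shape checks before passing the data on
stopifnot(is.matrix(Y), nrow(Y) == length(y), ncol(Y) == 1)

# With bayesclust loaded, the call would then look like this (assumed, not run):
# test_uni <- cluster.test(Y, p = 1)
```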

Details

Since the hypothesis test is carried out in a Bayesian framework, the Bayes Factor has to be calculated. As this requires a summation over a huge space, it is estimated using MCMC. Certain portions of cluster.test have been coded in C in order to speed up the simulations.
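
The mixture candidate described for the aR argument can be illustrated with a toy sampler. The sketch below is not the package's algorithm: it targets a standard normal on the real line rather than the space explored by cluster.test, and the function name mh_mixture, the candidate g = N(0, 2^2) and the step size rw_sd are all assumptions made for illustration. At each step a random-walk move is chosen with probability aR, and an independent draw from g otherwise, each with its own Metropolis-Hastings acceptance ratio.

```r
set.seed(1)

# Toy target and independent candidate g (both hypothetical stand-ins)
log_target <- function(x) dnorm(x, mean = 0, sd = 1, log = TRUE)
log_g      <- function(x) dnorm(x, mean = 0, sd = 2, log = TRUE)

mh_mixture <- function(nsim, aR = 0.4, rw_sd = 0.5) {
  draws <- numeric(nsim)
  cur <- 0
  for (i in seq_len(nsim)) {
    if (runif(1) < aR) {
      # Random-walk move: the proposal is symmetric, so only the target appears
      prop <- cur + rnorm(1, sd = rw_sd)
      log_ratio <- log_target(prop) - log_target(cur)
    } else {
      # Independent move from g: the ratio carries the g correction term
      prop <- rnorm(1, mean = 0, sd = 2)
      log_ratio <- log_target(prop) - log_target(cur) +
        log_g(cur) - log_g(prop)
    }
    # Accept or reject on the log scale
    if (log(runif(1)) < log_ratio) cur <- prop
    draws[i] <- cur
  }
  draws
}

draws <- mh_mixture(20000)
```

Comparing mean(draws) and sd(draws) across several seeds gives a rough feel for why replications are recommended when monitoring convergence.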

Value

The output from this function is a list object consisting of three components. It will be assigned S3 class "cluster.test", and can then serve as input to plot or emp2pval. The components of this list object are:

param This component exists purely for bookkeeping purposes, in the sense that these parameters, under which the test was run, will be checked against the parameters used to generate the null distribution of the test statistic. By default, conversion of the empirical posterior probability to a p-value will only proceed if the parameters match. The user does have the option of ignoring this check though. See emp2pval for further details.
iterations This is a vector of indices that will be used to plot the running posterior probabilities when plot is called. Since it is superfluous to keep the entire chain, the concept of 'thinning' is applied: only every 500th iteration is stored, counting backwards from the most recent iteration.
ClusterStat This is a 'thinned' vector of running posterior probabilities. The values that are kept correspond exactly to those in the preceding iterations component. When summary.cluster.test is called, the final posterior probability is extracted and printed in a readable format.
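
The thinning described for iterations and ClusterStat can be sketched in base R. This is an illustration of the idea only, not the package's internal code: the 0/1 indicator chain and its running mean are stand-ins for the real running posterior probabilities, and nsim is kept small for readability.

```r
set.seed(42)

nsim <- 2200   # toy chain length
thin <- 500    # keep every 500th iteration

# Count backwards from the most recent iteration, then sort ascending
iterations <- rev(seq(from = nsim, to = 1, by = -thin))

# Stand-in chain output and its running mean (assumed form, for illustration)
indicator <- rbinom(nsim, size = 1, prob = 0.3)
running   <- cumsum(indicator) / seq_len(nsim)

# Thinned values line up one-to-one with the stored indices
ClusterStat <- running[iterations]

# The final value is the last stored element
final_value <- tail(ClusterStat, 1)
```

Because counting starts from the most recent iteration, the last stored index always equals nsim, so the final value is never lost to thinning.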

Author(s)

Gopal, V.

References

Fuentes, C. and Casella, G. (2008) "Testing for the Existence of Clusters" http://www.stat.ufl.edu/~casella/Papers/paper-v3.pdf

Gopal, V. "BayesClust User Manual" http://www.stat.ufl.edu/~viknesh/bayesclust/clust.html

See Also

plot.cluster.test to monitor convergence of computation of posterior probability.

summary.cluster.test to display the computed final posterior probabilities for each dataset run.

Examples

# Generate random 2-variate data
Y <- matrix(rnorm(24), nrow=12)

# Test H_0 (no clusters) against H_1 (k = 2 clusters)
test1 <- cluster.test(Y, p=2)

# Plot the running posterior probabilities to monitor convergence
plot(test1)

# Generate corresponding null density object.
null1 <- nulldensity(nsim=100, n=12, p=2, k=2)

# Convert EPP to p-value
emp2pval(test1, null1)

[Package bayesclust version 2.1 Index]