bclust {bclust}    R Documentation

Bayesian agglomerative clustering for high dimensional data with variable selection.

Description

The function clusters data stored in a matrix using an additive linear model with disappearing random effects. The model has built-in spike-and-slab components that quantify how important each variable is for the clustering; these importances can be extracted with the imp function.

Usage

bclust(x,rep.id=1:nrow(x),effect.family="gaussian",
var.select=TRUE,transformed.par,labels=NULL)

Arguments

x A numeric matrix, with clustering individuals in rows and the variables in columns.
rep.id A vector of positive integers whose length equals the number of rows of x. It identifies the replicates of each clustering type, so that the total number of clustering types is max(rep.id). By default the data are assumed to be unreplicated, that is, each row of x is its own clustering type.
effect.family Distribution family of the disappearing random components. Choices are "gaussian" or "alaplace", for the Gaussian or the asymmetric Laplace family, respectively.
var.select A logical value. TRUE fits models that place a spike-and-slab distribution at the variable level, which allows Bayesian variable selection.
transformed.par A vector of transformed model parameters. Its length depends on the chosen model and on whether variable selection is used. The variance parameters are expected on the log scale, the mean on the identity scale, and the proportions on the logit scale. The function loglikelihood can be used to estimate them from the data.
labels A vector of strings giving the labels of the clustering types. Its length should equal max(rep.id). The first element is the label of the type with the smallest integer value in rep.id, the second element is the label of the type with the second smallest integer in rep.id, and so on.
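The transformations described for transformed.par can be sketched with base R functions alone. The parameter names, values, and ordering below are hypothetical and serve only to illustrate the log, identity, and logit scales; they are not the layout that bclust expects, which depends on the chosen model.

```r
# Hypothetical natural-scale values; the names and their order are
# illustrative only and are NOT the order bclust expects.
variances   <- c(between.type = 0.16, between.replicate = 0.37)
mean.value  <- 0.08
proportions <- c(p = 0.46, q = 0.16)

# Forward transformations described in the argument list.
transformed <- c(log(variances),       # log for the variance parameters
                 mean.value,           # identity for the mean
                 qlogis(proportions))  # logit for the proportions

# Back-transformations recover the natural scale.
back.variances   <- exp(transformed[1:2])
back.proportions <- plogis(transformed[4:5])
```

Here qlogis and plogis are the logit and inverse logit from the stats package, so no additional package is needed.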

Details

The function calls internal C functions that depend on the chosen model. The C stack may overflow on a large dataset, in which case the stack size must be increased from the operating system command line before starting R. On Linux, open a console, type ulimit -s unlimited, and then start R in the same console. Microsoft Windows users do not need to increase the stack size.
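To see the stack limit that the current R session is running with, base R provides Cstack_info(). The short check below is a sketch for inspection only; it does not change the limit, which still has to be set in the shell before R starts.

```r
# Cstack_info() is part of base R. Sizes are reported in bytes,
# or as NA where the platform does not expose them.
stack <- Cstack_info()
print(stack[c("size", "current")])
```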

The assumed Bayesian linear model for clustering is

y_{vctr}=m+h_{vct}+d_{v}*g_{vc}*t_{vc}+e_{vctr}

where y_{vctr} is the observed value for variable v, cluster c, clustering type t, and replicate r; m is the mean; h_{vct} is the between-type error; t_{vc} is the disappearing random component, controlled by the Bernoulli variables d_{v} with success probability q and g_{vc} with success probability p; and e_{vctr} is the between-replicate error. The types inside a cluster share the same t_{vc} but may have different h_{vct}. For more details, see the package website http://bclust.probstat.ch and the documents available there.
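As an illustration of how the terms of the model combine, the following sketch simulates a small dataset from it using base R only. All dimensions and parameter values are hypothetical, and the error terms and t_{vc} are assumed to be zero-mean Gaussian (the "gaussian" effect family); this is not the package's own code.

```r
set.seed(1)

# Hypothetical dimensions and parameter values, chosen for illustration only.
n.var <- 5; n.clust <- 2; types.per.clust <- 3; n.rep <- 4
m <- 0; sd.h <- 0.4; sd.t <- 2; sd.e <- 0.3
q <- 0.5   # success probability of d_v
p <- 0.6   # success probability of g_vc

n.type <- n.clust * types.per.clust
cluster.of.type <- rep(seq_len(n.clust), each = types.per.clust)

d     <- rbinom(n.var, 1, q)                                      # d_v
g     <- matrix(rbinom(n.var * n.clust, 1, p), n.var, n.clust)    # g_vc
t.eff <- matrix(rnorm(n.var * n.clust, 0, sd.t), n.var, n.clust)  # t_vc
h     <- matrix(rnorm(n.var * n.type, 0, sd.h), n.var, n.type)    # h_vct

# Type-level mean: m + h_vct + d_v * g_vc * t_vc
# (d is recycled down the columns, so row v is multiplied by d_v).
type.mean <- m + h + (d * g * t.eff)[, cluster.of.type]

# One row per (type, replicate), variables in columns, plus e_vctr.
rep.id <- rep(seq_len(n.type), each = n.rep)
x <- t(type.mean[, rep.id]) +
  matrix(rnorm(length(rep.id) * n.var, 0, sd.e), ncol = n.var)
```

The resulting x and rep.id have the layout described in the Arguments section: individuals in rows, variables in columns, and one integer per row identifying the clustering type.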

Value

data The data matrix, reordered according to rep.id.
repno The number of replicates for each value of rep.id.
merge The merge matrix, in hclust object format.
height A monotone vector of the heights of the constructed tree.
logposterior The log posterior for each merge.
clust.number The number of clusters for each merge.
cut The height at which the log posterior reaches its maximum along the agglomerative path.
transformed.par The transformed values of the model parameters. The log transformation is applied for the variance parameters, the identity for the mean, and the logit for the proportions.
labels The labels associated with each clustering type.
effect.family The distribution assigned to the disappearing random effect in the function arguments.
var.select The variable selection chosen in the function arguments.
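The relationship between height, logposterior, clust.number, and cut can be sketched on a toy list. The numbers are made up, the list is only a stand-in for a real bclust result, and the element-by-element alignment of the three vectors is an assumption based on the descriptions above.

```r
# Toy stand-in for a bclust result; all numbers are made up.
fit <- list(
  height       = c(0.5, 1.2, 2.0, 3.1),
  logposterior = c(-120.4, -118.9, -119.6, -125.0),
  clust.number = c(4, 3, 2, 1)
)

best       <- which.max(fit$logposterior)  # merge with the highest log posterior
cut.height <- fit$height[best]             # height at that merge
n.clusters <- fit$clust.number[best]       # number of clusters at that merge
```

On a real result, cut.height computed this way would be expected to agree with the cut element.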

See Also

loglikelihood, meancss, imp.

Examples

data(gaelle)

# unreplicated clustering
gaelle.bclust<-bclust(x=gaelle,transformed.par=c(-1.84,-0.99,1.63,0.08,-0.16,-1.68)) 
par(mfrow=c(2,1))
plot(as.dendrogram(gaelle.bclust))
abline(h=gaelle.bclust$cut)
plot(gaelle.bclust$clust.number,gaelle.bclust$logposterior,xlab="Number of clusters",ylab="Log posterior",type="b")
abline(h=max(gaelle.bclust$logposterior))

#replicated clustering
gaelle.id<-rep(1:14,c(3,rep(4,13))) # ColWT has 3 replicate rows, each of the other 13 types has 4
gaelle.lab<-c("ColWT","d172","d263","isa2",
"sex4","dpe2","mex1","sex3","pgm","sex1","WsWT","tpt","RLDWT","ke103")

gaelle.bclust<-bclust(gaelle,rep.id=gaelle.id,labels=gaelle.lab,transformed.par=c(-1.84,-0.99,1.63,0.08,-0.16,-1.68))
plot(as.dendrogram(gaelle.bclust))
abline(h=gaelle.bclust$cut)
plot(gaelle.bclust$clust.number,gaelle.bclust$logposterior,xlab="Number of clusters",ylab="Log posterior",type="b")
abline(h=max(gaelle.bclust$logposterior))

[Package bclust version 1.1 Index]