HierarchicalSparseCluster {sparcl}R Documentation

Hierarchical sparse clustering

Description

Performs sparse hierarchical clustering. If $d_ii'j$ is the dissimilarity between observations i and i' for feature j, seek a sparse weight vector w and then use $(sum_j (d_ii'j w_j))_ii'$ as a nxn dissimilarity matrix for hierarchical clustering.

Usage

HierarchicalSparseCluster(x=NULL, dists=NULL,method=c("average","complete", "single","centroid"),
wbound=NULL,niter=15,dissimilarity=c("squared.distance","absolute.value"), uorth=NULL,silent=FALSE,
cluster.features=FALSE,method.features=c("average", "complete",
"single","centroid"),output.cluster.files=FALSE,
outputfile.prefix="output",genenames=NULL,genedesc=NULL,standardize.arrays=FALSE)

Arguments

x A nxp data matrix; n is the number of observations and p the number of features. If NULL, then specify dists instead.
dists For advanced users, can be entered instead of x. If HierarchicalSparseCluster has already been run on this data, then the dists value of the previous output can be entered here. Under normal circumstances, leave this argument NULL and pass in x instead.
method The type of linkage to use in the hierarchical clustering - "single", "complete", "centroid", or "average".
wbound The L1 bound on w to use; this is the tuning parameter for sparse hierarchical clustering.
niter The number of iterations to perform in the sparse hierarchical clustering algorithm.
dissimilarity The type of dissimilarity measure to use. One of "squared.distance" or "absolute.value". Only use this if x was passed in (rather than dists).
uorth If complementary sparse clustering is desired, then this is the nxn dissimilarity matrix obtained in the original sparse clustering.
standardize.arrays Should the arrays be standardized? Default is FALSE.
silent Print out progress?
cluster.features Not for use.
method.features Not for use.
output.cluster.files Not for use.
outputfile.prefix Not for use.
genenames Not for use.
genedesc Not for use.

Details

We seek a p-vector of weights w (one per feature) and a nxn matrix U that optimize

$maximize_U,w sum_j w_j sum_ii' d_ii'j U_ii'$ subject to $||w||_2 <= 1, ||w||_1 <= wbound, w_j >= 0, sum_ii' U_ii'^2 <= 1$.

Here, $d_ii'j$ is the dissimilarity between observations i and i' with along feature j. The resulting matrix U is used as a dissimilarity matrix for hierarchical clustering. "wbound" is a tuning parameter for this method, which controls the L1 bound on w, and as a result the number of features with non-zero $w_j$ weights. The non-zero elements of w indicate features that are used in the sparse clustering.

We optimize the above criterion with an iterative approach: hold U fixed and optimize with respect to w. Then, hold w fixed and optimize with respect to U.

Note that the arguments described as "Not for use" are included for the sparcl package to function with GenePattern but should be ignored by the R user.

Value

hc The output of a call to "hclust", giving the results of hierarchical sparse clustering.
ws The p-vector of feature weights.
u The nxn dissimilarity matrix passed into hclust, of the form $(sum_j w_j d_ii'j)_ii'$.
dists The (n*n)xp dissimilarity matrix for the data matrix x. This is useful if additional calls to HierarchicalSparseCluster will be made.

Author(s)

Daniela M. Witten and Robert Tibshirani

References

Witten and Tibshirani (2009) A framework for feature selection in clustering.

See Also

HierarchicalSparseCluster.permute,KMeansSparseCluster,KMeansSparseCluster.permute

Examples

  # Generate 2-class data
  set.seed(1)
  x <- matrix(rnorm(100*200),ncol=200)
  y <- c(rep(1,50),rep(2,50))
  x[y==1,1:25] <- x[y==1,1:25]+2
  # Do tuning parameter selection for sparse hierarchical clustering
  perm.out <- HierarchicalSparseCluster.permute(x, wbounds=c(1.5,2:9),
nperms=5)
  print(perm.out)
  plot(perm.out)
  # Perform sparse hierarchical clustering
  sparsehc <- HierarchicalSparseCluster(dists=perm.out$dists,
wbound=perm.out$bestw, method="complete")
  # faster than   sparsehc <- HierarchicalSparseCluster(x=x,wbound=perm.out$bestw, method="complete")
  par(mfrow=c(1,2))
  plot(sparsehc)
  plot(sparsehc$hc, labels=rep("", length(y)))
  print(sparsehc)
  # Plot using knowledge of class labels in order to compare true class
  #   labels to clustering obtained
  par(mfrow=c(1,1))
  ColorDendrogram(sparsehc$hc,y=y,main="My Simulated Data",branchlength=.007)
  # Now, what if we want to see if out data contains a *secondary*
  #   clustering after accounting for the first one obtained. We
  #   look for a complementary sparse clustering:
  sparsehc.comp <- HierarchicalSparseCluster(x,wbound=perm.out$bestw,
     method="complete",uorth=sparsehc$u)
  # Redo the analysis, but this time use "absolute value" dissimilarity:
  perm.out <- HierarchicalSparseCluster.permute(x, wbounds=c(1.5,2:9),
    nperms=5, dissimilarity="absolute.value")
  print(perm.out)
  plot(perm.out)
  # Perform sparse hierarchical clustering
  sparsehc <- HierarchicalSparseCluster(dists=perm.out$dists, wbound=perm.out$bestw, method="complete", dissimilarity="absolute.value")
  par(mfrow=c(1,2))
  plot(sparsehc)

[Package sparcl version 1.0 Index]