clValid {clValid} | R Documentation |
clValid reports validation measures for clustering results. The function returns an object of class "clValid", which contains the clustering results in addition to the validation measures. The validation measures fall into three general categories: "internal", "stability", and "biological".
clValid(obj, nClust, clMethods = "hierarchical", validation = "stability",
        maxitems = 600, metric = "euclidean", method = "average",
        neighbSize = 10, annotation = "entrezgene", GOcategory = "all",
        goTermFreq = 0.05, ...)
obj |
Either a numeric matrix, a data frame, or an ExpressionSet object. All columns of a data frame must be numeric. In all cases, the rows are the items to be clustered (e.g., genes), and the columns are the samples. |
nClust |
A numeric vector giving the numbers of clusters to be evaluated; e.g., 4:6 evaluates clusterings with 4, 5, and 6 clusters. |
clMethods |
A character vector giving the clustering methods. Available options are "hierarchical", "kmeans", "diana", "fanny", "som", "model", "sota", "pam", "clara", and "agnes", with multiple choices allowed. |
validation |
A character vector giving the type of validation measures to use. Available options are "internal", "stability", and "biological", with multiple choices allowed. |
maxitems |
The maximum number of items (rows of the matrix) that can be clustered. |
metric |
The metric used to determine the distance matrix. Possible choices are "euclidean", "correlation", and "manhattan". |
method |
For hierarchical clustering (hclust and agnes), the agglomeration method used. Available choices are "ward", "single", "complete", and "average". |
neighbSize |
For internal validation, an integer giving the neighborhood size used for the connectivity measure. |
annotation |
For biological validation, either a character string naming the Bioconductor annotation package for mapping genes to GO categories, or a list with the names of the functional classes and the observations belonging to each class. |
GOcategory |
For biological validation, the GO categories to use. Can be one of "BP", "MF", "CC", or "all". |
goTermFreq |
For the BSI, the threshold frequency of GO terms to use for functional annotation. |
... |
Additional arguments to pass to the clustering functions. |
This function calculates validation measures for a given set of clustering algorithms and numbers of clusters. A variety of clustering algorithms are available, including hierarchical, self-organizing maps (SOM), K-means, self-organizing tree algorithm (SOTA), and model-based. The available validation measures fall into the three general categories of "internal", "stability", and "biological". A brief description of each measure is given below; for further details, refer to the package vignette and the references.
For the connectivity measure used in internal validation, the neighbSize argument specifies the number of neighbors to use. The connectivity has a value between 0 and infinity and should be minimized.
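As a rough illustration of how a neighbor-based connectivity score behaves, the sketch below follows the definition given in Handl et al. (2005) as we understand it: for every item, the j-th nearest neighbor contributes a penalty of 1/j when it lies in a different cluster. The helper name connectivity_sketch and the toy data are hypothetical and not part of clValid; for real analyses the package's own connectivity function should be preferred.

```r
# Hypothetical helper (NOT part of clValid): connectivity of a partition,
# assuming Euclidean distances between the rows of x.
connectivity_sketch <- function(x, clusters, neighbSize = 10) {
  d <- as.matrix(dist(x))          # pairwise Euclidean distances between rows
  n <- nrow(d)
  L <- min(neighbSize, n - 1)      # an item has at most n - 1 neighbors
  total <- 0
  for (i in seq_len(n)) {
    nn <- order(d[i, ])            # items sorted by distance from item i
    nn <- nn[nn != i][seq_len(L)]  # drop item i itself, keep the L nearest
    # add 1/j whenever the j-th nearest neighbor is in a different cluster
    total <- total + sum((clusters[nn] != clusters[i]) / seq_len(L))
  }
  total
}

# Toy data: two well-separated groups of five items each
x <- rbind(cbind(0:4, 0), cbind(100:104, 0))
cl_good <- rep(1:2, each = 5)      # partition matching the two groups
cl_bad  <- rep(1:2, times = 5)     # partition that alternates between groups

connectivity_sketch(x, cl_good, neighbSize = 3)
connectivity_sketch(x, cl_bad,  neighbSize = 3)
```

The partition that respects the two groups incurs no penalty, while the alternating partition is penalized, which is why smaller values are preferred.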
Both the Silhouette Width and the Dunn Index combine measures of compactness and separation of the clusters. The Silhouette Width is the average of each observation's Silhouette value. The Silhouette value measures the degree of confidence in a particular clustering assignment and lies in the interval [-1,1], with well-clustered observations having values near 1 and poorly clustered observations having values near -1. See the silhouette function in package cluster for more details. The Dunn Index is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. It has a value between 0 and infinity and should be maximized.

For biological validation, the user has two options. The first option is to explicitly specify the functional clustering of the genes via a named list. Each item in the list corresponds to a functional class and contains the genes associated with that function. The second option is to specify the appropriate annotation package from Bioconductor (http://www.bioconductor.org) and use GO terms to determine the functional classes of the genes. The second option requires the Biobase, annotate, and GO packages from Bioconductor, in addition to the annotation package for the particular data type (these are not loaded automatically when clValid is loaded).
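The first option, a named list of functional classes, might be built along the following lines. The gene identifiers and class labels here are invented purely for illustration; in practice the identifiers presumably need to match the row names of obj, as in the package example that sets rownames(express) before clustering.

```r
# Hypothetical gene identifiers and functional class labels (illustration only)
genes   <- c("g1", "g2", "g3", "g4", "g5", "g6")
classes <- c("Growth", "Growth", "Transport", "Transport", "Unknown", "Growth")

# One list element per functional class, holding the IDs of its genes
fc <- split(genes, classes)

# Drop classes that carry no functional information
fc <- fc[setdiff(names(fc), "Unknown")]

str(fc)
```

A list of this shape is what the annotation argument accepts in place of the name of a Bioconductor annotation package.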
The GOcategory options are "MF", "BP", "CC", or "all", corresponding to molecular function, biological process, cellular component, and all of the ontologies.
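Returning to the internal measures, the Dunn Index described above can be sketched in a few lines of base R. This is a minimal illustration assuming Euclidean distances and at least two clusters; the helper name dunn_sketch is hypothetical, and the package's own dunn function should be used for actual validation.

```r
# Hypothetical helper (NOT part of clValid): Dunn Index of a partition,
# assuming Euclidean distances between the rows of x and >= 2 clusters.
dunn_sketch <- function(x, clusters) {
  d <- as.matrix(dist(x))                   # pairwise distances between rows
  same <- outer(clusters, clusters, "==")   # TRUE where two items share a cluster
  min_between <- min(d[!same])              # smallest distance across clusters
  max_within  <- max(d[same])               # largest distance inside any cluster
  min_between / max_within
}

# Toy data: two well-separated groups of five items each
x <- rbind(cbind(0:4, 0), cbind(100:104, 0))
dunn_sketch(x, rep(1:2, each = 5))    # partition matching the two groups
dunn_sketch(x, rep(1:2, times = 5))   # partition that alternates between groups
```

The partition that matches the two groups yields a much larger ratio than the alternating one, consistent with the index being maximized.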
clValid returns an object of class "clValid". See the help file for the class description.
Unless the list of genes corresponding to functional classes is prespecified, biological clustering validation requires the Biobase, annotate, and GO packages from Bioconductor, plus the annotation package for your particular data type. Please see http://www.bioconductor.org for installation instructions.
Further details of the validation measures and instructions for use can be found in the package vignette.
Guy Brock, Vasyl Pihur, Susmita Datta, Somnath Datta
Datta, S. and Datta, S. (2003). Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19(4): 459-466.
Datta, S. and Datta, S. (2006). Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes. BMC Bioinformatics 7:397.
Handl, J., Knowles, J., and Kell, D. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics 21(15): 3201-3212.
For a description of the class 'clValid' and all available methods see clValidObj or clValid-class.
For help on the clustering methods see hclust and kmeans in package stats, agnes, clara, diana, fanny, and pam in package cluster, som in package kohonen, Mclust in package mclust, and sota (in this package).
For additional help on the validation measures see connectivity, dunn, stability, BHI, and BSI.
data(mouse)

## internal validation
express <- mouse[1:25, c("M1","M2","M3","NC1","NC2","NC3")]
rownames(express) <- mouse$ID[1:25]
intern <- clValid(express, 2:6, clMethods=c("hierarchical","fanny","model"),
                  validation="internal")

## view results
summary(intern)
optimalScores(intern)
plot(intern)

## stability measures
stab <- clValid(express, 2:6, clMethods=c("hierarchical","fanny","model"),
                validation="stability")
optimalScores(stab)
plot(stab)

## biological measures
## first way - functional classes predetermined
fc <- tapply(rownames(express), mouse$FC[1:25], c)
fc <- fc[-match(c("EST","Unknown"), names(fc))]
bio <- clValid(express, 2:6, clMethods=c("hierarchical","fanny","model"),
               validation="biological", annotation=fc)
optimalScores(bio)
plot(bio)

## second way - using Bioconductor
if(require("Biobase") && require("annotate") && require("GO") && require("moe430a")) {
  bio2 <- clValid(express, 2:6, clMethods=c("hierarchical","fanny","model"),
                  validation="biological", annotation="moe430a", GOcategory="all")
  optimalScores(bio2)
  plot(bio2)
}