pvclust {pvclust} | R Documentation |
Calculates p-values for hierarchical clustering via multiscale bootstrap resampling. Hierarchical clustering is performed on the given data, and a p-value is computed for each cluster.
pvclust(data, method.hclust="average", method.dist="correlation",
        use.cor="pairwise.complete.obs", nboot=1000,
        r=seq(.5,1.4,by=.1), store=FALSE, weight=FALSE)

parPvclust(cl, data, method.hclust="average", method.dist="correlation",
           use.cor="pairwise.complete.obs", nboot=1000,
           r=seq(.5,1.4,by=.1), store=FALSE, weight=FALSE,
           init.rand=TRUE, seed=NULL)
data |
numeric data matrix or data frame. |
method.hclust |
the agglomerative method used in hierarchical clustering. This
should be (an abbreviation of) one of "average", "ward",
"single", "complete", "mcquitty",
"median" or "centroid". The default is
"average". See the method argument in
hclust.
|
method.dist |
the distance measure to be used. This should be (an
abbreviation of) one of "correlation", "uncentered",
"abscor" or one of those allowed for the method
argument of the dist function. The default is
"correlation". See the Details section of this help page and the
method argument in dist.
|
use.cor |
character string which specifies the method for
computing correlations when the data include missing values. This
should be (an abbreviation of) one of "all.obs",
"complete.obs" or "pairwise.complete.obs". See
the use argument of the cor function.
|
nboot |
the number of bootstrap replications. The default is
1000. |
r |
numeric vector which specifies the relative sample sizes of bootstrap replications. For original sample size n and bootstrap sample size n', this is defined as r=n'/n. |
store |
logical. If store=TRUE, all bootstrap replications
are stored in the output object. The default is FALSE. |
cl |
snow cluster object, which may be generated by the
function makeCluster. See snow-startstop
in the snow package. |
weight |
logical. If weight=TRUE, resampling is performed with a
weight vector instead of an index vector. This is useful for large
values of r (r>10). Currently available only for the distances
"correlation" and "abscor". |
init.rand |
logical. If init.rand=TRUE, random number
generators are initialized in the child processes. Random seeds can be
set with the seed argument. |
seed |
integer vector of random seeds. It should have the same
length as cl. If NULL is specified,
1:length(cl) is used as the seed vector. The default is NULL. |
Function pvclust conducts multiscale bootstrap resampling to calculate p-values for each cluster in the result of hierarchical clustering. parPvclust is the parallel version of this procedure, which depends on the snow package for parallel computation.
For data expressed as an (n, p) matrix or data frame, we assume that the data consist of n observations of p objects, which are to be clustered. The i'th row vector corresponds to the i'th observation of these objects, and the j'th column vector corresponds to the sample of size n for the j'th object.
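The row/column convention and the idea of resampling at a relative size r can be illustrated with base R alone. The sketch below is not the pvclust implementation: it resamples rows (observations) at a single scale r, re-clusters the columns (objects), and reports how often one chosen cluster reappears, which is only a plain bootstrap probability at that scale. The helper names cluster_columns and has_cluster, the toy data, and the rounding of n' are all assumptions for illustration.

```r
set.seed(42)

# Toy data: n = 30 observations (rows) of p = 5 objects (columns).
# Columns a,b share one signal and c,d share another; e is pure noise.
n <- 30
base1 <- rnorm(n)
base2 <- rnorm(n)
X <- cbind(a = base1 + rnorm(n, sd = 0.3),
           b = base1 + rnorm(n, sd = 0.3),
           c = base2 + rnorm(n, sd = 0.3),
           d = base2 + rnorm(n, sd = 0.3),
           e = rnorm(n))

# Hypothetical helper: cluster the COLUMNS of M using
# 1 - correlation as dissimilarity and average linkage.
cluster_columns <- function(M) {
  hclust(as.dist(1 - cor(M)), method = "average")
}

# Hypothetical helper: TRUE if the dendrogram contains a cluster
# consisting of exactly the objects in `members`.
has_cluster <- function(hc, members) {
  p <- length(hc$labels)
  for (k in seq_len(p - 1)) {
    groups <- split(hc$labels, cutree(hc, k = k))
    if (any(vapply(groups, setequal, logical(1), members))) return(TRUE)
  }
  FALSE
}

# One scale only: r = 1 gives the ordinary bootstrap (n' = n).
r <- 1.0
n.boot <- round(r * n)  # ASSUMPTION: nearest-integer rounding
nboot <- 200

hits <- replicate(nboot, {
  rows <- sample(n, n.boot, replace = TRUE)  # resample observations
  has_cluster(cluster_columns(X[rows, , drop = FALSE]), c("a", "b"))
})

# Fraction of replicates in which the cluster {a, b} appears
bp <- mean(hits)
print(bp)
```

Repeating this for several values of r and fitting a curve to the results is what distinguishes the multiscale procedure from the single-scale frequency computed here. To cluster observations rather than objects, the data would be transposed with t() first.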
There are several methods to measure the dissimilarities between objects. For a data matrix X, the "correlation" method takes
1 - cor(X)[j,k]
as the dissimilarity between the j'th and k'th objects, where cor is the function cor. "uncentered" takes the uncentered sample correlation
1 - sum(X[,j] * X[,k]) / (sqrt(sum(X[,j]^2)) * sqrt(sum(X[,k]^2)))
and "abscor" takes the absolute value of the sample correlation
1 - abs(cor(X)[j,k]).
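The three dissimilarities above can be computed for all pairs of columns at once with base R. This is a minimal sketch on simulated data with no missing values; the object names d.cor, d.unc and d.abs are illustrative, and the code is not taken from the package source.

```r
set.seed(1)

# Toy data matrix: 20 observations (rows) of 4 objects (columns)
X <- matrix(rnorm(20 * 4), nrow = 20, ncol = 4,
            dimnames = list(NULL, c("a", "b", "c", "d")))

# "correlation": 1 - sample correlation between columns
d.cor <- 1 - cor(X)

# "uncentered": 1 - uncentered correlation (columns are not mean-centered)
norms <- sqrt(colSums(X^2))
d.unc <- 1 - crossprod(X) / outer(norms, norms)

# "abscor": 1 - absolute value of the sample correlation
d.abs <- 1 - abs(cor(X))

# Check one entry of the uncentered version against the formula in the text
j <- 1
k <- 2
manual <- 1 - sum(X[, j] * X[, k]) /
  (sqrt(sum(X[, j]^2)) * sqrt(sum(X[, k]^2)))
print(all.equal(d.unc[j, k], manual))

# Any of these matrices can be handed to hclust via as.dist
hc <- hclust(as.dist(d.cor), method = "average")
```

Note that "correlation" and "uncentered" range over [0, 2], while "abscor" ranges over [0, 1] and treats strongly negatively correlated objects as close.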
hclust |
hierarchical clustering of the original data, generated by the
function hclust. See hclust for details. |
edges |
data frame object which contains p-values and supporting information such as standard errors. |
count |
data frame object which contains primitive information about the result of multiscale bootstrap resampling. |
msfit |
list whose elements are the results of curve fitting for
multiscale bootstrap resampling, of class msfit. See
msfit for details. |
nboot |
numeric vector of number of bootstrap replications. |
r |
numeric vector of the relative sample size for bootstrap replications. |
store |
list containing the bootstrap replications if store=TRUE
was given to the function pvclust or parPvclust. |
Ryota Suzuki ryota.suzuki@is.titech.ac.jp
Shimodaira, H. (2004) "Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling", Annals of Statistics, 32, 2616-2641.
Shimodaira, H. (2002) "An approximately unbiased test of phylogenetic tree selection", Systematic Biology, 51, 492-508.
Suzuki, R. and Shimodaira, H. (2004) "An application of multiscale bootstrap resampling to hierarchical clustering of microarray data: How accurate are these clusters?", The Fifteenth International Conference on Genome Informatics 2004, P034.
http://www.is.titech.ac.jp/~shimo/prog/pvclust/
lines.pvclust, print.pvclust, msfit, plot.pvclust, text.pvclust, pvrect and pvpick.
## using Boston data in package MASS
library(MASS)
data(Boston)

## multiscale bootstrap resampling
boston.pv <- pvclust(Boston, nboot=100)

## CAUTION: nboot=100 may be too small for actual use.
## We suggest nboot=1000 or larger.
## plot/print functions will be useful for diagnostics.

## plot dendrogram with p-values
plot(boston.pv)

ask.bak <- par()$ask
par(ask=TRUE)

## highlight clusters with high au p-values
pvrect(boston.pv)

## print the result of multiscale bootstrap resampling
print(boston.pv, digits=3)

## plot diagnostic for curve fitting
msplot(boston.pv, edges=c(2,4,6,7))

par(ask=ask.bak)

## Print clusters with high p-values
boston.pp <- pvpick(boston.pv)
boston.pp

## Not run:
## parallel computation via snow package
library(snow)
cl <- makeCluster(10, type="MPI")

## parallel version of pvclust
boston.pv <- parPvclust(cl, Boston, nboot=1000)

## End(Not run)