gapStat {SLmisc}    R Documentation

Gap statistic for estimating the number of data clusters

Description

Calculates the gap statistic, a goodness-of-clustering measure that compares the average within-cluster dispersion of the data with the dispersion expected under a reference distribution.

Usage

gapStat(data, class = rep(1, nrow(data)), M = 500)

Arguments

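data matrix or data.frame; rows are the observations to be clustered
class a vector giving the cluster membership of each row of data; the default assigns all rows to a single cluster
M integer, number of Monte Carlo samples drawn from the reference distribution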

Details

This function is based on the function gap of package "SAGx".
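
For orientation, the gap statistic as defined in the 2001 paper listed under References can be written as below. This is a summary of the published definition, not a description of the code in this package: the distance measure and reference distribution used by gapStat are not stated on this page, and M is assumed here to play the role of the number of reference samples.

```latex
W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} \sum_{i, i' \in C_r} d_{ii'}, \qquad
\mathrm{Gap}(k) = \frac{1}{M} \sum_{m=1}^{M} \log W^{*}_{k,m} - \log W_k, \qquad
s_k = \mathrm{sd}_k \sqrt{1 + 1/M}
```

Here C_r is the r-th cluster with n_r members, d_{ii'} is the distance between observations i and i', W*_{k,m} is the dispersion of the m-th reference sample, and sd_k is the standard deviation of the M values log W*_{k,m}. The suggested number of clusters is the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}.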

Value

numeric vector with two components: "gap statistic" and "SE of simulation".
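
To make the two returned components concrete, the following is a minimal base-R sketch of how a gap value and its simulation standard error could be computed, following the definition in the 2001 paper listed under References. It is not the implementation used by gapStat or by SAGx: the names within_dispersion and gap_sketch are hypothetical, the reference distribution is assumed to be uniform over the range of each column, and every reference sample is assumed to be re-clustered with kmeans.

```r
## Hypothetical helper (not part of SLmisc): pooled within-cluster dispersion,
## i.e. the sum over clusters of squared distances to the cluster mean.
within_dispersion <- function(x, class) {
  x <- as.matrix(x)
  sum(sapply(split(seq_len(nrow(x)), class), function(idx) {
    xi <- x[idx, , drop = FALSE]
    sum(sweep(xi, 2, colMeans(xi))^2)
  }))
}

## Hypothetical sketch (not part of SLmisc): gap value and simulation SE for k clusters.
gap_sketch <- function(x, k, M = 100) {
  x <- as.matrix(x)
  ## log dispersion of the observed data, clustered with kmeans
  log_w <- log(within_dispersion(x, kmeans(x, centers = k, nstart = 10)$cluster))
  ## assumption: reference data are uniform over the observed range of each column
  rng <- apply(x, 2, range)
  log_w_ref <- replicate(M, {
    ref <- apply(rng, 2, function(r) runif(nrow(x), min = r[1], max = r[2]))
    log(within_dispersion(ref, kmeans(ref, centers = k, nstart = 10)$cluster))
  })
  ## note: sd() uses the n - 1 denominator, a small deviation from the paper
  c(gap = mean(log_w_ref) - log_w,
    se  = sd(log_w_ref) * sqrt(1 + 1 / M))
}

## usage on two well-separated groups
set.seed(1)
y <- rbind(matrix(rnorm(60, sd = 0.1), ncol = 2),
           matrix(rnorm(60, mean = 3, sd = 0.1), ncol = 2))
gap_sketch(y, k = 2, M = 20)
```

Because the sketch re-clusters each reference sample, its values need not agree numerically with those returned by gapStat.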

Author(s)

Dr. Matthias Kohl (SIRS-Lab GmbH) kohl@sirs-lab.com

References

Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. J. R. Statist. Soc. B, 63, pp. 411–423.

Tibshirani, R., Walther, G. and Hastie, T. (2000). Estimating the number of clusters in a dataset via the Gap statistic. Technical Report. Stanford.

Per Broberg (2006). SAGx: Statistical Analysis of the GeneChip. R package version 1.9.7. http://home.swipnet.se/pibroberg/expression_hemsida1.html

See Also

gap, kmeansGap

Examples

x <- rbind(matrix(rnorm(150, sd = 0.1), ncol= 3),
              matrix(rnorm(150, mean = 1, sd = 0.1), ncol = 3),
              matrix(rnorm(150, mean = 2, sd = 0.1), ncol = 3),
              matrix(rnorm(150, mean = 3, sd = 0.1), ncol = 3))

gap.stat <- matrix(NA, ncol = 2, nrow = 9)
for(i in 2:10){
  cl <- kmeans(x, i)
  gap.stat[i-1, ] <- gapStat(x, cl$cluster, M = 100)
}

## choose the number of clusters as the smallest k for which the following
## is non-negative
(res <- gap.stat[1:8,1] - gap.stat[2:9,1] + gap.stat[2:9,2])
min(c(2:9)[res >= 0])
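
The selection rule used above can also be wrapped in a small helper. The function choose_k below is a hypothetical sketch, not part of SLmisc, and the input values in the usage lines are made up purely for illustration. It returns the smallest k whose gap value is at least the next gap value minus the next standard error, and NA if no k qualifies.

```r
## Hypothetical helper (not part of SLmisc).
## gap: gap values, se: simulation SEs, ks: the cluster numbers they belong to.
choose_k <- function(gap, se, ks) {
  n <- length(gap)
  ## compare each k with its successor k + 1
  ok <- gap[-n] >= gap[-1] - se[-1]
  if (any(ok)) ks[-n][which(ok)[1]] else NA_integer_
}

## made-up illustrative values for k = 2, ..., 6
gap <- c(0.20, 0.50, 0.90, 0.85, 0.80)
se  <- rep(0.05, 5)
choose_k(gap, se, ks = 2:6)
```

With the matrix from the example above, the equivalent call would be choose_k(gap.stat[, 1], gap.stat[, 2], ks = 2:10).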

[Package SLmisc version 1.4.1 Index]