cascadeKM {vegan}R Documentation

K-means partitioning using a range of values of K

Description

This function is a wrapper for the kmeans function. It creates several partitions forming a cascade from a small to a large number of groups.

Usage

cascadeKM(data, inf.gr, sup.gr, iter = 100, criterion = "calinski")
cIndexKM (y, x, index = "all")
## S3 method for class 'cascadeKM':
plot(x, min.g, max.g, grpmts.plot = TRUE, 
    sortg = FALSE, gridcol = NA, ...) 

Arguments

data The data matrix. The objects are the rows.
inf.gr The number of groups for the partition with the smallest number of groups of the cascade (min).
sup.gr The number of groups for the partition with the largest number of groups of the cascade (max).
iter The number of random starting configurations for each value of K.
criterion The criterion that will be used to select the best partition. The default value is "calinski", which refers to the Calinski-Harabasz (1974) criterion. The simple structure index, "ssi", is also available.
y Object of class "kmeans" returned by a clustering algorithm such as kmeans
x Data matrix where columns correspond to variables and rows to observations, or the plotting object in plot
index The available indices are: "calinski" and "ssi". Type "all" to obtain both indices. Abbreviations of these names are also accepted.
min.g, max.g The minimum and maximum numbers of groups to be displayed.
grpmts.plot Show the plot (TRUE or FALSE).
sortg Sort the objects as a function of their group membership to produce a more easily interpretatable graph. See Details. The original object names are kept; they are used as labels in the output table x, although not in the graph. If there were no rownames, sequential row numbers are used to keep track of the original order of the objects.
gridcol The colour of the grid lines in the plots. NA, which is the default value, removes the grid lines.
... Other parameters to the functins (ignored).

Details

The function creates several partitions formimg a cascade from a small to a large number of groups formed by kmeans. The most of the work is performed by function cIndex s based on the clustIndex function. Some of the criteria were removed from this version because computation errors were generated when only one object was found in a group.

The default value is "calinski", which refers to the well-known Calinski-Harabasz (1974) criterion. The other available index is the simple structure index "ssi". In the case of groups of equal sizes, "calinski" is generally a good criterion to indicate the correct number of groups. Users should not take its indications literally when the groups are not equal in size. Type "all" to obtain both indices. The indices are defined as:

  • calinski: (SSB/(K-1))/(SSW/(n-K)), where n is the number of data points and K is the number of clusters. SSW is the sum of squares within the clusters while SSB is the sum of squares among the clusters. This index is simply an F analysis-of-variance statistic.
  • ssi: this "Simple Structure Index" combines three elements which influence the interpretability of a solution, i.e., the maximum difference of each variable between the clusters, the sizes of the most contrasting clusters and the deviation of a variable in the cluster centers compared to its overall mean. These three elements are multiplicatively combined and normalized to give a value between 0 and 1.
  • In a simulation study, Milligan and Cooper (1985) found that the Calinski-Harabasz criterion recovered the correct number of groups the most often. We recommend that criterion because, if the groups are of equal sizes, the maximum value of "calinski" usually indicates the correct number of groups. Another available index is the simple structure index "ssi". Users should not take the indications of these indices literally when the groups are not equal in size and explore the groups corresponding to other values of K.

    Function cascadeKM has a plot method. Two plots are produced. The graph on the left has the objects in abscissa and the number of groups in ordinate. The groups are represented by colours. The graph on the right shows the values of the criterion ("calinski" or "ssi") for determining the best partition. The highest value of the criterion is marked in red. Points marked in orange, if any, indicate partitions producing an increase in the criterion value as the number of groups increases; they may represent other interesting partitions.

    If sortg=TRUE, the objects are reordered by the following procedure: (1) a simple matching distance matrix is computed among the objects, based on the table of K-means assignments to groups, from K = min.g to K = max.g. (2) A principal coordinate analysis (PCoA, Gower 1966) is computed on the centred distance matrix. (3) The first principal coordinate is used as the new order of the objects in the graph. A simplified algorithm is used to compute the first principal coordinate only, using the iterative algorithm described in Legendre & Legendre (1998, Table 9.10). The full distance matrix among objects is never computed; this evacuates the problem of storing it when the number of objects is large. Distance values are computed as they are needed by the algorithm.

    Value

    Function cascadeKM returns an object of class cascadeKM with items:

    partition Table with the partitions found for different numbers of groups K, from K = inf.gr to K = sup.gr.
    results Values of the criterion to select the best partition.
    criterion The name of the criterion used.
    size The number of objects found in each group, for all partitions (columns).


    Function cIndex returns a vector with the index values. The maximum value of these indices is supposed to indicate the best partition. These indices work best with groups of equal sizes. When the groups are not of equal sizes, one should not put too much faith in the maximum of these indices, and also explore the groups corresponding to other values of K.

    Author(s)

    Marie-Helene Ouellette <Marie-Helene.Ouellette@UMontreal.ca>, Sebastien Durand <Sebastien.Durand@UMontreal.ca> and Pierre Legendre <Pierre.Legendre@UMontreal.ca>. Edited for vegan by Jari Oksanen.

    References

    Calinski, T. and J. Harabasz. 1974. A dendrite method for cluster analysis. Commun. Stat. 3: 1-27.

    Gower, J. C. 1966. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53: 325-338.

    Legendre, P. & L. Legendre. 1998. Numerical ecology, 2nd English edition. Elsevier Science BV, Amsterdam.

    Milligan, G. W. & M. C. Cooper. 1985. An examination of procedures for determining the number of clusters in a data set. Psychometrika 50: 159-179.

    Weingessel, A., Dimitriadou, A. and Dolnicar, S. An Examination Of Indexes For Determining The Number Of Clusters In Binary Data Sets, http://www.wu-wien.ac.at/am/wp99.htm#29

    See Also

    kmeans

    Examples

     # Partitioning a (10 x 10) data matrix of random numbers
     mat <- matrix(runif(100),10,10)
     res <- cascadeKM(mat, 2, 5, iter = 25, criterion = 'calinski') 
     toto <- plot(res)
     
     # Partitioning an autocorrelated time series
     vec <- sort(matrix(runif(30),30,1))
     res <- cascadeKM(vec, 2, 5, iter = 25, criterion = 'calinski')
     toto <- plot(res)
     
     # Partitioning a large autocorrelated time series
     # Note that we remove the grid lines
     vec <- sort(matrix(runif(1000),1000,1))
     res <- cascadeKM(vec, 2, 7, iter = 10, criterion = 'calinski')
     toto <- plot(res, gridcol=NA)
     
    

    [Package vegan version 1.8-5 Index]