birch {birch}    R Documentation

Create a birch object

Description

This function creates a birch object using the BIRCH algorithm.

Usage

birch(x, radius, compact=radius, keeptree=FALSE, ...)
birch.addToTree(x, birchObject, updateDIM = TRUE, ...)
birch.getTree(birchObject)
birch.killTree(birchObject)

Arguments

x            a numeric matrix with at least two columns, or a file name or connection that is compatible with read.table.
radius       the closeness criterion.
compact      the compactness criterion. Defaults to radius.
keeptree     a logical, whether to keep the CF tree in memory.
...          arguments to be passed to read.table when loading a file or connection.
updateDIM    a logical, whether to update the dimension of the object. Defaults to TRUE (which is desirable!).
birchObject  the output from birch.

Details

This function creates a CF-Tree not unlike that described in Zhang et al. (1997), and used in Harrington and Salibian-Barrera (2007). A complete explanation of this package is given in Harrington and Salibian-Barrera (2008).

A full tree structure is used, and nodes are split as described in the original article. However, the ‘Merging Refinement’ described on page 149 of that article is not currently implemented, nor is the automatic rebuilding based on page size.

The argument keeptree allows the tree to be kept in memory after the initial processing has been completed. This allows additional data to be added at a later stage with birch.addToTree, without needing to process the whole data set a second time. However, the CF data (the summary statistics of each subcluster, as used by the subsequent algorithms) is not returned, so this command should be followed by a call to birch.getTree to ensure that the correct information is used. In other words,

    ## Create the tree
    myobject <- birch(x, 0.1, keeptree=TRUE)

    ## myobject has no information - let's get some
    myobject <- birch.getTree(myobject)

    ## add some data
    birch.addToTree(y, myobject)

    ## myobject is now out of date! Until
    myobject <- birch.getTree(myobject)

A birch object without summary information is referred to as a “birch skeleton”. Most algorithms will check for the presence of summary data, and request the information if required, but it is better for the user to request the data directly.
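
One reason a CF tree can absorb new data without revisiting earlier observations is that CF summaries are additive (Zhang et al., 1997): the summary of a combined set is the component-wise sum of the summaries of its parts. The base R sketch below illustrates this property outside the package. The helpers cf_summary and cf_merge are hypothetical names, not functions in birch, and sumXisq is assumed here to be the matrix of sums of cross-products.

```r
# Hypothetical helper: CF summary of a block of observations (one per row).
cf_summary <- function(obs) {
  list(N = nrow(obs),               # number of observations
       sumXi = colSums(obs),        # linear sum
       sumXisq = crossprod(obs))    # sum of cross-products, t(obs) %*% obs
}

# Hypothetical helper: merge two CF summaries by adding their components.
cf_merge <- function(a, b) {
  list(N = a$N + b$N,
       sumXi = a$sumXi + b$sumXi,
       sumXisq = a$sumXisq + b$sumXisq)
}

set.seed(42)
first_batch  <- matrix(rnorm(30), ncol = 3)   # 10 observations
second_batch <- matrix(rnorm(60), ncol = 3)   # 20 observations arriving later

# Merging the two summaries matches summarising all the data at once.
merged <- cf_merge(cf_summary(first_batch), cf_summary(second_batch))
direct <- cf_summary(rbind(first_batch, second_batch))
summaries_agree <- isTRUE(all.equal(merged, direct))
```

Only the summaries are needed for the merge, which is why the raw observations from the first batch do not have to be read again.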

The birch object produced by the algorithm has a number of compatible generic methods, as well as clustering and robustness algorithms. These are given in the ‘See Also’ section.

The selection of the parameters for forming the birch object is left to the user, though some guidance is given in Harrington and Salibian-Barrera (2007). One consideration is the purpose. For example, if only simple summary statistics are required (means, covariances, etc.), then a large compactness and radius can be selected, as granularity is not required. For clustering and robust methods where a refinement step is carried out after the birch algorithm has completed, the radius and compactness can again be larger. If there is no refinement step, however, then the selection of these parameters is much more important.
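
The effect of the radius on granularity can be illustrated with a toy example. The sketch below is not the package's CF-tree: it is a flat, single-pass threshold clustering that ignores the compactness criterion entirely, and count_subclusters is a hypothetical helper written only for this illustration. It assumes that closeness means the Euclidean distance from an observation to the nearest subcluster centre.

```r
# Hypothetical helper: flat, single-pass threshold clustering (not a CF-tree).
# An observation joins the nearest subcluster if it lies within `radius` of
# that subcluster's centre; otherwise it starts a new subcluster.
count_subclusters <- function(x, radius) {
  n <- 1                                   # observations per subcluster
  sums <- x[1, , drop = FALSE]             # linear sums, one row per subcluster
  for (i in seq_len(nrow(x))[-1]) {
    centres <- sums / n                    # current subcluster centres
    diffs <- sweep(centres, 2, x[i, ])     # centre minus observation
    dists <- sqrt(rowSums(diffs^2))        # Euclidean distances
    nearest <- which.min(dists)
    if (dists[nearest] <= radius) {
      n[nearest] <- n[nearest] + 1
      sums[nearest, ] <- sums[nearest, ] + x[i, ]
    } else {
      n <- c(n, 1)
      sums <- rbind(sums, x[i, ])
    }
  }
  length(n)
}

# Two tight groups of three points each, far apart from one another.
v <- c(0, 0.1, 0.2, 10, 10.1, 10.2)
toy <- cbind(v, v)

fine   <- count_subclusters(toy, radius = 0.05)  # every point on its own
medium <- count_subclusters(toy, radius = 1)     # one subcluster per group
coarse <- count_subclusters(toy, radius = 100)   # everything in one subcluster
```

On this toy data the three calls give 6, 2 and 1 subclusters respectively: a larger radius gives a coarser summary, which is adequate for overall means and covariances but loses the structure a clustering method without a refinement step would rely on.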

Value

For birch and birch.getTree, a list for each subcluster containing

N        the number of observations in the subcluster
sumXi    the linear sum of the observations in the subcluster
sumXisq  the sum of squares of the observations in the subcluster
members  a list, each element containing an observation index (membership vector) for the subcluster

Both birch.addToTree and birch.killTree return NULL.
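
The three summary components are sufficient to recover the mean and covariance of a subcluster without access to its observations. The base R sketch below shows the arithmetic for a single simulated subcluster. It does not use a birch object, and it assumes that sumXisq holds the d x d matrix of sums of cross-products; the layout in an actual birch object may differ.

```r
set.seed(1)
obs <- matrix(rnorm(40), ncol = 2)    # 20 simulated observations, 2 columns

# Summary statistics of the kind kept for each subcluster
N       <- nrow(obs)
sumXi   <- colSums(obs)
sumXisq <- crossprod(obs)             # assumed layout: t(obs) %*% obs

# Mean and sample covariance recovered from the summaries alone
cf_mean <- sumXi / N
cf_cov  <- (sumXisq - outer(sumXi, sumXi) / N) / (N - 1)

# Compare with the values computed from the raw observations
mean_matches <- isTRUE(all.equal(cf_mean, colMeans(obs)))
cov_matches  <- isTRUE(all.equal(cf_cov, cov(obs)))
```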

Note

This implementation limits input matrices to 30 columns. The intention is to remove this limit in a future version of the package. If the user desires, the limit can be increased in the file ‘ll.h’. Other than the column limit, there is no practical limit (excluding those imposed by the operating system and R) on how many observations can be placed in a birch object. In particular, since the observation numbers belonging to each subcluster are retained, the maximum length of an integer vector determines how large the birch tree can get.
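
Input can be checked against these requirements before a birch object is built. The helper below, check_birch_input, is a hypothetical name and not part of the package; it is a minimal sketch that tests only the requirements documented on this page (a numeric matrix with at least two columns and at most 30). The max_cols argument is there in case the limit has been raised in ‘ll.h’.

```r
# Hypothetical helper (not part of the birch package): validate a matrix
# against the documented input requirements.
check_birch_input <- function(x, max_cols = 30) {
  if (!is.matrix(x) || !is.numeric(x)) {
    stop("x must be a numeric matrix")
  }
  if (ncol(x) < 2) {
    stop("x must have at least two columns")
  }
  if (ncol(x) > max_cols) {
    stop("x has ", ncol(x), " columns; the limit is ", max_cols)
  }
  invisible(TRUE)
}

# A 5-column matrix passes the check
ok <- check_birch_input(matrix(0, nrow = 4, ncol = 5))

# A 31-column matrix fails; capture the error message instead of stopping
too_wide <- tryCatch(
  check_birch_input(matrix(0, nrow = 4, ncol = 31)),
  error = function(e) conditionMessage(e)
)
```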

Some information for developers is provided in the doc directory of this package.

Author(s)

Justin Harrington harringt@stat.ubc.ca and Matias Salibian-Barrera matias@stat.ubc.ca

References

Harrington, J and Salibian-Barrera, M (2007) “Finding Approximate Solutions to Combinatorial Problems with Very Large Datasets using BIRCH”, submitted to Statistical Algorithms and Software, 2nd Special Issue Computational Statistics and Data Analysis. A draft can be found at http://www.stat.ubc.ca/~harringt/birch/birch.pdf.

Harrington, J and Salibian-Barrera, M (2008) “birch: Working with very large data sets”, submitted to Journal of Statistical Software. A draft can be found at http://www.stat.ubc.ca/~harringt/birch/birch-jss.pdf.

Zhang, T. and Ramakrishnan, R. and Livny, M. (1997) “BIRCH: A New Data Clustering Algorithm and Its Applications”, Data Mining and Knowledge Discovery 1, 141–182.

See Also

lts.birch, rlga.birch, covMcd.birch, kmeans.birch, lga.birch, and plot.birch, which also documents the other generic methods

Examples

## an example
## Create a data set
library(MASS)
set.seed(1234) 
x <- mvrnorm(1e5, mu=rep(0,5), Sigma=diag(1,5))
x <- rbind(x, mvrnorm(1e5, mu=rep(10,5), Sigma=diag(0.1,5)+0.9))

## Create birch object
birchObj <- birch(x, 5)

## To load directly from a file or connection
## Not run: birchObj <- birch("myfile.csv", 1, sep=",", header=TRUE)
## Not run: 
birchObj <- birch("http://www.dot.com/myfile.csv", 1, sep=",",
                  header=TRUE)
## End(Not run)

## Leaving a tree in memory
birchObj <- birch(x, 5, keeptree=TRUE)
birch.addToTree(x, birchObj)
birchObj <- birch.getTree(birchObj)
## And don't forget to...
birch.killTree(birchObj)

[Package birch version 1.1-3 Index]