covMcd.birch {birch} R Documentation

Finding the Minimum Covariance Determinant using BIRCH

Description

A function that uses a birch object to find an approximate solution to the minimum covariance determinant (MCD) problem. The goal is to find a subset of size $\alpha \times n$ whose sample covariance matrix has the smallest determinant.
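
For reference, the MCD objective can be written out as follows. The notation ($H$, $h$, $\bar{x}_H$, $S_H$) is introduced here for illustration and does not come from the package; in particular, how $\alpha n$ is rounded to an integer subset size $h$ is left unspecified.

$$
\bar{x}_H = \frac{1}{h} \sum_{i \in H} x_i, \qquad
S_H = \frac{1}{h-1} \sum_{i \in H} (x_i - \bar{x}_H)(x_i - \bar{x}_H)^\top, \qquad
H^{*} = \arg\min_{H \subseteq \{1,\dots,n\},\ |H| = h} \det(S_H), \quad h \approx \alpha n .
$$

The location and covariance estimates are then $\bar{x}_{H^{*}}$ and $S_{H^{*}}$. Searching over all subsets of size $h$ is combinatorial, which is why only an approximate solution is sought.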

Usage

covMcd.birch(birchObject, alpha=0.5, nsamp=100)
covMcdBirch.refinement(covOut, x, alpha=0.5) 

Arguments

birchObject an object created by the function birch.
alpha numeric parameter controlling the size of the subsets over which the determinant is minimized, i.e., alpha*n observations are used for computing the determinant. Allowed values are between 0.5 and 1 and the default is 0.5.
nsamp number of subsets used for initial estimates
covOut the output from covMcd.birch, used as the initial estimate.
x the full data set on which to perform the concentration steps.

Details

The algorithm is similar to covMcd from the robustbase package, as described in Rousseeuw and Van Driessen (1999), except that it operates on a birch object rather than on the raw data. The advantages of this approach are that the full data set does not need to be held in memory and that the solution space is smaller. Further details can be found in Harrington and Salibian-Barrera (2007) and Harrington and Salibian-Barrera (2008).
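
To illustrate why the full data set is not needed, the sketch below shows how the mean and covariance of a union of subclusters can be recovered from per-subcluster summaries alone (a count, a sum, and a sum of cross-products). This is only an illustration in base R: the functions summarise_subcluster and pooled_moments and the field names n, sum_x and sum_xx are hypothetical, and the internal layout of a real birch object may differ.

```r
# Hypothetical per-subcluster summary: count, column sums, cross-products.
summarise_subcluster <- function(x) {
  x <- as.matrix(x)
  list(n = as.numeric(nrow(x)), sum_x = colSums(x), sum_xx = crossprod(x))
}

# Mean and sample covariance of the union of several subclusters,
# computed from the summaries only (no access to the raw rows).
pooled_moments <- function(summaries) {
  n      <- sum(vapply(summaries, function(s) s$n, numeric(1)))
  sum_x  <- Reduce(`+`, lapply(summaries, function(s) s$sum_x))
  sum_xx <- Reduce(`+`, lapply(summaries, function(s) s$sum_xx))
  center <- sum_x / n
  covmat <- (sum_xx - n * tcrossprod(center)) / (n - 1)
  list(n = n, center = center, cov = covmat, det = det(covmat))
}

# Simulated data split into 20 artificial "subclusters" of 10 rows each.
set.seed(1)
x <- matrix(rnorm(600), ncol = 3)
groups <- split(seq_len(nrow(x)), rep(1:20, each = 10))
summaries <- lapply(groups, function(idx) summarise_subcluster(x[idx, , drop = FALSE]))

# Evaluate a candidate subset made of the first 12 subclusters.
chosen <- 1:12
pooled <- pooled_moments(summaries[chosen])

# Compare with the covariance computed from the raw rows.
rows <- unlist(groups[chosen])
agrees <- isTRUE(all.equal(pooled$cov, cov(x[rows, ]), check.attributes = FALSE))
```

Because a candidate subset is assembled from whole subclusters, its determinant can be evaluated by adding up a few small matrices, regardless of how many observations each subcluster contains.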

If further accuracy is desired, an additional “refinement” step can be performed. This uses the birch solution as the initial estimate for one set of concentration steps, this time run on the whole data set rather than on the birch object. However, if birch was used because the whole data set cannot fit in memory, this extra step is not an option.
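
A concentration step, as described in Rousseeuw and Van Driessen (1999), keeps the h observations with the smallest Mahalanobis distances to the current estimate and recomputes the mean and covariance from them; repeating this never increases the determinant. The following base R sketch shows the idea. It is not the code of covMcdBirch.refinement: the functions c_step and refine, the max_iter argument, and the choice h = ceiling(alpha * n) are assumptions made for illustration, and the full-data mean and covariance stand in for the initial estimate that the birch solution would provide.

```r
# One concentration step: keep the h rows closest (in Mahalanobis
# distance) to the current estimate, then re-estimate from those rows.
c_step <- function(x, center, scatter, h) {
  d2   <- mahalanobis(x, center, scatter)
  keep <- sort(order(d2)[seq_len(h)])
  sub  <- x[keep, , drop = FALSE]
  s    <- cov(sub)
  list(center = colMeans(sub), cov = s, det = det(s), keep = keep)
}

# Iterate concentration steps until the selected subset stops changing.
refine <- function(x, center, scatter, alpha = 0.5, max_iter = 100) {
  x <- as.matrix(x)
  h <- ceiling(alpha * nrow(x))  # assumed subset size
  current <- c_step(x, center, scatter, h)
  for (i in seq_len(max_iter)) {
    candidate <- c_step(x, current$center, current$cov, h)
    if (identical(candidate$keep, current$keep)) break
    current <- candidate
  }
  current
}

# Simulated data: 90 clean points plus 10 shifted outliers.
set.seed(42)
clean    <- matrix(rnorm(180), ncol = 2)
outliers <- matrix(rnorm(20, mean = 8), ncol = 2)
x <- rbind(clean, outliers)

# Stand-in initial estimate; in practice this would come from the
# zbar and Sz components of the covMcd.birch output.
first <- c_step(x, colMeans(x), cov(x), h = 75)
fit   <- refine(x, center = colMeans(x), scatter = cov(x), alpha = 0.75)
```

The loop stops when two successive steps select the same rows, at which point further steps cannot change the estimate.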

A summary method is available for the output of this command.

Value

For covMcd.birch, returns a list containing:

zbar estimate of location
Sz estimate of covariance
Det the value of the minimum covariance determinant (MCD) found
best a list containing a vector of the subclusters that make up the solution (sub) and a vector of the underlying observations that make up those subclusters (obs)

Note

For this algorithm to produce meaningful results, the birch object should contain hundreds, or preferably thousands, of subclusters.

Author(s)

Justin Harrington harringt@stat.ubc.ca and Matias Salibian-Barrera matias@stat.ubc.ca

References

Harrington, J. and Salibian-Barrera, M. (2007) “Finding Approximate Solutions to Combinatorial Problems with Very Large Datasets using BIRCH”, submitted to Statistical Algorithms and Software, 2nd Special Issue Computational Statistics and Data Analysis. A draft can be found at http://www.stat.ubc.ca/~harringt/birch/birch.pdf.

Harrington, J. and Salibian-Barrera, M. (2008) “birch: Working with very large data sets”, submitted to Journal of Statistical Software. A draft can be found at http://www.stat.ubc.ca/~harringt/birch/birch-jss.pdf.

Rousseeuw, P.J. and Van Driessen, K. (1999) “A Fast Algorithm for the Minimum Covariance Determinant Estimator”, Technometrics 41, 212–223.

See Also

birch, and the original algorithm covMcd in the robustbase package

Examples

data(birchObj)
covOut <- covMcd.birch(birchObj, alpha=0.5)
summary(covOut)

## If the original data set x is available
## Not run: refOut <- covMcdBirch.refinement(covOut, x, alpha=0.5)

[Package birch version 1.1-3]