covMcd.birch {birch} | R Documentation |
A function that uses a birch object to find an approximate solution to the Minimum Covariance Determinant (MCD) problem. The goal is to find a subset of size $\alpha \times n$ whose sample covariance matrix has the smallest determinant.
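As a sketch of the objective only, the MCD problem can be written as follows. The symbols $H$, $h$, $\bar{x}_H$ and $S_H$ are notation introduced here for illustration, and the $1/(h-1)$ normalisation is an assumption (the usual sample covariance); the exact rounding of $\alpha n$ to an integer $h$ is not specified on this page.

$$
\hat{H} = \arg\min_{H \subset \{1,\dots,n\},\ |H| = h} \det(S_H),
\qquad
S_H = \frac{1}{h-1} \sum_{i \in H} (x_i - \bar{x}_H)(x_i - \bar{x}_H)^\top,
\qquad
\bar{x}_H = \frac{1}{h} \sum_{i \in H} x_i,
$$

where $h \approx \alpha n$ and $x_1, \dots, x_n$ are the observations.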
covMcd.birch(birchObject, alpha=0.5, nsamp=100)
covMcdBirch.refinement(covOut, x, alpha=0.5)
birchObject |
an object created by the function birch . |
alpha |
numeric parameter controlling the size of the subsets over which the determinant is minimized, i.e., alpha*n observations are used for computing the determinant. Allowed values are between 0.5 and 1 and the default is 0.5. |
nsamp |
number of subsets used for initial estimates |
covOut |
the output from covMcd.birch |
x |
a data set on which to perform a set of concentration steps. |
The algorithm is similar to covMcd from the robustbase package, as described in Rousseeuw and Van Driessen (1999), except that it uses a birch object instead of the raw data. The advantage of this approach is that it does not require the full data set to be held in memory, and the solution space is smaller. Further details can be found in Harrington and Salibian-Barrera (2007) and Harrington and Salibian-Barrera (2008).
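To illustrate why a summary of subclusters can stand in for the full data, the following base R sketch pools per-subcluster sufficient statistics (count, sum of rows, sum of cross-products) into one mean and covariance. This is not the package's code: the helper names `make_cf` and `pool_cf` and the list layout are hypothetical, and the internal layout of a real birch object may differ.

```r
# Summarise one subcluster by its sufficient statistics (illustrative layout).
make_cf <- function(x) {
  x <- as.matrix(x)
  list(n = nrow(x),          # number of observations
       ls = colSums(x),      # sum of the rows
       ss = crossprod(x))    # sum of cross-products, t(x) %*% x
}

# Pool several subcluster summaries into one mean and sample covariance,
# without touching the underlying observations.
pool_cf <- function(cfs) {
  n  <- sum(sapply(cfs, function(cf) cf$n))
  ls <- Reduce(`+`, lapply(cfs, function(cf) cf$ls))
  ss <- Reduce(`+`, lapply(cfs, function(cf) cf$ss))
  center  <- ls / n
  scatter <- (ss - n * tcrossprod(center)) / (n - 1)
  list(n = n, center = center, cov = scatter)
}

set.seed(1)
x <- matrix(rnorm(300), ncol = 3)
groups <- split(seq_len(nrow(x)), rep(1:4, each = 25))
cfs <- lapply(groups, function(idx) make_cf(x[idx, , drop = FALSE]))
pooled <- pool_cf(cfs)
```

The pooled mean and covariance agree with `colMeans(x)` and `cov(x)` up to floating-point error, so candidate subsets can be evaluated by adding and removing whole subclusters rather than individual observations.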
If further accuracy is desired, an additional “refinement” step can be performed: the birch solution is used as the initial estimate for one set of concentration steps, this time on the whole data set rather than the birch object. However, if birch was used because the whole data set cannot fit in memory, this extra step is not an option.
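The idea of a set of concentration steps started from an initial estimate can be sketched in base R as below. This is a minimal illustration in the spirit of Rousseeuw and Van Driessen (1999), not the implementation of covMcdBirch.refinement: the function name `c_steps`, its arguments, the `floor(alpha * n)` rounding, and the stopping rule are all assumptions.

```r
# Concentration steps on the full data, starting from an initial
# location/scatter estimate (a stand-in for a birch-based solution).
c_steps <- function(x, center, scatter, alpha = 0.5, max_iter = 100) {
  x <- as.matrix(x)
  h <- floor(alpha * nrow(x))   # subset size; the rounding is an assumption
  dets <- numeric(0)
  for (iter in seq_len(max_iter)) {
    # Keep the h observations closest to the current estimate.
    d2 <- mahalanobis(x, center, scatter)
    keep <- order(d2)[seq_len(h)]
    # Re-estimate location and scatter from that subset.
    center  <- colMeans(x[keep, , drop = FALSE])
    scatter <- stats::cov(x[keep, , drop = FALSE])
    dets <- c(dets, det(scatter))
    # Stop once the determinant no longer decreases.
    if (iter > 1 && dets[iter] >= dets[iter - 1]) break
  }
  list(center = center, cov = scatter, det = dets[length(dets)],
       subset = sort(keep), dets = dets)
}

set.seed(42)
clean    <- matrix(rnorm(180 * 2), ncol = 2)
outliers <- matrix(rnorm(20 * 2, mean = 8), ncol = 2)
x <- rbind(clean, outliers)

# Crude starting point: mean and covariance of a random subsample.
start_idx <- sample(nrow(x), 50)
out <- c_steps(x,
               center  = colMeans(x[start_idx, ]),
               scatter = stats::cov(x[start_idx, ]),
               alpha   = 0.75)
```

Each pass needs the distances of all observations, which is why this kind of refinement is only feasible when the full data set is available.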
A summary method is available for the output of this command.
covMcd.birch returns a list containing:
zbar |
estimate of location |
Sz |
estimate of covariance |
Det |
the value of the minimized covariance determinant (the MCD)
best |
A list containing a vector of which subclusters make up the clustering (sub) and a vector with the underlying observations that make up the clusters (obs) |
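One common use of a robust location and covariance estimate is to compute robust distances and flag unusual observations. The sketch below uses a hand-built stand-in list with the same component names (`zbar`, `Sz`) so that it runs without the birch package; the values are hypothetical, and the chi-square cutoff is a conventional choice, not something this page prescribes. Whether `Sz` includes any consistency correction is not stated here.

```r
# Stand-in for the list returned by covMcd.birch (hypothetical values).
covOut <- list(zbar = c(0, 0), Sz = diag(2))

# Three observations; the last lies far from the bulk.
x <- rbind(c(0.1, -0.2),
           c(0.5,  0.4),
           c(6.0,  7.0))

# Squared robust distances from the estimated centre.
d2 <- mahalanobis(x, center = covOut$zbar, cov = covOut$Sz)

# Flag rows beyond the 97.5% chi-square quantile (a conventional cutoff).
flagged <- which(d2 > qchisq(0.975, df = ncol(x)))
```

With a real result, `covOut$best$sub` and `covOut$best$obs` would additionally identify the subclusters and observations that make up the selected subset.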
For this algorithm to produce meaningful results, the birch object should contain hundreds, or preferably thousands, of subclusters.
Justin Harrington harringt@stat.ubc.ca and Matias Salibian-Barrera matias@stat.ubc.ca
Harrington, J and Salibian-Barrera, M (2007) “Finding Approximate Solutions to Combinatorial Problems with Very Large Datasets using BIRCH”, submitted to Statistical Algorithms and Software, 2nd Special Issue Computational Statistics and Data Analysis. A draft can be found at http://www.stat.ubc.ca/~harringt/birch/birch.pdf.
Harrington, J and Salibian-Barrera, M (2008) “birch: Working with very large data sets”, submitted to Journal of Statistical Software. A draft can be found at http://www.stat.ubc.ca/~harringt/birch/birch-jss.pdf.
Rousseeuw, P.J. and Van Driessen, K. (1999) “A Fast Algorithm for the Minimum Covariance Determinant Estimator”, Technometrics 41, 212–223.
birch, and the original algorithm covMcd
data(birchObj)
covOut <- covMcd.birch(birchObj, 0.5)
summary(covOut)

## If the original data set was available
## Not run:
refOut <- covMcdBirch.refinement(covOut, x, 0.5)