cov.mcd {robust}    R Documentation

Fast MCD Estimation

Description

Returns an object of class "mcd" containing estimates of the robust multivariate location, the robust covariance matrix, and optionally the robust correlation matrix. The cov.mcd function first computes the raw minimum covariance determinant (MCD) estimator of Rousseeuw (1984, 1985). The raw MCD estimate is then used to assign weights to the observations, and weighted estimates of location and covariance are returned as well.

Usage

cov.mcd(x, cor = FALSE, print.it = TRUE, quan = floor((n + p + 1)/2), ntrial = 500)

Arguments

x a vector, matrix, or data frame. Columns represent variables, rows represent observations. Missing values (NAs) and Infinite values (Infs) are allowed. Observations (rows) with missing or infinite values are automatically excluded from the computations.
cor a logical flag. If cor = TRUE then the estimated correlation matrix will be returned as well.
print.it a logical flag. If print.it = TRUE information about the method will be printed.
quan an integer value giving the number of observations whose covariance determinant will be minimized. The default quan is floor((n+p+1)/2), where n is the number of observations and p is the number of variables. Any quan between the default and n may be specified.
ntrial the number of random trial subsamples that are drawn for large datasets. The default is 500.
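
As a rough illustration of the x and quan arguments, the sketch below drops rows containing missing or infinite values and then evaluates the default quan formula. The helper default_quan is hypothetical and is not part of the package; it uses only base R and the built-in stackloss data.

```r
# Hypothetical helper (not part of the package): mimic the documented
# row filtering and the default value of quan.
default_quan <- function(x) {
  x <- as.matrix(x)
  # Keep only rows in which every entry is finite (drops NA, NaN, Inf, -Inf).
  keep <- apply(x, 1, function(row) all(is.finite(row)))
  x <- x[keep, , drop = FALSE]
  n <- nrow(x)
  p <- ncol(x)
  list(n = n, p = p, quan = floor((n + p + 1) / 2))
}

x <- as.matrix(stackloss)   # 21 observations, 4 variables
x[2, 1] <- NA               # introduce one missing value
x[5, 3] <- Inf              # introduce one infinite value
res <- default_quan(x)
res$n      # 19 rows remain
res$quan   # floor((19 + 4 + 1) / 2) = 12
```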

Details

Let n be the number of observations and p be the number of variables. The minimum covariance determinant estimate is given by the subset of quan observations whose covariance matrix has the smallest determinant. The MCD location estimate is then the mean of those quan points, and the MCD scatter estimate is their covariance matrix. The default value of quan is floor((n+p+1)/2), but the user may choose a larger number. For multivariate data sets, it takes too much time to find the exact estimate, so an approximation is computed. A full description of the present algorithm can be found in Rousseeuw and Van Driessen (1999). Major advantages of this algorithm are its precision and the fact that it can deal with very large n.
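
The definition above can be made concrete with an exhaustive search over all subsets of size quan. This is only a sketch of the exact estimate for a very small data set; it is not the approximation algorithm that cov.mcd uses, and the helper exact_mcd and the simulated data are hypothetical.

```r
# Hypothetical helper: exhaustive search over all subsets of size quan.
# Only feasible for very small n.
exact_mcd <- function(x, quan = floor((nrow(x) + ncol(x) + 1) / 2)) {
  x <- as.matrix(x)
  subsets <- combn(nrow(x), quan)   # one candidate subset per column
  dets <- apply(subsets, 2, function(idx) det(cov(x[idx, , drop = FALSE])))
  best <- subsets[, which.min(dets)]
  list(best = best,
       center = colMeans(x[best, , drop = FALSE]),   # mean of the best subset
       cov = cov(x[best, , drop = FALSE]),           # its covariance matrix
       objective = min(dets))                        # the minimal determinant
}

set.seed(1)
x <- rbind(matrix(rnorm(20), ncol = 2),              # 10 regular points
           matrix(rnorm(4, mean = 20), ncol = 2))    # 2 distant points
fit <- exact_mcd(x)   # n = 12, p = 2, so quan = floor(15 / 2) = 7
fit$best              # row indices of the selected subset
fit$center
```

With n = 12 and quan = 7 there are only 792 subsets to examine. The count grows combinatorially with n, which is why an approximation is needed in practice.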

Although the raw minimum covariance determinant estimate has a high breakdown value, its statistical efficiency is low. A better finite-sample efficiency can be attained while retaining the high breakdown value by computing a weighted mean and covariance estimate, with weights based on the MCD estimate. By default, cov.mcd returns both the raw MCD estimate and the weighted estimate.
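
A minimal sketch of the reweighting step follows. It assumes 0/1 weights obtained by comparing squared robust distances with the 0.975 quantile of a chi-square distribution with p degrees of freedom; this help page does not state the cutoff, so that choice is an assumption. No consistency or small-sample correction factor is applied. The helper reweight is hypothetical, and the raw estimate is a stand-in computed from a known clean subset rather than from an MCD search.

```r
# Hypothetical helper: one-step reweighting from raw robust estimates.
# Assumption: weight 1 if the squared distance is at most the chi-square
# quantile at 'level', and weight 0 otherwise.
reweight <- function(x, raw.center, raw.cov, level = 0.975) {
  x <- as.matrix(x)
  d2 <- mahalanobis(x, raw.center, raw.cov)   # squared robust distances
  w <- as.numeric(d2 <= qchisq(level, df = ncol(x)))
  keep <- w == 1
  list(weights = w,
       center = colMeans(x[keep, , drop = FALSE]),
       cov = cov(x[keep, , drop = FALSE]))
}

set.seed(42)
x <- rbind(matrix(rnorm(40), ncol = 2),              # 20 regular points
           matrix(rnorm(10, mean = 20), ncol = 2))   # 5 distant points
# Stand-in for the raw MCD estimate: mean and covariance of 14 clean rows
# (14 = floor((25 + 2 + 1) / 2), the default quan for these data).
raw.center <- colMeans(x[1:14, ])
raw.cov <- cov(x[1:14, ])
rw <- reweight(x, raw.center, raw.cov)
sum(rw$weights)   # number of observations that keep weight 1
rw$center
```

The weighted estimates use every observation that is not flagged, which is typically more than quan points. That is the source of the gain in efficiency.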

Multivariate outliers can be found by means of the robust distances, as described in Rousseeuw and Leroy (1987) and in Rousseeuw and van Zomeren (1990). These distances can be calculated by the function mahalanobis, and plotted by applying plot.mcd to an "mcd" object. It is suggested that the number of observations be at least five times the number of variables. When there are fewer observations than this, there is not enough information to determine whether outliers exist.
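
The sketch below computes robust distances with mahalanobis and flags large ones, and also checks the suggested ratio of observations to variables. The helper flag_outliers, the chi-square cutoff at 0.975, and the list fit are assumptions for illustration; fit is a hand-made stand-in with the center and cov components described in the Value section, not the output of cov.mcd.

```r
# Hypothetical helper: flag observations whose robust distance is large.
# 'fit' is any list with components 'center' and 'cov'.
flag_outliers <- function(x, fit, level = 0.975) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  if (n < 5 * p)
    warning("fewer than 5 observations per variable")
  rd <- sqrt(mahalanobis(x, fit$center, fit$cov))   # robust distances
  list(distance = rd, outlier = rd > sqrt(qchisq(level, df = p)))
}

x <- rbind(c(0, 0), c(1, 0), c(0, 1), c(-1, 0), c(0, -1),
           c(1, 1), c(-1, -1), c(1, -1), c(-1, 1), c(0.5, 0.5),
           c(6, 6))                                # last row is far away
fit <- list(center = c(0, 0), cov = diag(2))       # stand-in robust fit
res <- flag_outliers(x, fit)
which(res$outlier)   # 11
```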

An important advantage of the present algorithm is that it allows for exact fit situations, where more than quan observations lie on a hyperplane. Then the program still yields the MCD location and scatter matrix, the latter being singular (as it should be), as well as the equation of the hyperplane.

If the classical covariance matrix of the data is already singular, all observations lie on a hyperplane. Then cov.mcd will give a message and the equation of the hyperplane. The MCD estimates are then equal to the classical estimates. In this case, you will need to modify your data before applying cov.mcd, perhaps by using princomp and deleting columns with zero variance.
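
One way to check for this situation before calling cov.mcd is sketched below. Both helpers are hypothetical and use base R only. They handle only the simplest case, constant columns; exact linear dependence between non-constant columns would need a projection such as the one princomp provides.

```r
# Hypothetical helpers: detect a singular classical covariance matrix
# and drop columns with zero variance.
is_singular <- function(x) {
  x <- as.matrix(x)
  qr(cov(x))$rank < ncol(x)
}
drop_constant_columns <- function(x) {
  x <- as.matrix(x)
  v <- apply(x, 2, var)
  x[, v > 0, drop = FALSE]
}

x <- cbind(a = c(1, 2, 3, 4, 5, 6),
           b = c(2, 1, 4, 3, 6, 5),
           k = rep(7, 6))            # constant column
is_singular(x)                       # TRUE
x2 <- drop_constant_columns(x)
colnames(x2)                         # "a" "b"
is_singular(x2)                      # FALSE
```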

For univariate data sets, the exact algorithm location.lts is used. See the location.lts help file for more information.

Value

an object of class "mcd" with components:

call an image of the call that produced the object with all the arguments named.
method a character string that contains information about the method and about singular subsamples (if any).
quan the number of observations that have determined the minimum covariance determinant estimator. The default is floor((n+p+1)/2), where n is the number of observations and p the number of variables.
mcd.wt weights based on the estimated covariance matrix and the estimated location of the data.
X the input data.
raw.cov the raw MCD covariance matrix.
raw.center the raw MCD location of the data.
raw.objective the determinant of the raw MCD covariance matrix.
cov the robust covariance matrix obtained by reweighting. (If the raw MCD covariance matrix is singular, the raw estimate is returned here instead.)
cor the estimated correlation matrix for the data. This is only returned if cor = TRUE.
center the robust location estimate of the data, obtained by reweighting. (If the raw MCD covariance matrix is singular, the raw MCD location is returned here instead.)
n.obs the number of data observations (after rows with missing or infinite values have been removed).

Side Effects

If print.it = TRUE, information about the method is printed.

Background

The minimum covariance determinant estimator (Rousseeuw, 1985) has a breakdown value of roughly (n-quan)/n, which is about 50% for the default quan. That is, the estimate cannot be made arbitrarily bad without changing about half of the data. A covariance matrix is considered to be arbitrarily bad if some eigenvalue goes to infinity or to zero (singular matrix). This is analogous to a univariate scale estimate, which breaks down if the estimate is going either to infinity or to zero.
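
The approximate breakdown value (n-quan)/n is simple to evaluate. The sketch below uses a hypothetical helper breakdown to show that the value approaches 50% for the default quan as n grows, and that it drops when a larger quan is chosen.

```r
# Hypothetical helper: approximate breakdown value (n - quan) / n.
breakdown <- function(n, p, quan = floor((n + p + 1) / 2)) (n - quan) / n

breakdown(21, 4)              # (21 - 13) / 21, about 0.381
breakdown(1000, 4)            # (1000 - 502) / 1000 = 0.498
breakdown(21, 4, quan = 18)   # 3 / 21, about 0.143
```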

References

Rousseeuw, P. J. and Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41, 212-223.

Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association, 79, 871-880.

Rousseeuw, P. J. (1985). Multivariate estimation with high breakdown point. In Mathematical Statistics and Applications. W. Grossmann, G. Pflug, I. Vincze and W. Wertz, eds. Reidel: Dordrecht, 283-297.

Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. Wiley-Interscience, New York. [Chapter 7]

Rousseeuw, P. J. and van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85, 633-639.

See Also

covRob.

Examples

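  # Load the stack loss data supplied with the package
  data(stack.dat)
  # Compute the raw and reweighted MCD estimates with default settings
  cov.mcd(stack.dat)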

[Package robust version 0.2-0 Index]