cov.mcd {robust} | R Documentation |
Returns a list of class "mcd" containing estimates of the robust multivariate location, the robust covariance matrix, and optionally the robust correlation matrix. Specifically, the cov.mcd function first computes the raw minimum covariance determinant (MCD) estimator of Rousseeuw (1984, 1985). The raw MCD estimate is then used to assign weights to the observations, and weighted estimates of location and covariance are returned as well.
cov.mcd(x, cor = FALSE, print.it = TRUE, quan = floor((n + p + 1)/2), ntrial = 500)
x |
a vector, matrix, or data frame. Columns represent variables, rows represent observations. Missing values (NAs) and Infinite values (Infs) are allowed. Observations (rows) with missing or infinite values are automatically excluded from the computations. |
cor |
a logical flag. If cor = TRUE then the estimated correlation matrix will be returned as well. |
print.it |
a logical flag. If print.it = TRUE information about the method will be printed. |
quan |
an integer value giving the number of observations whose covariance determinant will be minimized. The default quan is floor((n+p+1)/2), where n is the number of observations and p is the number of variables. Any quan between the default and n may be specified. |
ntrial |
the number of random trial subsamples that are drawn for large datasets. The default is 500. |
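As a small illustration of the default for quan, the following base R sketch evaluates floor((n + p + 1)/2) and the admissible range for a concrete data set. The built-in stackloss data frame is used only as an assumed stand-in for the data; it is not part of this help page.

```r
# Default subset size for the MCD: quan = floor((n + p + 1) / 2).
# 'stackloss' (base R, 21 rows, 4 columns) is only a stand-in data set.
x <- as.matrix(stackloss)
n <- nrow(x)   # number of observations
p <- ncol(x)   # number of variables

quan.default <- floor((n + p + 1) / 2)
quan.default   # 13

# Any integer from the default up to n is an admissible value of quan.
quan.range <- c(lower = quan.default, upper = n)
quan.range
```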
Let n be the number of observations and p be the number of variables. The minimum covariance determinant estimate is given by the subset of quan observations whose covariance matrix has the smallest determinant. The MCD location estimate is then the mean of those quan points, and the MCD scatter estimate is their covariance matrix. The default value of quan is floor((n+p+1)/2), but the user may choose a larger number. For multivariate data sets, it takes too much time to find the exact estimate, so an approximation is computed. A full description of the present algorithm can be found in Rousseeuw and Van Driessen (1999). Major advantages of this algorithm are its precision and the fact that it can deal with very large n.
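The definition above can be made concrete with an exhaustive search on a very small sample. This is only a sketch in base R: it enumerates every subset of size quan, which is feasible for about ten observations, and it is not the approximate algorithm that cov.mcd uses. The simulated data and the variable names are assumptions made for the illustration.

```r
# Exhaustive-search MCD for a tiny data set (illustration only; this is
# NOT the fast approximate algorithm used for real data).
set.seed(1)
x <- rbind(matrix(rnorm(16), ncol = 2),            # 8 regular points
           matrix(rnorm(4, mean = 8), ncol = 2))   # 2 distant points
n <- nrow(x)
p <- ncol(x)
quan <- floor((n + p + 1) / 2)                     # default subset size

subsets <- combn(n, quan)                          # every subset of size quan
dets <- apply(subsets, 2,
              function(idx) det(cov(x[idx, , drop = FALSE])))

best <- subsets[, which.min(dets)]                 # subset with minimal determinant
raw.center <- colMeans(x[best, , drop = FALSE])    # MCD location: mean of the subset
raw.cov <- cov(x[best, , drop = FALSE])            # MCD scatter: covariance of the subset
raw.objective <- min(dets)                         # minimized determinant
```

The number of subsets grows combinatorially with n, which is why an approximation is needed for data sets of realistic size.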
Although the raw minimum covariance determinant estimate has a high breakdown value, its statistical efficiency is low. A better finite-sample efficiency can be attained while retaining the high breakdown value by computing a weighted mean and covariance estimate, with weights based on the MCD estimate. By default, cov.mcd returns both the raw MCD estimate and the weighted estimate.
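The reweighting idea can be sketched as follows. The helper reweight_sketch is hypothetical and is not part of the library. The 0/1 weights with a 97.5% chi-squared cutoff are an assumed convention, since this page does not state the weighting rule, and the consistency and small-sample corrections that a full implementation may apply are left out. The raw estimates are obtained here by exhaustive search on a seven-point sample, purely so that the snippet is self-contained.

```r
# Hypothetical helper: one reweighting step starting from raw MCD estimates.
# The chi-squared cutoff level is an assumption, not taken from this page.
reweight_sketch <- function(x, raw.center, raw.cov, level = 0.975) {
  d2 <- mahalanobis(x, raw.center, raw.cov)             # squared robust distances
  wt <- as.numeric(d2 <= qchisq(level, df = ncol(x)))   # 0/1 weights
  keep <- wt == 1
  list(mcd.wt = wt,
       center = colMeans(x[keep, , drop = FALSE]),      # reweighted location
       cov = cov(x[keep, , drop = FALSE]))              # reweighted scatter
}

# Six regular points and one distant point (row 7).
x <- rbind(c(-1, 0), c(0, 1), c(1, 0), c(0, -1),
           c(0.5, 0.5), c(-0.5, -0.5), c(9, 9))
quan <- floor((nrow(x) + ncol(x) + 1) / 2)              # 5

# Raw MCD by exhaustive search (stand-in for the real algorithm).
subsets <- combn(nrow(x), quan)
dets <- apply(subsets, 2, function(idx) det(cov(x[idx, ])))
best <- subsets[, which.min(dets)]

fit <- reweight_sketch(x, colMeans(x[best, ]), cov(x[best, ]))
fit$mcd.wt    # the distant point in row 7 receives weight 0
fit$center
fit$cov
```

Because the reweighted estimates may use more than quan observations, they typically recover some of the efficiency lost by the raw estimate.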
Multivariate outliers can be found by means of the robust distances, as described in Rousseeuw and Leroy (1987) and in Rousseeuw and Van Zomeren (1990). These distances can be calculated by the function mahalanobis, and plotted by applying plot.mcd on a "mcd" object. It is suggested that the number of observations be at least five times the number of variables. When there are fewer observations than this, there is not enough information to determine whether outliers exist.
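A minimal sketch of the distance computation follows. To keep it runnable in base R, the list fit is a hypothetical stand-in for an "mcd" object: only its center and cov components are mimicked, and they are computed from rows known to be regular in simulated data. The 97.5% chi-squared cutoff is an assumed convention rather than something stated on this page.

```r
set.seed(42)
clean <- matrix(rnorm(60), ncol = 2)                     # 30 regular observations
x <- rbind(clean, cbind(c(10, 11, 12), c(10, 11, 12)))   # rows 31 to 33 are planted outliers
p <- ncol(x)

# Rule of thumb from the text: at least five observations per variable.
enough.data <- nrow(x) >= 5 * p

# Hypothetical stand-in for a fitted "mcd" object (center and cov only).
fit <- list(center = colMeans(clean), cov = cov(clean))

rd2 <- mahalanobis(x, fit$center, fit$cov)    # squared robust distances
cd2 <- mahalanobis(x, colMeans(x), cov(x))    # squared classical distances

cutoff <- qchisq(0.975, df = p)               # assumed cutoff
flagged <- which(rd2 > cutoff)                # candidate outliers
flagged
```

The classical distances of the planted points are much smaller than their robust distances, because the outliers inflate the classical covariance matrix and pull the classical mean toward themselves.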
An important advantage of the present algorithm is that it allows for exact fit situations, where more than quan observations lie on a hyperplane. Then the program still yields the MCD location and scatter matrix, the latter being singular (as it should be), as well as the equation of the hyperplane.
If the classical covariance matrix of the data is already singular, all observations lie on a hyperplane. Then cov.mcd will give a message and the equation of the hyperplane. The MCD estimates are then equal to the classical estimates. In this case, you will need to modify your data before applying cov.mcd, perhaps by using princomp and deleting columns with zero variance.
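The singular case can be checked and repaired along these lines. This is a sketch under stated assumptions: the data are simulated so that all rows lie on one plane, the hyperplane is recovered from the eigenvector belonging to the zero eigenvalue, and prcomp (the SVD-based relative of princomp in base R) is used to drop the direction with zero variance. The tolerance 1e-8 is an arbitrary choice for the illustration.

```r
set.seed(7)
x1 <- rnorm(20)
x2 <- rnorm(20)
x <- cbind(x1, x2, x3 = 2 * x1 - x2 + 1)    # every row lies on one plane

# A zero eigenvalue of the classical covariance matrix signals singularity.
e <- eigen(cov(x), symmetric = TRUE)
e$values                                     # last value is numerically zero

a <- e$vectors[, 3]                          # normal vector of the hyperplane
b <- sum(a * colMeans(x))                    # hyperplane equation: sum(a * x) = b
resid <- drop(x %*% a) - b                   # numerically zero for every row

# Remove the degenerate direction before estimating a robust covariance.
pc <- prcomp(x)
keep <- pc$sdev > 1e-8 * pc$sdev[1]          # components with nonzero variance
z <- pc$x[, keep, drop = FALSE]              # 20 x 2 score matrix
det(cov(z))                                  # strictly positive
```

The reduced matrix z has a nonsingular classical covariance matrix, so it can be passed on for robust estimation.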
For univariate data sets, the exact algorithm location.lts is used. See the location.lts help file for more information.
an object of class "mcd" with components:
call |
an image of the call that produced the object with all the arguments named. |
method |
a character string that contains information about the method and about singular subsamples (if any). |
quan |
the number of observations that have determined the minimum covariance determinant estimator. The default is floor((n+p+1)/2), where n is the number of observations and p the number of variables. |
mcd.wt |
weights based on the estimated covariance matrix and the estimated location of the data. |
X |
the input data. |
raw.cov |
the raw MCD covariance matrix. |
raw.center |
the raw MCD location of the data. |
raw.objective |
the determinant of the raw MCD covariance matrix. |
cov |
the robust covariance matrix obtained by reweighting. (If the raw MCD is singular, it is given here.) |
cor |
the estimated correlation matrix for the data. This is only returned if cor = TRUE. |
center |
the robust location estimate of the data, obtained by reweighting. (If the raw MCD is singular, its center is given here.) |
n.obs |
the number of data observations (after any missing values have been removed). |
If print.it = TRUE, a message is printed.
The minimum covariance determinant estimator (Rousseeuw, 1985) has a breakdown value of roughly (n-quan)/n, which is about 50% for the default quan. That is, the estimate cannot be made arbitrarily bad without changing about half of the data. A covariance matrix is considered to be arbitrarily bad if some eigenvalue goes to infinity or to zero (singular matrix). This is analogous to a univariate scale estimate, which breaks down if the estimate is going either to infinity or to zero.
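The approximate breakdown value is simple arithmetic, sketched below. The function name breakdown_approx and the sample sizes are assumptions chosen for the illustration. The example shows that the value is close to 50% for the default quan when n is large relative to p, is noticeably lower for small n, and drops to zero when quan equals n.

```r
# Approximate breakdown value (n - quan) / n of the MCD estimator.
breakdown_approx <- function(n, quan) (n - quan) / n

default_quan <- function(n, p) floor((n + p + 1) / 2)

# Large sample: the default quan gives a value close to one half.
q.large <- default_quan(1000, 5)          # 503
bd.large <- breakdown_approx(1000, q.large)   # 0.497

# Small sample: the same formula gives a visibly smaller value.
q.small <- default_quan(21, 4)            # 13
bd.small <- breakdown_approx(21, q.small)

# Choosing quan = n uses all observations and gives breakdown value 0.
bd.none <- breakdown_approx(21, 21)
```

Raising quan above the default trades breakdown value for efficiency, so a larger quan is reasonable only when the expected fraction of contamination is small.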
Rousseeuw, P. J. and Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41, 212-223.
Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association, 79, 871-881.
Rousseeuw, P. J. (1985). Multivariate estimation with high breakdown point. In Mathematical Statistics and Applications. W. Grossmann, G. Pflug, I. Vincze and W. Wertz, eds. Reidel: Dordrecht, 283-297.
Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection. Wiley-Interscience, New York. [Chapter 7]
Rousseeuw, P. J. and van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85, 633-639.
data(stack.dat)
cov.mcd(stack.dat)