valid {integrOmics} | R Documentation |
Function to estimate the root mean squared error of prediction (RMSEP) and the Q2 criterion for PLS (classic, regression and invariant modes) and sPLS (regression mode). M-fold and leave-one-out cross-validation are implemented.
valid(X, Y, ncomp = 3,
      mode = c("regression", "invariant", "classic"),
      max.iter = 500, tol = 1e-06,
      criterion = c("rmsep", "q2"),
      method = c("pls", "spls"),
      keepX = if(method == "pls") NULL else c(rep(ncol(X), ncomp)),
      keepY = if(method == "pls") NULL else c(rep(ncol(Y), ncomp)),
      scaleY = TRUE,
      validation = c("loo", "Mfold"),
      M = if(validation == 'Mfold') 10 else nrow(X))
X |
numeric matrix of predictors. NAs are allowed. |
Y |
numeric vector or matrix of responses (for multi-response models). NAs are allowed. |
ncomp |
the number of components to include in the model. Default is 3. |
mode |
character string. What type of algorithm to use, matching one of "regression", "invariant" or "classic". |
max.iter |
integer, the maximum number of iterations. |
tol |
a non-negative real number, the tolerance used in the iterative algorithm. |
criterion |
character string. What type of validation criterion to use, see details. |
method |
character string. The method to use, either "pls" or "spls". |
keepX |
if method = "spls", a numeric vector of length ncomp giving the number of variable
weights to keep in the X-loadings. By default all variables are kept in the model. |
keepY |
if method = "spls", a numeric vector of length ncomp giving the number of variable
weights to keep in the Y-loadings. By default all variables are kept in the model. |
scaleY |
should the Y data be scaled? In the case of a 'discriminant' version of (s)PLS,
where the Y data are of discrete type, this should be set to FALSE. |
validation |
character. What kind of (internal) validation to use. See below. |
M |
the number of folds in the Mfold cross-validation. |
If validation = "Mfold", M-fold cross-validation is performed. The number of folds is set with M.
If validation = "loo", leave-one-out cross-validation is performed.
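The two validation schemes differ only in how samples are assigned to folds. The following is a minimal base R sketch of that assignment, not the package's own code: make_folds is a hypothetical helper introduced here for illustration, and the balanced random split is an assumption about how M-fold partitions are typically drawn.

```r
# Hypothetical helper (not part of integrOmics): assign n samples to folds.
make_folds <- function(n, validation = c("loo", "Mfold"), M = 10) {
  validation <- match.arg(validation)
  if (validation == "loo") {
    # Leave-one-out: every sample is its own fold, so there are n folds.
    return(seq_len(n))
  }
  # M-fold: shuffle balanced fold labels 1..M across the n samples.
  sample(rep(seq_len(M), length.out = n))
}

set.seed(1)                                  # reproducible random split
folds_loo <- make_folds(20, "loo")
folds_m   <- make_folds(20, "Mfold", M = 5)
table(folds_m)                               # fold sizes for the 5-fold split
```

With 20 samples and M = 5 every fold holds 4 samples. With "loo" the number of folds equals the number of samples, which matches the default M = nrow(X) in the usage.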
The validation criterion "rmsep" assesses the predictive validity of the model (using leave-one-out or M-fold cross-validation). It returns the estimated prediction error of the PLS or sPLS model. The "q2" criterion helps in choosing the number of (s)PLS dimensions. Note that only the classic, regression and invariant modes can be applied.
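To make the idea of a cross-validated prediction error concrete, here is a small base R sketch of leave-one-out RMSEP. It rests on two assumptions that are not taken from this page: the model is ordinary least squares fitted with lm() as a stand-in for PLS or sPLS, and RMSEP is taken to be the square root of the mean squared prediction error. The mtcars data set ships with base R and is used only for illustration.

```r
# Stand-in model: ordinary least squares via lm(), NOT PLS or sPLS.
data(mtcars)
n <- nrow(mtcars)
y_hat_cv <- numeric(n)

for (i in seq_len(n)) {
  # Fit on all samples except i, then predict the held-out sample i.
  fit <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  y_hat_cv[i] <- predict(fit, newdata = mtcars[i, , drop = FALSE])
}

# Assumed definition: square root of the mean squared prediction error.
rmsep <- sqrt(mean((mtcars$mpg - y_hat_cv)^2))

# In-sample error of the same model, for comparison.
rmse_fit <- sqrt(mean(residuals(lm(mpg ~ wt + hp, data = mtcars))^2))
c(rmsep = rmsep, rmse_fit = rmse_fit)
```

The cross-validated error is larger than the in-sample error because each prediction is made for a sample the model has not seen, which is why it is the more honest measure of predictive validity.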
What follows is the definition of these criteria:
Let n be the number of individuals (experimental units). The fraction of the variation of a variable y_{k} that can be predicted by a component, as estimated by cross-validation, is computed as:
Q_{kh}^2 = 1 - \frac{PRESS_{kh}}{RSS_{k(h-1)}}
where
PRESS_{kh} = \sum_{i=1}^{n}(y_{ik} - \hat{y}_{(-i)k}^h)^2
is the PRediction Error Sum of Squares and
RSS_{kh} = \sum_{i=1}^{n}(y_{ik} - \hat{y}_{ik}^h)^2
is the Residual Sum of Squares for the variable k, (k=1, ... ,q) and the PLS variate h, (h=1, ... ,H). For h=0, RSS_{kh} = n-1.
The fraction of the total variation of Y that can be predicted by a component, as estimated by cross-validation, is computed as:
Q_h^2 = 1 - \frac{\sum_{k=1}^{q}PRESS_{kh}}{\sum_{k=1}^{q}RSS_{k(h-1)}}
The cumulative (Q_{cum}^2)_{kh} of a variable is computed as:
(Q_{cum}^2)_{kh} = 1 - \prod_{j=1}^{h}\frac{PRESS_{kj}}{RSS_{k(j-1)}}
and the cumulative (Q_{cum}^2)_h for the extracted components is computed as:
(Q_{cum}^2)_h = 1 - \prod_{j=1}^{h}\frac{\sum_{k=1}^{q}PRESS_{kj}}{\sum_{k=1}^{q}RSS_{k(j-1)}}
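The four Q2 quantities defined above can be computed directly from a PRESS matrix and an RSS matrix. The sketch below does this in base R under stated assumptions: the numbers in PRESS and RSS are made up for illustration (q = 2 responses in rows, H = 3 components in columns), n = 20 is arbitrary, and the variable names are hypothetical rather than the names used inside the package.

```r
# Hypothetical inputs: q = 2 responses (rows), H = 3 components (columns).
n <- 20
PRESS <- matrix(c(10, 8,  9,
                  12, 9, 10), nrow = 2, byrow = TRUE)
RSS   <- matrix(c( 8, 6, 5,
                   9, 7, 6), nrow = 2, byrow = TRUE)

# Column h of RSS_prev holds RSS_{k(h-1)}; for h = 0 the RSS is n - 1.
RSS_prev <- cbind(n - 1, RSS[, -ncol(RSS), drop = FALSE])

# Per-response and overall Q2 for each component.
Q2_kh <- 1 - PRESS / RSS_prev
Q2_h  <- 1 - colSums(PRESS) / colSums(RSS_prev)

# Cumulative versions: one minus the running product of the ratios.
Q2cum_kh <- 1 - t(apply(PRESS / RSS_prev, 1, cumprod))
Q2cum_h  <- 1 - cumprod(colSums(PRESS) / colSums(RSS_prev))
```

For the first component the cumulative and non-cumulative values coincide, since the product contains a single ratio. Here Q2_kh[1, 1] is 1 - 10/19 and Q2_h[1] is 1 - 22/38.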
valid produces a list with the following components:
Y.hat |
the predicted values using cross-validation |
fold |
indicates which fold each sample belongs to when using M-fold cross-validation |
rmsep |
if criterion = "rmsep", the root mean squared error of prediction for each Y variable |
RSS |
if criterion = "q2", a matrix of RSS values of the Y-variables for models
with 1, ... ,ncomp components. |
PRESS |
if criterion = "q2", the prediction error sum of squares of the Y-variables:
a matrix of PRESS values for models with 1, ... ,ncomp components. |
q2 |
if criterion = "q2", a vector of Q^2 values for the extracted components. |
Sébastien Déjean, Ignacio González and Kim-Anh Lê Cao.
Tenenhaus, M. (1998). La régression PLS: théorie et pratique. Paris: Editions Technic.
Lê Cao, K.-A., Rossouw, D., Robert-Granié, C. and Besse, P. (2008). A sparse PLS for variable selection when integrating Omics data. Statistical Applications in Genetics and Molecular Biology 7, article 35.
data(linnerud)
X <- linnerud$exercise
Y <- linnerud$physiological

## computing the RMSEP with 10-fold CV with pls
error <- valid(X, Y, mode = "regression", ncomp = 3, method = "pls",
               validation = "Mfold", criterion = "rmsep")
error$rmsep