step.plr {stepPlr}                                        R Documentation

Forward stepwise selection procedure for penalized logistic regression

Description

This function fits a series of L2 penalized logistic regression models, selecting variables through a forward stepwise selection procedure.

Usage

  step.plr(x, y, weights = rep(1,length(y)), fix.subset = NULL,
           level = NULL, lambda = 1e-4, cp = "bic",
           max.terms = 5, type = c("both", "forward"), trace = FALSE)  

Arguments

x matrix of features
y binary response
weights an optional vector of weights for observations
fix.subset a vector of indices for the variables that are forced to be in the model
level a list of length ncol(x). The j-th element corresponds to the j-th column of x. If the j-th column of x is discrete, level[[j]] is the set of levels for the categorical factor. If the j-th column of x is continuous, level[[j]] = NULL. level is automatically generated in the function; however, if any levels of the categorical factors are not observed, but still need to be included in the model, then the user must provide the complete sets of the levels through level. If a numeric column needs to be considered discrete, it can be done by manually providing level as well.
lambda regularization parameter for the L2 norm of the coefficients. The criterion minimized in plr is -log-likelihood+λ*|β|^2. Default is lambda=1e-4.
cp complexity parameter to be used when computing the score. score=deviance+cp*df. If cp="aic" or cp="bic", these are converted to cp=2 and cp=log(sample size), respectively. Default is cp="bic".
max.terms the maximum number of terms to be added in the forward selection procedure. Default is max.terms=5.
type If type="both", the forward selection is followed by a backward deletion. If type="forward", only a forward selection is done. Default is "both".
trace If TRUE, the variable selection procedure prints out its progress.
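
As a rough illustration of how the cp argument is described above, the following base R sketch resolves "aic" and "bic" to numeric penalties and computes score=deviance+cp*df. The helper names resolve_cp and plr_score are hypothetical and are not part of stepPlr; df is taken to be the model degrees of freedom.

```r
# Hypothetical helpers (not part of stepPlr): resolve 'cp' as the argument
# description states, then compute score = deviance + cp * df.
resolve_cp <- function(cp, n) {
  if (is.character(cp)) {
    cp <- match.arg(cp, c("aic", "bic"))
    if (cp == "aic") 2 else log(n)   # "aic" -> 2, "bic" -> log(sample size)
  } else {
    cp                               # a numeric cp is used as given
  }
}

plr_score <- function(deviance, df, cp, n) {
  deviance + resolve_cp(cp, n) * df
}

resolve_cp("aic", n = 100)                          # 2
resolve_cp("bic", n = 100)                          # log(100)
plr_score(deviance = 120, df = 3, cp = 4, n = 100)  # 120 + 4 * 3 = 132
```

With n=100, cp="bic" penalizes each degree of freedom by log(100), roughly 4.6, so it favors smaller models than cp="aic".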

Details

This function implements L2 penalized logistic regression together with a stepwise variable selection procedure, as described in Park and Hastie (2006), "Penalized Logistic Regression for Detecting Gene Interactions".
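
To make the penalized criterion concrete, here is a minimal base R sketch that minimizes -log-likelihood+λ*|β|^2 with optim(). It is not the solver used by stepPlr, and it assumes the intercept is left unpenalized; the function name penalized_nll and the simulated data are illustrative only.

```r
# Sketch only: minimize -loglik + lambda * ||beta||^2 with base R's optim().
# Assumption: the intercept (first coefficient) is not penalized.
penalized_nll <- function(beta, X, y, lambda) {
  eta <- drop(X %*% beta)
  # Negative log-likelihood for 0/1 responses
  nll <- sum(log1p(exp(eta)) - y * eta)
  nll + lambda * sum(beta[-1]^2)
}

set.seed(1)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))   # first column is the intercept
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1, 0)))

fit_small <- optim(rep(0, ncol(X)), penalized_nll, X = X, y = y,
                   lambda = 1e-4, method = "BFGS")
fit_large <- optim(rep(0, ncol(X)), penalized_nll, X = X, y = y,
                   lambda = 10, method = "BFGS")

# A larger lambda shrinks the non-intercept coefficients toward zero.
sum(fit_small$par[-1]^2)
sum(fit_large$par[-1]^2)
```

With a value as small as the default lambda=1e-4, the penalty mainly stabilizes the fit and leaves the coefficients close to their unpenalized values.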

If type="forward", max.terms terms are sequentially added to the model, and the model that minimizes score is selected as the optimal fit. If type="both", a backward deletion is done in addition, which provides a second series of models with different combinations of the selected terms. The optimal model, the one minimizing score, is then chosen from this second series.
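
The forward pass can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: it uses unpenalized glm() as a stand-in for plr, adds whole columns rather than interaction terms, and omits the backward deletion. The function forward_select and its return value are hypothetical.

```r
# Sketch only: forward selection scored by deviance + cp * df, using
# unpenalized glm() as a stand-in for plr. 'x' must be a data frame.
forward_select <- function(x, y, cp = log(length(y)), max.terms = 5) {
  selected <- integer(0)
  path <- list()
  for (step in seq_len(min(max.terms, ncol(x)))) {
    candidates <- setdiff(seq_len(ncol(x)), selected)
    # Deviance of the model obtained by adding each remaining column
    dev <- sapply(candidates, function(j) {
      deviance(glm(y ~ ., data = x[, c(selected, j), drop = FALSE],
                   family = binomial))
    })
    selected <- c(selected, candidates[which.min(dev)])
    fit <- glm(y ~ ., data = x[, selected, drop = FALSE], family = binomial)
    df <- length(coef(fit))            # model degrees of freedom
    path[[step]] <- list(terms = selected,
                         score = deviance(fit) + cp * df)
  }
  scores <- sapply(path, function(m) m$score)
  path[[which.min(scores)]]            # model on the path with the smallest score
}

set.seed(2)
n <- 200
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- rbinom(n, 1, plogis(2 * x$x1))
best <- forward_select(x, y, max.terms = 3)
names(x)[best$terms]   # selected columns, in order of entry
best$score
```

Every model on the forward path is scored, so the returned model may contain fewer than max.terms terms when the penalty cp*df outweighs the drop in deviance.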

We thank Michael Saunders of SOL, Stanford University for providing the solver used for the convex optimization in this function.

Value

A stepplr object is returned, to which the anova, predict, print, and summary methods can be applied. It has the following components:

fit a plr object for the optimal model selected
action a list that stores the selection order of the terms in the optimal model.
action.name a list of the names of the sequentially added terms, in the same order as in action
deviance deviance of the fitted model
df residual degrees of freedom of the fitted model
score deviance + cp*df, where df is the model degrees of freedom
group a vector of the counts for the dummy variables, to be used in predict.stepplr
y response variable used
weight weights used
fix.subset fix.subset used
level level used
lambda lambda used
cp complexity parameter used when computing the score
type type used
xnames column names of x

Author(s)

Mee Young Park and Trevor Hastie

References

Mee Young Park and Trevor Hastie (2006) Penalized Logistic Regression for Detecting Gene Interactions - available at the authors' websites, http://stat.stanford.edu/~mypark or http://stat.stanford.edu/~hastie/pub.htm.

See Also

cv.step.plr, plr, predict.stepplr

Examples

n <- 100

p <- 3
z <- matrix(sample(seq(3),n*p,replace=TRUE),nrow=n)
x <- data.frame(x1=factor(z[ ,1]),x2=factor(z[ ,2]),x3=factor(z[ ,3]))
y <- sample(c(0,1),n,replace=TRUE)
fit <- step.plr(x,y)
# 'level' is automatically generated. Check 'fit$level'.

p <- 5
x <- matrix(sample(seq(3),n*p,replace=TRUE),nrow=n)
x <- cbind(rnorm(n),x)
y <- sample(c(0,1),n,replace=TRUE)
level <- vector("list",length=6)
for (i in 2:6) level[[i]] <- seq(3)
fit1 <- step.plr(x,y,level=level,cp="aic")
fit2 <- step.plr(x,y,level=level,cp=4)
fit3 <- step.plr(x,y,level=level,type="forward")
fit4 <- step.plr(x,y,level=level,max.terms=10)
# This is an example in which 'level' was input manually.
# level[[1]] should be either 'NULL' or 'NA' since the first factor is continuous.

[Package stepPlr version 0.91 Index]